<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: gergelycsegzi</title><link>https://news.ycombinator.com/user?id=gergelycsegzi</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 03 Jul 2026 05:54:46 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=gergelycsegzi" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>Re 1 - that is a very kind offer! Our current public template library is very limited, so let me come back to you on this.<p>2. We see exactly the same thing. There is a trade-off in correctness vs token burning. However, some tokens (models) are cheaper and faster than others, so the small pieces can benefit from that. The token usage is also surprisingly variable, because it depends on the information density of the document and also on the information density of the question (e.g. is it a single needle in a haystack or are we analyzing the entire haystack from 10 perspectives). So the parsing for 1k pages may be on the order of millions of tokens, while a series of queries (extractions) on top of it could be 1-2 orders of magnitude more.</p>
]]></description><pubDate>Thu, 02 Jul 2026 12:59:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=48760851</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48760851</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48760851</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>Yes, we do it by having multiple stages in the pipeline. First we extract the independent data points (from, say, both page 4 and page 40), and then a second pass establishes the relationships between them (we call this resolution).<p>On the scale aspect, because we go in multiple passes, we break the scope into small enough pieces and then build it back up in a later step. Iirc the largest document I've seen a customer use was over 1k pages.<p>There are more complex data dependency scenarios where we find that the data that's extracted and combined (e.g. from page 4 and 40) then needs to be further transformed in different ways (e.g. having an evaluation and a clarification outcome at the end). To keep these aligned in value we are soon releasing a feature for what we call derived agents.</p>
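<p>A minimal sketch of the two-pass idea described above, purely illustrative and not Parsewise's actual implementation. The names DataPoint, extract_pass and resolve_pass, the sample field names, and the notion that resolution groups values by field and flags disagreements are all assumptions made for the example; in a real pipeline the first pass would be a model call per page rather than a dictionary lookup.</p>

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    """One value found on one page, kept together with its source page."""
    field: str
    value: str
    page: int


def extract_pass(pages):
    """Pass 1: look at each page in isolation (stand-in for a model call).

    Assumes `pages` maps a page number to a dict of field name to value.
    """
    points = []
    for page_number, fields in pages.items():
        for field, value in fields.items():
            points.append(DataPoint(field, value, page_number))
    return points


def resolve_pass(points):
    """Pass 2: relate data points across pages and flag disagreements."""
    by_field = defaultdict(list)
    for point in points:
        by_field[point.field].append(point)

    resolved = {}
    for field, group in by_field.items():
        values = {p.value for p in group}
        resolved[field] = {
            "values": sorted(values),
            "pages": sorted(p.page for p in group),
            "conflict": len(values) != 1,
        }
    return resolved


# Hypothetical input: the same field appears on page 4 and page 40.
pages = {
    4: {"loan_amount": "250000"},
    40: {"loan_amount": "255000", "applicant": "Jane Doe"},
}
result = resolve_pass(extract_pass(pages))
print(result["loan_amount"])
```

<p>Keeping the page number on every data point is what lets the second pass report where each conflicting value came from.</p>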
]]></description><pubDate>Thu, 02 Jul 2026 07:09:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=48757623</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48757623</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48757623</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>This does indeed look really interesting. We have deterministic validations (and some deterministic Excel transformations), but using more deterministic transformations for text based on traditional NLP would be a nice complement.</p>
]]></description><pubDate>Thu, 02 Jul 2026 06:57:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=48757534</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48757534</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48757534</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>If Claude is good enough for your use case then for sure. If you need scale, persistent structure and verifiability we can help:)</p>
]]></description><pubDate>Wed, 01 Jul 2026 23:15:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=48754382</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48754382</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48754382</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>Haha thanks, the reader can try and guess which is which;)<p>We actually don't use embeddings or vector similarity, since those tend not to work well in specialist domains (e.g. for the OfficeQA benchmark, where we have 90k pages talking about US treasury numbers, they would mostly be mapped to a very small region of the embedding space because it's all the same topic, with small variations across years, expense categories etc.).<p>We use LLMs for the extraction and comparison as well, and we route between different models depending on how much comprehension a given step requires (and by this I mean routing between our pipeline steps; we currently do not dynamically try to judge individual cases for complexity like OpenRouter Fusion).</p>
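<p>To illustrate what routing per pipeline step (rather than per individual case) can look like, here is a small sketch. The step names and model names are hypothetical placeholders, not the ones we actually use; the only point is that the model is chosen from the step alone and never from the content of the input.</p>

```python
# Hypothetical step and model names, for illustration only.
MODEL_BY_STEP = {
    "parse": "small-fast-model",
    "extract": "mid-size-model",
    "resolve": "large-reasoning-model",
}
DEFAULT_MODEL = "mid-size-model"


def pick_model(step):
    """Static routing: the choice depends only on the pipeline step."""
    return MODEL_BY_STEP.get(step, DEFAULT_MODEL)


for step in ("parse", "extract", "resolve"):
    print(step, pick_model(step))
```

<p>A dynamic router would instead inspect each input and estimate its difficulty before choosing a model, which is the part this sketch deliberately leaves out.</p>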
]]></description><pubDate>Wed, 01 Jul 2026 22:08:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=48753778</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48753778</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48753778</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>I can see why; it's tempting to go for full automation. The reason we go for fine-grained sourcing is so that people can build their awareness quickly. Plus, many of our customers work in regulated industries where full automation is prohibited.</p>
]]></description><pubDate>Wed, 01 Jul 2026 21:12:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=48753186</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48753186</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48753186</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>Potentially, but at that scale cost and latency may actually become an issue, so probably better to consider some sort of indexing or keyword searching.</p>
]]></description><pubDate>Wed, 01 Jul 2026 20:29:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=48752690</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48752690</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48752690</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>100%, the really hard challenge is that the intermediate representation (i.e. the Parquet equivalent) depends on the given use case. So what we do with the platform is have the users configure the intermediate layer that serves most of their queries, and if they need to extend it we will suggest an extension for them. For example, for the demo on the grounded reasoning benchmark I referred to, here is the intermediate layer that the agents can query more efficiently: <a href="https://demo.parsewise.ai/projects/39bee9d8-d722-4b23-8894-ed44ec18278f/results?view=table&t_f=164d-3a4b&t_s=164d-3a4b%3A50">https://demo.parsewise.ai/projects/39bee9d8-d722-4b23-8894-e...</a></p>
]]></description><pubDate>Wed, 01 Jul 2026 18:35:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=48751327</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48751327</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48751327</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>I'll need to check it out!<p>We had the same observation: the possible space is almost endless, and even for the same file type different kinds of processing may be required (e.g. an Excel file can be database-style, small and narrative-heavy, or both).<p>We have baked in some ground processing rules for different kinds of documents, and we do allow custom instructions on how to deal with specific cases (e.g. translations, particular format layouts). The best write-up I have at the moment is <a href="https://www.parsewise.ai/doc-processing-pipelines">https://www.parsewise.ai/doc-processing-pipelines</a> but we're working on something that goes into more detail:)</p>
]]></description><pubDate>Wed, 01 Jul 2026 18:32:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=48751285</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48751285</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48751285</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>We were also surprised at first. The reason the models don't do so well is that they need to find information across 90k pages. When they are pointed to the right location they tend to do much better. And with these treasury documents grepping / keyword searching is almost impossible because everything appears thousands of times.<p>And thank you, we also love the traceability, it's one of the aspects that we have prioritized. Models will never be perfect, so rather than building the best model harness we went for the best human harness haha.<p>Tbh it's been a while since I've looked at NotebookLM, so I expect it has gotten better over time. One thing I found lacking in the past was the structure we could get out (which gives the traceability) - for example, here is a deep dive on some of the underlying data for this corpus: <a href="https://demo.parsewise.ai/projects/39bee9d8-d722-4b23-8894-ed44ec18278f/results?view=table&t_f=164d-3a4b&t_s=164d-3a4b%3A50">https://demo.parsewise.ai/projects/39bee9d8-d722-4b23-8894-e...</a><p>And yes, we're really excited whenever new open weights models come out that push quality, price, latency. We're finding that throughput is a big obstacle so I'm looking forward to more of this running locally, but it will be a while..</p>
]]></description><pubDate>Wed, 01 Jul 2026 18:14:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=48751018</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48751018</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48751018</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>Hey, that's exactly it!</p>
]]></description><pubDate>Wed, 01 Jul 2026 17:38:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=48750490</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48750490</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48750490</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>Fully agree, that's why we quite like the Databricks OfficeQA benchmark. It made us experts on historical US treasuries haha.
Some screenshots in here: <a href="https://www.parsewise.ai/officeqa-sota">https://www.parsewise.ai/officeqa-sota</a></p>
]]></description><pubDate>Wed, 01 Jul 2026 17:25:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=48750309</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48750309</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48750309</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>Similar to my other comment, we assume that LlamaParse and others can provide the individual page OCR. But once you have that, integrating it into your workflows often requires additional complexity around combining results from different sources. Here is a deeper dive I wrote on the complexities of building extraction pipelines: <a href="https://www.parsewise.ai/doc-processing-pipelines">https://www.parsewise.ai/doc-processing-pipelines</a></p>
]]></description><pubDate>Wed, 01 Jul 2026 17:08:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=48750064</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48750064</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48750064</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>Hey, good point about structure for integrated workflows:)<p>Fully agree, for enterprises we need to guarantee types, flag discrepancies and provide underlying sources so they can integrate it downstream (whether that's Databricks, n8n etc.)<p>Here is our documentation for working with a fixed JSON schema: <a href="https://docs.parsewise.ai/api#schema-driven-extract-convenience-endpoint">https://docs.parsewise.ai/api#schema-driven-extract-convenie...</a></p>
]]></description><pubDate>Wed, 01 Jul 2026 17:04:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=48749984</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48749984</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48749984</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>In practice we find that each domain (and even each organisation) ends up having highly customized definitions.<p>At first, fairly generic templated definitions sort of work, but what we've seen is that over time data comes up that is out of distribution, with no explicit instruction on how to deal with it. In such cases we tend to flag this and offer suggestions to the users on how they can improve the specificity of agents.<p>Another pattern we have seen play out is a manager reviewing ratings and feedback comments from their team and updating the definitions accordingly over time (we also let them see before-and-after results side-by-side for all existing data, so they are more confident in the change before committing).<p>The amount of work depends on how good the initial definitions are and how complex the use case is (and how much it evolves - new data sources etc). A bit of an unsatisfying answer, but it can be anywhere between a few hours as a one-off and a couple of minutes per day on an ongoing basis.</p>
]]></description><pubDate>Wed, 01 Jul 2026 16:40:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=48749608</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48749608</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48749608</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>Haha no appreciate it! That's on me for not calling it out explicitly (was trying to make the video as short as possible), but the demo UIs were literally vibe coded to show the ease of integration <a href="https://youtu.be/F1cSuZal03s?si=1H4zTcO-8cosLbVr&t=70" rel="nofollow">https://youtu.be/F1cSuZal03s?si=1H4zTcO-8cosLbVr&t=70</a></p>
]]></description><pubDate>Wed, 01 Jul 2026 16:01:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=48749021</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48749021</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48749021</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>Planning to serve good things for sure, and appreciate your note.
Ofc I didn't agree with everything Palantir was doing (also to the extent that we even knew about them at the time). I was working on vaccine distribution and cancer research as well, so definitely felt like helping.</p>
]]></description><pubDate>Wed, 01 Jul 2026 15:16:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=48748259</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48748259</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48748259</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>Great question!<p>1. We are working with the assumption that OCR is (or soon will be) solved at super low prices.<p>So if we have the extracted data, what can we do with it?
Where we see Parsewise making a difference is for use cases that span across documents.
I.e. if you are extracting the same 5 fields from every invoice, there are lots of solutions as you listed (+ Reducto etc). However, once you have a set of documents (e.g. an entire mortgage application package) and you are trying to get a structured response out, then your options are either an LLM API (if things fit into context and you are okay with limited citations) or building a pipeline with LLMs. I posted it in another comment, but an example of trawling through 90k pages is here: <a href="https://www.parsewise.ai/officeqa-sota">https://www.parsewise.ai/officeqa-sota</a><p>2. As long as we rely on LLMs, the outcomes will be non-deterministic, so the bottleneck is and will remain human verification (at least for somewhat complex use cases). The architecture we have built optimizes for the human reviewer by providing values and citations that are as granular as possible. This is available either through our platform or through API clients.</p>
]]></description><pubDate>Wed, 01 Jul 2026 15:06:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=48748076</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48748076</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48748076</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>I learnt a lot at Palantir, though I always worked on the commercial side, so no ties to the security state (for better or worse).
(Also, as a side note, we are working towards enabling frontier performance with smaller open models, which allows our customers to protect their data. <a href="https://www.parsewise.ai/officeqa-sota">https://www.parsewise.ai/officeqa-sota</a> )<p>And I do get genuine joy from helping our users, so love it is:)</p>
]]></description><pubDate>Wed, 01 Jul 2026 14:53:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=48747853</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48747853</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48747853</guid></item><item><title><![CDATA[New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"]]></title><description><![CDATA[
<p>"That is a great catch!"</p>
]]></description><pubDate>Wed, 01 Jul 2026 14:46:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=48747731</link><dc:creator>gergelycsegzi</dc:creator><comments>https://news.ycombinator.com/item?id=48747731</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48747731</guid></item></channel></rss>