Hacker News: gergelycsegzi

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Thu, 02 Jul 2026 12:59:31 +0000

Re 1 - that is a very kind offer! Our current public template library is very limited, so let me come back to you on this.

2. We see exactly the same thing. There is a trade-off in correctness vs token burning. However, some tokens (models) are cheaper and faster than others, so the small pieces can benefit from that. The token usage is also surprisingly variable, because it depends on the information density of the document and also on the information density of the question (e.g. is it a single needle in a haystack or are we analyzing the entire haystack from 10 perspectives). So the parsing for 1k pages may be on the order of millions of tokens, while a series of queries (extractions) on top of it could be 1-2 orders of magnitude more.

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Thu, 02 Jul 2026 07:09:32 +0000

Yes, we do it by having multiple stages in the pipeline. First we extract the independent data points (from, say, both page 4 and page 40), and a second pass establishes the relationships between them (we call this resolution).

On the scale aspect, because we go in multiple passes, we break the scope into small enough pieces and then build it back up in a later step. Iirc the largest document I've seen a customer use was over 1k pages.
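A minimal sketch of the two-pass idea described above, not the actual Parsewise pipeline. All names here (`DataPoint`, `extract`, `resolve`) are hypothetical, and the first pass uses a trivial "entity: value" line parser as a stand-in for what would be an LLM call per page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    entity: str  # what the value is about, e.g. "loan_amount"
    value: str   # the value as found on the page
    page: int    # page the value was extracted from


def extract(pages: dict[int, str]) -> list[DataPoint]:
    """Pass 1: process each page independently (stand-in for an LLM call)."""
    points = []
    for page_no, text in sorted(pages.items()):
        for line in text.splitlines():
            if ":" in line:
                entity, value = line.split(":", 1)
                points.append(DataPoint(entity.strip(), value.strip(), page_no))
    return points


def resolve(points: list[DataPoint]) -> dict[str, list[DataPoint]]:
    """Pass 2: group data points that refer to the same entity."""
    grouped = defaultdict(list)
    for point in points:
        grouped[point.entity].append(point)
    return dict(grouped)


# Two pages that mention the same entity, far apart in the document.
pages = {4: "loan_amount: 250000", 40: "loan_amount: 255000\nborrower: A. Smith"}
resolved = resolve(extract(pages))
```

Because the first pass never looks beyond a single page, each unit of work stays small no matter how long the document is; only the second pass needs to see data points from several pages together.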

There are more complex data dependency scenarios where the data that's extracted and combined (e.g. from page 4 and 40) then needs to be further transformed in different ways (e.g. having an evaluation and a clarification outcome at the end). To keep these outcomes aligned in value we are soon releasing a feature for what we call derived agents.

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Thu, 02 Jul 2026 06:57:50 +0000

This does indeed look really interesting. We have deterministic validations (and some deterministic excel transformations) but using more deterministic transformations for text based on traditional NLP would be a nice complement.

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 23:15:05 +0000

If Claude is good enough for your use case then for sure. If you need scale, persistent structure and verifiability we can help:)

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 22:08:44 +0000

Haha thanks, the reader can try and guess which is which;)

We actually don't use embeddings or vector similarity, since those tend not to work well in specialist domains (e.g. for the OfficeQA benchmark where we have 90k pages talking about US treasury numbers, they would be mostly mapped to a very small embedding space because it's all the same topic, with small variations across years, expense categories etc.).

We use LLMs for the extraction and comparison as well, and we route between different models depending on how complex the comprehension required by the given step is (and by this I mean routing between our pipeline steps; we currently do not dynamically try to judge individual cases for complexity like OpenRouter Fusion).
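As a rough illustration of per-step (rather than per-case) routing, here is a small sketch. The step names beyond extraction and comparison, the model identifiers, and the assignment of tiers to steps are all assumptions made up for the example, not Parsewise's configuration.

```python
from enum import Enum


class Step(Enum):
    EXTRACTION = "extraction"
    COMPARISON = "comparison"
    RESOLUTION = "resolution"


# Hypothetical model tiers; which step gets which tier is an assumption.
MODEL_FOR_STEP = {
    Step.EXTRACTION: "small-fast-model",
    Step.COMPARISON: "mid-size-model",
    Step.RESOLUTION: "large-reasoning-model",
}


def pick_model(step: Step) -> str:
    """Static routing: the model depends only on the pipeline step,
    never on the individual input being processed."""
    return MODEL_FOR_STEP[step]


chosen = pick_model(Step.EXTRACTION)
```

The routing table is fixed ahead of time, so cost and latency per step are predictable; a dynamic router would instead inspect each input before choosing a model.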

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 21:12:58 +0000

I can see why, it's tempting to go for full automation. The reason we go for fine grained sourcing is so that people can build their awareness quickly. Plus many of our customers work in regulated industries where full automation is prohibited.

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 20:29:39 +0000

Potentially, but at that scale cost and latency may actually become an issue, so probably better to consider some sort of indexing or keyword searching.

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 18:35:34 +0000

100% the really hard challenge is that the intermediate representation (ie the parquet equivalent) will be dependent on the given use case. So what we do with the platform is have the users configure the intermediate layer that serves most of their queries, and if they need to extend it we will suggest it for them. For example for the demo on the grounded reasoning benchmark I referred to, here is what the intermediate layer looks like on top of which the agents can more efficiently query: https://demo.parsewise.ai/projects/39bee9d8-d722-4b23-8894-e...

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 18:32:54 +0000

I'll need to check it out!

We had the same observation: the possible space is almost endless, and even for the same file type different kinds of processing may be required (e.g. an Excel file can be database style, small and narrative heavy, or both).

We have baked in some ground processing rules for different kinds of documents, and we do allow custom instructions on how to deal with specific cases (e.g. translations, particular format layouts). The best write-up I have at the moment is https://www.parsewise.ai/doc-processing-pipelines but we're working on something that goes into more detail:)

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 18:14:53 +0000

We were also surprised at first. The reason the models don't do so well is that they need to find information across 90k pages. When they are pointed to the right location they tend to do much better. And with these treasury documents grepping / keyword searching is almost impossible because everything appears thousands of times.

And thank you, we also love the traceability, it's one of the aspects that we have prioritized. Models will never be perfect so rather than building the best model harness we went for the best human harness haha.

Tbh it's been a while since I've looked at notebooklm, so I expect it has gotten better over time. One thing I found lacking in the past was the structure we could get out (which gives the traceability) - for example, a deep dive on one of the underlying data points for this corpus: https://demo.parsewise.ai/projects/39bee9d8-d722-4b23-8894-e...

And yes, we're really excited whenever new open weights models come out that push quality, price, latency. We're finding that throughput is a big obstacle so I'm looking forward to more of this running locally, but it will be a while..

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 17:38:25 +0000

Hey, that's exactly it!

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 17:25:16 +0000

Fully agree, that's why we quite like the Databricks OfficeQA benchmark.. it made us experts on historical US treasuries haha. Some screenshots in here: https://www.parsewise.ai/officeqa-sota

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 17:08:54 +0000

Similar to my other comment, we assume that llamaparse and others can provide the individual page OCR. But once you have that the way that you can integrate it into your workflows often requires additional complexity around combining results from different sources. Here is a deeper dive I wrote on the complexities of building extraction pipelines: https://www.parsewise.ai/doc-processing-pipelines

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 17:04:11 +0000

Hey, good point about structure for integrated workflows:)

Fully agree, for enterprises we need to guarantee types, flag discrepancies and provide underlying sources so they can integrate it downstream (whether that's Databricks, n8n etc.)

Here is our documentation for working with a fixed JSON schema: https://docs.parsewise.ai/api#schema-driven-extract-convenie...
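To make the three requirements above concrete (guaranteed types, flagged discrepancies, underlying sources), here is a minimal sketch. It does not use or mirror the Parsewise API; `Source`, `Candidate`, `FieldResult` and `build_field` are hypothetical names, and the rule "any disagreement between sources is a discrepancy" is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    document: str
    page: int


@dataclass(frozen=True)
class Candidate:
    raw: str        # value as it appears in the document
    source: Source  # where it was found


@dataclass
class FieldResult:
    name: str
    value: Optional[float]  # typed value, or None when sources disagree
    sources: list[Source]
    discrepancy: bool


def build_field(name: str, candidates: list[Candidate]) -> FieldResult:
    # Guarantee the type: every raw value must parse as a float,
    # otherwise float() raises ValueError.
    parsed = [float(c.raw.replace(",", "")) for c in candidates]
    distinct = set(parsed)
    return FieldResult(
        name=name,
        value=parsed[0] if len(distinct) == 1 else None,
        sources=[c.source for c in candidates],
        discrepancy=len(distinct) > 1,
    )


result = build_field(
    "loan_amount",
    [
        Candidate("250,000", Source("application.pdf", 4)),
        Candidate("255000", Source("appraisal.pdf", 40)),
    ],
)
```

A downstream system can then branch on `discrepancy` and show the reviewer both sources, instead of silently picking one of the conflicting values.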

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 16:40:17 +0000

In practice we find that each domain (and even each organisation) ends up having highly customized definitions.

At first, fairly generic templated definitions sort of work, but what we've seen is that over time data comes up that is out of distribution, and there was no explicit instruction on how to deal with it. In such cases we tend to flag this and offer suggestions to the users on how they can improve the specificity of agents.

Another structure we have seen play out is having a manager review ratings and feedback comments from their team and update the definitions accordingly over time (where we offer them the capability to see before and after results side-by-side for all existing data as well, so they are more confident in the change before committing).
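The before/after comparison can be pictured with a small sketch. This is only an illustration under assumptions: results are modelled as plain dicts keyed by a hypothetical document id, and `changed_rows` is an invented helper, not a Parsewise function.

```python
from typing import Optional

# (document id, result under old definition, result under new definition)
Row = tuple[str, Optional[str], Optional[str]]


def changed_rows(before: dict[str, str], after: dict[str, str]) -> list[Row]:
    """List every document whose result differs between the two runs."""
    rows = []
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old != new:
            rows.append((key, old, new))
    return rows


before = {"doc1": "high risk", "doc2": "low risk"}
after = {"doc1": "high risk", "doc2": "medium risk", "doc3": "low risk"}
diff = changed_rows(before, after)
```

Showing only the rows that changed keeps the review short: unchanged documents need no attention before the new definition is committed.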

The amount of work is dependent on how good the initial definitions are and how complex the use case is (and how much it evolves - new data sources etc). A bit of an unsatisfying answer but it can be anywhere between a few hours one off or a couple of minutes per day on an ongoing basis.

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 16:01:42 +0000

Haha no appreciate it! That's on me for not calling it out explicitly (was trying to make the video as short as possible), but the demo UIs were literally vibe coded to show the ease of integration https://youtu.be/F1cSuZal03s?si=1H4zTcO-8cosLbVr&t=70

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 15:16:07 +0000

Planning to serve good things for sure, and appreciate your note. Ofc I didn't agree with everything Palantir was doing (also to the extent that we even knew about them at the time). I was working on vaccine distribution and cancer research as well, so definitely felt like helping.

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 15:06:15 +0000

Great question!

1. We are working with the assumption that OCR is (or soon will be) solved at super low prices.

So if we have the extracted data, what can we do with it? Where we see Parsewise making a difference is for use cases that span across documents. I.e. if you are extracting the same 5 fields from every invoice, there are lots of solutions as you listed (+ reducto etc). However, once you have a set of documents (e.g. an entire mortgage application package) and you are trying to get a structured response out, then your option is either an LLM API (if things fit into context and you are okay with limited citations), or building a pipeline with LLMs. I posted it in another comment but an example of trawling through 90k pages is here: https://www.parsewise.ai/officeqa-sota

2. As long as we rely on LLMs, the outcomes will be non-deterministic, so the bottleneck is and will remain human verification (at least for somewhat complex use cases). The architecture we have built optimizes for the human reviewer by providing values and citations that are as granular as possible, either through our platform or through API clients.

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 14:53:30 +0000

I learnt a lot at Palantir, though I always worked in commercial so no ties to the security state (for better or worse). (Also side-note, we are working towards enabling frontier performance with smaller open models, which allows our customers to protect their data. https://www.parsewise.ai/officeqa-sota )

And I do get genuine joy from helping our users, so love it is:)

New comment by gergelycsegzi in "Launch HN: Parsewise (YC P25) – Reason Across Documents with an API"

gergelycsegzi — Wed, 01 Jul 2026 14:46:58 +0000

"That is a great catch!"