<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: freezed8</title><link>https://news.ycombinator.com/user?id=freezed8</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 17 Apr 2026 19:04:27 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=freezed8" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[Show HN: LiteParse, a fast open-source document parser for AI agents]]></title><description><![CDATA[
<p>LiteParse is an open-source (Apache 2.0) document parser that provides high-quality spatial text parsing with bounding boxes. It does not depend on local or frontier VLMs.<p>Because it does not require GPUs, LiteParse runs on any machine and can process a few hundred pages of documents in seconds. It offers higher accuracy than similar tools like PyPDF, PyMuPDF, and MarkItDown.<p>It supports a variety of file formats: PDFs, Office documents, and images. It can be installed in one line as a skill for 40+ different AI agents, including Claude Code, Cursor, OpenClaw, Windsurf, and more.</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47457128">https://news.ycombinator.com/item?id=47457128</a></p>
<p>Points: 12</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 20 Mar 2026 16:43:09 +0000</pubDate><link>https://github.com/run-llama/liteparse</link><dc:creator>freezed8</dc:creator><comments>https://news.ycombinator.com/item?id=47457128</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47457128</guid></item><item><title><![CDATA[New comment by freezed8 in "Don't bother parsing: Just use images for RAG"]]></title><description><![CDATA[
<p>This blog post makes some good points about using vision models for retrieval, but I do want to call out a few problems: 
1. The blog conflates indexing/retrieval with document parsing. Document parsing is the task of converting a document into a structured text representation, whether that's markdown/JSON (or, in the case of extraction, an output that conforms to a schema). RAG is one of its many uses, but plenty of them are not RAG-related at all.<p>ColPali is great for retrieval, but you can't use ColPali (at least natively) for pure document parsing tasks. There are a lot of separate benchmarks for evaluating doc parsing alone, while the author mostly talks about visual retrieval benchmarks.<p>2. This whole idea of "You can DIY document parsing by screenshotting a page" is not new at all; lots of people have been talking about it! It's certainly fine as a baseline and does work better than standard OCR in many cases.<p>a. But from our experience there's still a long tail of accuracy issues.
b. Out of the box, it's missing metadata like confidence scores and bounding boxes.
c. Honestly this is underrated, but building a good screenshotting pipeline is itself non-trivial.<p>3. In general for retrieval, it's helpful to have both text and image representations. Image tokens are obviously much more powerful, but text tokens are way cheaper to store and let you do things like retrieve entire documents (instead of just chunks) and feed those into the LLM.<p>(disclaimer: I am CEO of LlamaIndex, and we have worked on both document parsing and retrieval with LlamaCloud, but I hope my point stands in a general sense)</p>
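To make point (b) concrete, a parser that exposes spatial metadata might return spans shaped roughly like this. This is a hypothetical sketch (the class name, fields, and values are illustrative, not any particular tool's actual schema):

```python
from dataclasses import dataclass

@dataclass
class TextSpan:
    """One parsed text span, carrying the metadata point (b) refers to."""
    text: str
    page: int          # 1-indexed page number
    bbox: tuple        # (x0, y0, x1, y1) in page coordinates
    confidence: float  # OCR confidence in [0.0, 1.0]

# A downstream RAG app can use bbox for citations/highlighting
# and confidence to flag low-quality OCR regions for review.
span = TextSpan("Net revenue: $4.2M", page=3,
                bbox=(72.0, 540.5, 310.2, 552.0), confidence=0.97)
needs_review = span.confidence < 0.8
print(needs_review)  # False
```

Screenshot-plus-VLM pipelines typically give you only the text, which is why this metadata has to be bolted on separately.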
]]></description><pubDate>Tue, 22 Jul 2025 16:24:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=44649353</link><dc:creator>freezed8</dc:creator><comments>https://news.ycombinator.com/item?id=44649353</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44649353</guid></item><item><title><![CDATA[New comment by freezed8 in "Ask HN: Spreadsheet LLM Understanding"]]></title><description><![CDATA[
<p>hi! (I'm Jerry, CEO of LlamaIndex)<p>Would be happy to chat and show you our Excel agent capabilities - feel free to send us a message at support@runllama.ai.<p>Financial statements like P&L are actually our sweet spot atm, so this sounds like a good fit</p>
]]></description><pubDate>Thu, 26 Jun 2025 00:14:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=44383034</link><dc:creator>freezed8</dc:creator><comments>https://news.ycombinator.com/item?id=44383034</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44383034</guid></item><item><title><![CDATA[New comment by freezed8 in "Ingesting PDFs and why Gemini 2.0 changes everything"]]></title><description><![CDATA[
<p>Yeah, "disable OCR" is for documents where you don't need to OCR a scanned image - it'll just parse out the embedded text directly.<p>It's faster when set to True</p>
]]></description><pubDate>Thu, 06 Feb 2025 00:51:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=42957577</link><dc:creator>freezed8</dc:creator><comments>https://news.ycombinator.com/item?id=42957577</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42957577</guid></item><item><title><![CDATA[New comment by freezed8 in "Ingesting PDFs and why Gemini 2.0 changes everything"]]></title><description><![CDATA[
<p>yes! We have foreign-language support for better OCR on scans. Here are some more details:
Docs: <a href="https://docs.cloud.llamaindex.ai/llamaparse/features/parsing_options" rel="nofollow">https://docs.cloud.llamaindex.ai/llamaparse/features/parsing...</a>
Notebook: <a href="https://github.com/run-llama/llama_parse/blob/main/examples/demo_languages.ipynb">https://github.com/run-llama/llama_parse/blob/main/examples/...</a></p>
]]></description><pubDate>Thu, 06 Feb 2025 00:19:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=42957341</link><dc:creator>freezed8</dc:creator><comments>https://news.ycombinator.com/item?id=42957341</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42957341</guid></item><item><title><![CDATA[New comment by freezed8 in "Ingesting PDFs and why Gemini 2.0 changes everything"]]></title><description><![CDATA[
<p>(disclaimer: I am CEO of LlamaIndex, which includes LlamaParse)<p>Nice article! We're actively benchmarking Gemini 2.0 right now, and if the results are as good as this article implies, heck, we'll adapt and improve on it. Our goal (and in fact the reason our parser works so well) is to always use and stay on top of the latest SOTA models and tech :) - we blend LLM/VLM tech with best-in-class heuristic techniques.<p>Some quick notes:
1. I'm glad that LlamaParse is mentioned in the article, but it's not included in the performance benchmarks. I'm pretty confident our most accurate modes would be at the top of the benchmark table - our stuff is pretty good.<p>2. There's a <i>long</i> tail of issues beyond just tables - this includes fonts, headers/footers, the ability to recognize charts/images/form fields, and, as other posters said, the ability to attach fine-grained bounding boxes to the source elements. We've optimized our parser to tackle all of these cases, and we need proper benchmarks for that.<p>3. DIY'ing your own pipeline to run a VLM at scale to parse docs is surprisingly challenging. You need to orchestrate a robust system that can screenshot a bunch of pages at the right resolution (which can be quite slow), tune the prompts, and make sure you're obeying rate limits and can retry on failure.</p>
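The retry/rate-limit piece of point 3 alone is a chunk of work. A generic sketch of the kind of wrapper a DIY pipeline needs around every VLM call (illustrative only - not our actual pipeline code, and the flaky endpoint is simulated):

```python
import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=1.0):
    """Retry with exponential backoff and jitter - the wrapper a DIY
    VLM-parsing pipeline needs around every rate-limited API call."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            # Back off exponentially, with jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))

# Simulate an endpoint that rate-limits twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 rate limited")
    return "parsed page"

result = call_with_retries(flaky, base_delay=0.01)
print(result)  # parsed page
```

And that's just retries - screenshot resolution, prompt tuning, and result validation each add their own failure modes on top.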
]]></description><pubDate>Wed, 05 Feb 2025 23:50:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=42957085</link><dc:creator>freezed8</dc:creator><comments>https://news.ycombinator.com/item?id=42957085</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42957085</guid></item><item><title><![CDATA[New comment by freezed8 in "Why we no longer use LangChain for building our AI agents"]]></title><description><![CDATA[
<p>(Jerry here from LlamaIndex)<p>Wait, do you have specific examples of "overengineering and overabstracting" from LlamaIndex? Very open to feedback and suggestions for improvement - we've put a lot of work into making sure everything is customizable</p>
]]></description><pubDate>Fri, 21 Jun 2024 18:51:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=40752482</link><dc:creator>freezed8</dc:creator><comments>https://news.ycombinator.com/item?id=40752482</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40752482</guid></item><item><title><![CDATA[New comment by freezed8 in "LlamaCloud and LlamaParse"]]></title><description><![CDATA[
<p>(Jerry here)<p>Thanks for running through the benchmark! Just to clarify a few things:
(1) The idea is that LlamaParse's markdown representation lends itself to the rest of LlamaIndex's advanced indexing/retrieval abstractions. Recursive retrieval is a fancy retrieval method designed to model documents with embedded objects, but it depends on good PDF parsing; naive PyPDF parsing can't be used with recursive retrieval. Our goal is to demonstrate the e2e RAG capabilities of LlamaParse + advanced retrieval vs. what you can build with a naive PDF parser.<p>(2) Since we use LLM-based evals, your correctness and relevancy metrics look consistent and within margin of error (and lower than our LlamaParse metrics). The faithfulness score seems way off though - it's quite high on your side, so I'm not sure what's going on there. Maybe hop into our Discord and share the results in our channel?</p>
]]></description><pubDate>Wed, 21 Feb 2024 08:09:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=39451263</link><dc:creator>freezed8</dc:creator><comments>https://news.ycombinator.com/item?id=39451263</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39451263</guid></item><item><title><![CDATA[New comment by freezed8 in "LlamaIndex raises $8.5M seed round, led by Greylock Partners"]]></title><description><![CDATA[
<p>Hi all - Jerry (co-founder/CEO) here, happy to answer any questions you might have!<p>We're building a data framework to unlock the full capabilities of LLMs on top of your private data. We can’t wait for the future - this space is moving so rapidly, and there are so many things we want to do on both the open-source and enterprise sides.<p>Feel free to shoot me a personal note on Twitter/Discord as well.</p>
]]></description><pubDate>Tue, 06 Jun 2023 16:36:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=36215406</link><dc:creator>freezed8</dc:creator><comments>https://news.ycombinator.com/item?id=36215406</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36215406</guid></item><item><title><![CDATA[LlamaIndex raises $8.5M seed round, led by Greylock Partners]]></title><description><![CDATA[
<p>Article URL: <a href="https://medium.com/llamaindex-blog/building-the-data-framework-for-llms-bca068e89e0e">https://medium.com/llamaindex-blog/building-the-data-framework-for-llms-bca068e89e0e</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=36214843">https://news.ycombinator.com/item?id=36214843</a></p>
<p>Points: 33</p>
<p># Comments: 17</p>
]]></description><pubDate>Tue, 06 Jun 2023 16:03:21 +0000</pubDate><link>https://medium.com/llamaindex-blog/building-the-data-framework-for-llms-bca068e89e0e</link><dc:creator>freezed8</dc:creator><comments>https://news.ycombinator.com/item?id=36214843</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36214843</guid></item></channel></rss>