<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: nhirschfeld</title><link>https://news.ycombinator.com/user?id=nhirschfeld</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 04 Apr 2026 13:55:07 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=nhirschfeld" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[Show HN: Liter-LLM, Universal LLM client in Rust with bindings for 11 languages]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/kreuzberg-dev/liter-llm">https://github.com/kreuzberg-dev/liter-llm</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47561123">https://news.ycombinator.com/item?id=47561123</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Sun, 29 Mar 2026 07:36:11 +0000</pubDate><link>https://github.com/kreuzberg-dev/liter-llm</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=47561123</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47561123</guid></item><item><title><![CDATA[Show HN: Kreuzberg Comparative Benchmarks]]></title><description><![CDATA[
<p>Article URL: <a href="https://kreuzberg.dev/benchmarks">https://kreuzberg.dev/benchmarks</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46986701">https://news.ycombinator.com/item?id=46986701</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 12 Feb 2026 09:38:50 +0000</pubDate><link>https://kreuzberg.dev/benchmarks</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=46986701</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46986701</guid></item><item><title><![CDATA[Show HN: Kreuzberg v3.0 – Modern Python Document Extraction]]></title><description><![CDATA[
<p>I'm excited to announce Kreuzberg v3.0, which was released yesterday.<p>Kreuzberg is an MIT-licensed Python library that extracts text from a wide range of documents (PDFs, images, office files, etc.) without depending on external APIs.<p>It differs from other libraries and commercial offerings in this space by being designed to be (1) lightweight, (2) CPU-oriented, (3) simple to use and (4) built with async support as a first-class citizen.<p>The v3.0 release completely reworks the architecture for extensibility. Kreuzberg now supports:<p>-  Multiple OCR backends (Tesseract, PaddleOCR, EasyOCR), with OCR itself being completely optional.
-  Custom extractors and overriding of built-in extractors.
-  Post-processing and validation hooks.
-  Extensive PDF metadata extraction. 
-  Optional support for semantic chunking.<p>There is also a brand new documentation site at <a href="https://goldziher.github.io/kreuzberg" rel="nofollow">https://goldziher.github.io/kreuzberg</a>.<p>I also published a roadmap for the project, which you can see here: <a href="https://github.com/Goldziher/kreuzberg/discussions/24" rel="nofollow">https://github.com/Goldziher/kreuzberg/discussions/24</a><p>You can see the repo at <a href="https://github.com/Goldziher/kreuzberg" rel="nofollow">https://github.com/Goldziher/kreuzberg</a> - please star it if you find it valuable, since this motivates me!</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43459261">https://news.ycombinator.com/item?id=43459261</a></p>
<p>Points: 5</p>
<p># Comments: 0</p>
]]></description><pubDate>Mon, 24 Mar 2025 10:24:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=43459261</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43459261</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43459261</guid></item><item><title><![CDATA[Ask HN: Interest in a pgvector-based RAG system library?]]></title><description><![CDATA[
<p>I built a RAG system using pgvector as the backend for local-first vector search. I've already extracted and open-sourced the text extraction component as Kreuzberg (https://github.com/Goldziher/kreuzberg), separate from my main business (https://grantflow.ai).<p>The core system is fairly generic and could work for many use cases with minimal changes. Before investing time in packaging it as a library, I'm curious:<p>- Would the HN community find value in a pgvector-based RAG library?
- What features would be most important to you?
- What belongs in open source vs. commercial offerings?
- What common pitfalls should be avoided?<p>I'd like to gauge whether there's actual interest before publishing something nobody will use, so your feedback is most welcome!</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43370887">https://news.ycombinator.com/item?id=43370887</a></p>
<p>Points: 3</p>
<p># Comments: 2</p>
]]></description><pubDate>Sat, 15 Mar 2025 08:06:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=43370887</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43370887</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43370887</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>You'll need to use a different OCR engine. Look at EasyOCR.</p>
]]></description><pubDate>Sun, 16 Feb 2025 07:45:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=43066178</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43066178</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43066178</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>Yes, there have already been several suggestions here for other backends, etc.<p>You should try using a different PSM to see if you get better results.<p>If it's scientific texts specifically, look at GROBID.</p>
]]></description><pubDate>Sun, 16 Feb 2025 07:43:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=43066174</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43066174</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43066174</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>You still need to write it to a file to process it via Pandoc, Tesseract, etc.<p>There are alternatives to Tesseract, of course.</p>
]]></description><pubDate>Sun, 16 Feb 2025 07:41:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=43066164</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43066164</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43066164</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>That's why Kreuzberg also exposes a sync API for you to consume.</p>
]]></description><pubDate>Sat, 15 Feb 2025 18:23:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=43060796</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43060796</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43060796</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>Didn't know this!</p>
]]></description><pubDate>Sat, 15 Feb 2025 18:23:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=43060790</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43060790</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43060790</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>I haven't; testing it out is on my to-do list for sure.</p>
]]></description><pubDate>Sat, 15 Feb 2025 18:22:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=43060782</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43060782</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43060782</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>I googled this for a while...</p>
]]></description><pubDate>Sat, 15 Feb 2025 18:20:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=43060763</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43060763</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43060763</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>I'm actually considering another library with an optional API called `Kreuzköln` - probably without the umlaut!</p>
]]></description><pubDate>Sat, 15 Feb 2025 18:20:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=43060755</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43060755</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43060755</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>Retrieval Augmented Generation. It's a class of techniques for generating content using LLMs. I'd recommend Googling this.</p>
]]></description><pubDate>Sat, 15 Feb 2025 18:19:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=43060742</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43060742</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43060742</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>Thanks for asking!<p>It's both. The OCR part is of course CPU-bound, but the entire text extraction involves reading files, or writing and then reading files.<p>Without async, these simply block.<p>As for efficiency - if you're working in an async application context, you have to "asyncify" these operations or suffer the consequences.</p>
]]></description><pubDate>Sat, 15 Feb 2025 16:48:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=43059936</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43059936</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43059936</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>Amazing, I would be interested in reading about your experience.</p>
]]></description><pubDate>Sat, 15 Feb 2025 16:44:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=43059895</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43059895</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43059895</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>Sorry to hear...</p>
]]></description><pubDate>Sat, 15 Feb 2025 14:03:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=43058549</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43058549</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43058549</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>Yup, EasyOCR is good.<p>My reasons for using Tesseract: EasyOCR is larger, and it has a significant cold start.<p>It benchmarks better for many OCR tasks though, so I'm thinking of adding it as an alternative backend.</p>
]]></description><pubDate>Sat, 15 Feb 2025 14:02:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=43058547</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43058547</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43058547</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>interesting!</p>
]]></description><pubDate>Sat, 15 Feb 2025 13:10:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=43058241</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43058241</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43058241</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>lol ;).<p>But seriously, in 13 years of living here, only one guy has tried to pickpocket me.</p>
]]></description><pubDate>Sat, 15 Feb 2025 13:09:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=43058237</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43058237</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43058237</guid></item><item><title><![CDATA[New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"]]></title><description><![CDATA[
<p>Thanks, I'll check these links.<p>In my tests I found Tesseract quite good for regular text documents. For other kinds of texts it's not great.<p>As for using models - there are some good small language models as well, and of course LLMs.<p>I sorta feel though that if one needs complex OCR, or a vision model for layout, one should opt for either a commercial solution that abstracts the deployment and GPU management, or build one's own system.<p>For most use cases involving text documents though, my subjective opinion is that Tesseract is sufficient.</p>
]]></description><pubDate>Sat, 15 Feb 2025 11:46:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=43057827</link><dc:creator>nhirschfeld</dc:creator><comments>https://news.ycombinator.com/item?id=43057827</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43057827</guid></item></channel></rss>