Hacker News: nhirschfeld

New comment by nhirschfeld in "Show HN: Kreuzberg Cloud – ultra fast content intelligence – in public beta"

nhirschfeld — Wed, 20 May 2026 09:24:37 +0000

Hi HN!

I'm the maintainer of Kreuzberg, an open-source document intelligence library (https://github.com/kreuzberg-dev/kreuzberg). Some of you may have used it for RAG ingestion.

We're launching Kreuzberg Cloud, a SaaS API and a self-hosted system. It's in public beta, and I would like to invite you all to give it a try.

What our MVP offers: very fast, CPU-optimized document and code intelligence. You can extract content from more than 90 document file formats and 300 code file formats into Markdown (or plaintext/djot), with additional features (same pricing tier) including chunking, embeddings, keyword extraction, and various types of intelligence.

The OSS library is used as the base engine of the cloud system. Our initial offering is $0.008/page, and you get the first 10K pages free, no card required.

We also offer our entire system for self-hosting, using Helm charts. We are looking for design partners, so if that's relevant, drop me a line.
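For a rough sense of what the pricing quoted above means in practice, here is a minimal sketch of a cost estimate. It assumes the 10K free pages are a one-time allowance subtracted before billing and that there are no other fees or volume tiers; the post does not spell out either point, and the function name `estimate_cost` is purely illustrative.

```python
from decimal import Decimal

# Figures taken from the post; everything else here is an assumption.
PRICE_PER_PAGE = Decimal("0.008")  # USD per page
FREE_PAGES = 10_000                # assumed to be a one-time free allowance


def estimate_cost(pages: int) -> Decimal:
    """Estimate the bill in USD for a given number of processed pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    # Only pages beyond the free allowance are billed.
    billable = max(0, pages - FREE_PAGES)
    return billable * PRICE_PER_PAGE


# 25,000 pages -> 15,000 billable pages at $0.008 each = $120
cost = estimate_cost(25_000)
```

Using `Decimal` rather than floats keeps the per-page arithmetic exact.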

Show HN: Kreuzberg Cloud – ultra fast content intelligence – in public beta

nhirschfeld — Wed, 20 May 2026 09:24:29 +0000

Article URL: https://kreuzberg.dev

Comments URL: https://news.ycombinator.com/item?id=48205129

Points: 5

# Comments: 4

Show HN: Liter-LLM, Universal LLM client in Rust with bindings for 11 languages

nhirschfeld — Sun, 29 Mar 2026 07:36:11 +0000

Article URL: https://github.com/kreuzberg-dev/liter-llm

Comments URL: https://news.ycombinator.com/item?id=47561123

Points: 2

# Comments: 0

Show HN: Kreuzberg Comparative Benchmarks

nhirschfeld — Thu, 12 Feb 2026 09:38:50 +0000

Article URL: https://kreuzberg.dev/benchmarks

Comments URL: https://news.ycombinator.com/item?id=46986701

Points: 1

# Comments: 0

Show HN: Kreuzberg v3.0 – Modern Python Document Extraction

nhirschfeld — Mon, 24 Mar 2025 10:24:32 +0000

I'm excited to announce Kreuzberg v3.0, which was released yesterday.

Kreuzberg is an MIT-licensed Python library that extracts text from a wide range of documents (PDFs, images, office files, etc.) without depending on external APIs.

It differs from other libraries and commercial offerings in this space by being designed to be (1) lightweight, (2) CPU-oriented, (3) simple to use, and (4) to have async support as a first-class citizen.

The v3.0 release completely reworks the architecture for extensibility. Kreuzberg now supports:

- Multiple OCR backends (Tesseract, PaddleOCR, EasyOCR), with OCR itself being completely optional.
- Custom extractors and overriding of built-in extractors.
- Post-processing and validation hooks.
- Extensive PDF metadata extraction.
- Optional semantic chunking.

There is also a brand new documentation site at https://goldziher.github.io/kreuzberg.

I also published a roadmap for the project, which you can see here: https://github.com/Goldziher/kreuzberg/discussions/24

You can see the repo at https://github.com/Goldziher/kreuzberg - please star it if you find it valuable, since this motivates me!

Comments URL: https://news.ycombinator.com/item?id=43459261

Points: 5

# Comments: 0

Ask HN: Interest in a pgvector-based RAG system library?

nhirschfeld — Sat, 15 Mar 2025 08:06:21 +0000

I built a RAG system using pgvector as the backend for local-first vector search. I've already extracted and open-sourced the text extraction component as Kreuzberg (https://github.com/Goldziher/kreuzberg), separate from my main business (https://grantflow.ai).

The core system is fairly generic and could work for many use cases with minimal changes. Before investing time in packaging it as a library, I'm curious:

- Would the HN community find value in a pgvector-based RAG library?
- What features would be most important to you?
- What belongs in open source vs. commercial offerings?
- What common pitfalls should be avoided?

I'd like to gauge whether there's actual interest before publishing something nobody will use, so your feedback is most welcome!

Comments URL: https://news.ycombinator.com/item?id=43370887

Points: 3

# Comments: 2

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sun, 16 Feb 2025 07:45:01 +0000

You'll need to use a different OCR engine. Look at EasyOCR.

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sun, 16 Feb 2025 07:43:55 +0000

Yes, there have already been several suggestions here for other backends etc.

You should try using a different PSM (Tesseract's page segmentation mode) to see if you get better results.

If it's scientific texts specifically, look at GROBID.
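To make the PSM suggestion concrete, here is a small sketch that builds a Tesseract command line with an explicit page segmentation mode. It uses the Tesseract CLI directly (the `--psm` and `-l` flags), not Kreuzberg's own configuration, and the helper name `tesseract_command` is made up for illustration.

```python
def tesseract_command(image_path: str, psm: int = 3, lang: str = "eng") -> list[str]:
    """Build a Tesseract CLI invocation that prints recognized text to stdout.

    Tesseract accepts PSM values 0-13; 3 is its default (fully automatic
    page segmentation), and 6 treats the page as a single uniform block of text.
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    # "stdout" as the output base tells Tesseract to write text to standard output.
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]


# Hypothetical input file name; try a few modes and compare the results.
cmd = tesseract_command("scan.png", psm=6)
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where Tesseract is installed.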

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sun, 16 Feb 2025 07:41:00 +0000

You still need to write it to file to process it via pandoc/tesseract etc.

There are alternative options to tesseract ofc.

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sat, 15 Feb 2025 18:23:59 +0000

That's why Kreuzberg also exposes a sync API for you to consume.

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sat, 15 Feb 2025 18:23:14 +0000

Didn't know this!

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sat, 15 Feb 2025 18:22:20 +0000

I haven't, testing it out is on my todo list for sure

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sat, 15 Feb 2025 18:20:58 +0000

I googled this for a while...

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sat, 15 Feb 2025 18:20:10 +0000

I'm actually considering another library with an optional API called `Kreuzköln` - probably without the umlaut!

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sat, 15 Feb 2025 18:19:21 +0000

Retrieval Augmented Generation. It's a class of techniques for generating content using LLMs. I'd recommend Googling this.

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sat, 15 Feb 2025 16:48:44 +0000

Thanks for asking!

It's both. The OCR part is ofc CPU bound, but the entire text extraction involves reading files, or writing and then reading files.

Without async, these simply block.

As for efficiency - if you're working in an async application context you have to "asyncify" these operations or suffer the consequences.
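As a generic illustration of what "asyncifying" a blocking operation can look like, here is a minimal sketch using the standard library's `asyncio.to_thread`. This is not Kreuzberg's actual implementation; the function names are hypothetical and a plain file read stands in for the extraction work.

```python
import asyncio
import tempfile
from pathlib import Path


def read_file_blocking(path: Path) -> bytes:
    # Ordinary blocking I/O: called directly inside a coroutine,
    # this would stall the event loop until the read finishes.
    return path.read_bytes()


async def read_file_async(path: Path) -> bytes:
    # Run the blocking call in a worker thread so the event loop
    # stays free to serve other tasks in the meantime.
    return await asyncio.to_thread(read_file_blocking, path)


async def read_many(paths: list[Path]) -> list[bytes]:
    # Schedule all reads concurrently; results come back in input order.
    return await asyncio.gather(*(read_file_async(p) for p in paths))


def demo() -> list[bytes]:
    # Write two small temporary files and read them back concurrently.
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "a.txt"
        second = Path(tmp) / "b.txt"
        first.write_bytes(b"one")
        second.write_bytes(b"two")
        return asyncio.run(read_many([first, second]))
```

Threads help with the file I/O side; for the CPU-bound OCR step, a process pool or an external process is the more usual choice.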

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sat, 15 Feb 2025 16:44:38 +0000

Amazing, I'd be interested in reading about your experience.

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sat, 15 Feb 2025 14:03:09 +0000

Sorry to hear...

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sat, 15 Feb 2025 14:02:41 +0000

Yup, EasyOCR is good.

My reasons for using Tesseract: EasyOCR is larger, and it has a significant cold start.

It benchmarks better for many OCR tasks though, so I'm thinking of adding it as an alternative backend.

New comment by nhirschfeld in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

nhirschfeld — Sat, 15 Feb 2025 13:10:28 +0000

interesting!