New comment by rapjul in "Show HN: Kreuzberg – Modern async Python library for document text extraction"

rapjul — Sat, 22 Feb 2025 01:04:13 +0000

Docling works quite well for me for converting a scanned book PDF to Markdown text.

On the command line, first install `uv` (see https://github.com/astral-sh/uv?tab=readme-ov-file#installat...), then run `uv tool install -U "docling[tesserocr,ocrmac,vlm]"`. The extras in brackets pull in tesserocr, ocrmac (macOS only), and vlm (for running a small image-to-text model to get descriptions of images).

See https://github.com/DS4SD/docling/blob/main/pyproject.toml#L1... for all the extra installation options.

For cached/offline use, run `docling-tools models download` to download their models.
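
If you would rather drive the conversion from Python than from the CLI, something like the following should work. It is only a rough sketch: it assumes docling is importable in your own environment (for example via `pip install docling`, since `uv tool install` puts the CLI in an isolated environment), it uses the default pipeline settings, and `markdown_path_for`, `convert_pdf_to_markdown`, and the file names are made up for illustration.

```python
from pathlib import Path


def markdown_path_for(pdf_path: str, out_dir: str = "out") -> Path:
    """Hypothetical helper: where the Markdown for pdf_path gets written."""
    return Path(out_dir) / (Path(pdf_path).stem + ".md")


def convert_pdf_to_markdown(pdf_path: str, out_dir: str = "out") -> Path:
    """Convert one PDF to Markdown with docling's default settings."""
    # Imported lazily so the helper above works without docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # default options (an assumption; tune OCR as needed)
    result = converter.convert(pdf_path)

    target = markdown_path_for(pdf_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return target


# Example call (hypothetical file name):
# convert_pdf_to_markdown("scanned_book.pdf")
```

The first call may still download models unless you have already fetched them with the `docling-tools` command above.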

