<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: vikp</title><link>https://news.ycombinator.com/user?id=vikp</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Tue, 07 Apr 2026 07:47:09 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=vikp" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by vikp in "Mistral OCR 3"]]></title><description><![CDATA[
<p>Hey, I'm the founder of Datalab (we released Chandra OCR). I see someone requested it below - happy to help you all get set up.  I'm vik@datalab.to</p>
]]></description><pubDate>Sat, 20 Dec 2025 00:26:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=46332587</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=46332587</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46332587</guid></item><item><title><![CDATA[New comment by vikp in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"]]></title><description><![CDATA[
<p>Yes, we can sign a BAA!</p>
]]></description><pubDate>Thu, 18 Dec 2025 22:26:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=46319687</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=46319687</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46319687</guid></item><item><title><![CDATA[New comment by vikp in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"]]></title><description><![CDATA[
<p>Hi, I'm a founder of Datalab.  I'm not trying to take away from the launch (congrats), just wanted to respond to the specific feedback.<p>I'm glad you found a solution that worked for you, but this is pretty surprising to hear - our new model, chandra, saturates handwriting-heavy benchmarks like this one - <a href="https://www.datalab.to/blog/saturating-the-olmocr-benchmark" rel="nofollow">https://www.datalab.to/blog/saturating-the-olmocr-benchmark</a>, and our production models are more performant than our OSS ones.<p>Did you test some time ago?  We've made a bunch of updates in the last couple of months. Happy to issue some credits if you ever want to try again - vik@datalab.to.</p>
]]></description><pubDate>Thu, 18 Dec 2025 17:31:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=46315821</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=46315821</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46315821</guid></item><item><title><![CDATA[New comment by vikp in "Nanonets-OCR-s – OCR model that transforms documents into structured markdown"]]></title><description><![CDATA[
<p>I assume you're using a PDF, and not the image you shared?  You need to set force_ocr or format_lines to get inline math with a PDF (for images, we just OCR everything anyways, so you don't need any settings).<p>We're working on improving the playground generally now - expect a big update tomorrow, which among other things will default to format_lines.<p>Thanks for the kind words!  The team was just me until pretty recently, but we're growing quickly and will be addressing a lot of issues in the next few weeks.</p>
]]></description><pubDate>Wed, 18 Jun 2025 02:41:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=44306210</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=44306210</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44306210</guid></item><item><title><![CDATA[New comment by vikp in "Nanonets-OCR-s – OCR model that transforms documents into structured markdown"]]></title><description><![CDATA[
<p>Hi, author of marker here - I tried your image, and I don't see the issues you're describing with the newest version of marker (1.7.5).<p>I ran both with no setting specified, and with force_ocr, and I didn't see the issues either time.</p>
]]></description><pubDate>Tue, 17 Jun 2025 01:36:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=44294953</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=44294953</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44294953</guid></item><item><title><![CDATA[New comment by vikp in "Mistral OCR"]]></title><description><![CDATA[
<p>Thanks for sharing!  I'm training some models now that will hopefully improve this and more :)</p>
]]></description><pubDate>Fri, 07 Mar 2025 02:42:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=43287027</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=43287027</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43287027</guid></item><item><title><![CDATA[New comment by vikp in "Mistral OCR"]]></title><description><![CDATA[
<p>Benchmarking is hard for markdown because of the slight formatting variations between different providers.  With HTML, you can use something like TEDS (although there are issues with this, too), but with markdown, you don't have a great notion of structure, so you're left with edit distance.<p>I think blockwise edit distance is better than full page (find the ground truth blocks, then infer each block separately and compare), but many providers only do well on full pages, which makes a blockwise comparison unfair to them.<p>There are a few different benchmark types in the marker repo:<p><pre><code>  - Heuristic (edit distance by block with an ordering score)
  - LLM judging against a rubric
  - LLM win rate (compare two samples from different providers)
</code></pre>
None of these are perfect, but LLM against a rubric has matched visual inspection the best so far.<p>I'll continue to iterate on the benchmarks.  It may be possible to do a TEDS-like metric for markdown.  Training a model on the output and then benchmarking could also be interesting, but it gets away from measuring pure extraction quality (the model benchmarking better is only somewhat correlated with better parse quality).  I haven't seen any great benchmarking of markdown quality, even at research labs - it's an open problem.</p>
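To make the blockwise edit-distance idea concrete, here is a minimal sketch (the function names are illustrative, not marker's actual benchmark code; a real harness would also align blocks and score ordering):

```python
from difflib import SequenceMatcher

def block_score(truth_block: str, pred_block: str) -> float:
    # ratio() is 1.0 for identical strings and falls toward 0.0 as edits
    # accumulate, so it behaves like a normalized inverse edit distance.
    return SequenceMatcher(None, truth_block, pred_block).ratio()

def page_score(truth_blocks, pred_blocks) -> float:
    # Score ground-truth blocks against predicted blocks pairwise and average.
    scores = [block_score(t, p) for t, p in zip(truth_blocks, pred_blocks)]
    return sum(scores) / len(scores) if scores else 0.0

print(page_score(["# Results", "mistral scores 4.32"],
                 ["# Results", "mistral scores 4.37"]))
```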
]]></description><pubDate>Fri, 07 Mar 2025 02:41:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=43287020</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=43287020</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43287020</guid></item><item><title><![CDATA[New comment by vikp in "Mistral OCR"]]></title><description><![CDATA[
<p>I ran a partial benchmark against marker - <a href="https://github.com/VikParuchuri/marker">https://github.com/VikParuchuri/marker</a> .<p>Across 375 samples with LLM as a judge, mistral scores 4.32, and marker 4.41.  Marker can run inference at between 20 and 120 pages per second on an H100.<p>You can see the samples here - <a href="https://huggingface.co/datasets/datalab-to/marker_comparison_mistral_llm" rel="nofollow">https://huggingface.co/datasets/datalab-to/marker_comparison...</a> .<p>The code for the benchmark is here - <a href="https://github.com/VikParuchuri/marker/tree/master/benchmarks">https://github.com/VikParuchuri/marker/tree/master/benchmark...</a> .  Will run a full benchmark soon.<p>Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.</p>
]]></description><pubDate>Thu, 06 Mar 2025 23:01:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285912</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=43285912</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285912</guid></item><item><title><![CDATA[New comment by vikp in "OlmOCR: Open-source tool to extract plain text from PDFs"]]></title><description><![CDATA[
<p>I'm a fan of the team at Allen AI and their work.  Unfortunately, the benchmarking of olmocr against marker (<a href="https://github.com/VikParuchuri/marker">https://github.com/VikParuchuri/marker</a>) is quite flawed.<p>Throughput - they benchmarked marker API cost vs local inference cost for olmocr.  In our testing, marker locally gets 20 - 120 pages per second on an H100 (without custom kernels, etc).  Olmocr in our testing gets between 0.4 (unoptimized) and 4 (sglang) pages per second on the same machine.<p>Accuracy - their quality benchmarks are based on win rate with only 75 samples - which are different between each tool pair.  The samples were filtered down from a set of ~2000 based on opaque criteria.  They then asked researchers at Allen AI to judge which output was better.  When we benchmarked with our existing set and LLM as a judge, we got a 56% win rate for marker across 1,107 documents.  We had to filter out non-English docs, since olmocr is English-only (marker is not).<p>Hallucinations/other problems - we noticed a lot of missing text and hallucinations with olmocr in our benchmark set.  You can see sample output and llm ratings here - <a href="https://huggingface.co/datasets/datalab-to/marker_benchmark_comparison_olmocr_llm" rel="nofollow">https://huggingface.co/datasets/datalab-to/marker_benchmark_...</a> .<p>You can see all benchmark code at <a href="https://github.com/VikParuchuri/marker/tree/master/benchmarks">https://github.com/VikParuchuri/marker/tree/master/benchmark...</a> .<p>Happy to chat more with anyone at Allen AI who wants to discuss this.  I think olmocr is a great contribution - happy to help you benchmark marker more fairly.</p>
]]></description><pubDate>Sat, 01 Mar 2025 00:45:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=43214140</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=43214140</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43214140</guid></item><item><title><![CDATA[New comment by vikp in "Ingesting PDFs and why Gemini 2.0 changes everything"]]></title><description><![CDATA[
<p>Docling is a great project, happy to see more people building in the space.<p>Marker output will be higher quality than docling output across most doc types, especially with the --use_llm flag.  A few specific things we do differently:<p><pre><code>  - We have hybrid mode with gemini that merges tables across pages, improves quality on forms, etc.
  - We run an ordering model, so reading order is better for docs where the PDF order is bad
  - OCR is a lot better; we train our own model, surya - https://github.com/VikParuchuri/surya
  - References and links
  - Better equation conversion (soon including inline)</code></pre></p>
]]></description><pubDate>Thu, 06 Feb 2025 16:34:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=42964025</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=42964025</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42964025</guid></item><item><title><![CDATA[New comment by vikp in "Ingesting PDFs and why Gemini 2.0 changes everything"]]></title><description><![CDATA[
<p>Marker (<a href="https://www.github.com/VikParuchuri/marker">https://www.github.com/VikParuchuri/marker</a>) works kind of like this.  It uses a layout model to identify blocks and processes each one separately.  The internal format is a tree of blocks, which have arbitrary fields, but can all render to html.  It can write out to json, html, or markdown.<p>I recently integrated gemini to improve accuracy on certain blocks like tables (get the initial text, then pass it to gemini to refine).  Marker alone works about as well as gemini alone, but together they benchmark much better.</p>
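A rough sketch of what such a block tree could look like (illustrative class and field names, not marker's actual internals - just the pattern of typed blocks that each render their subtree to HTML):

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    # Each layout block has a type, arbitrary per-type content, and children;
    # every block knows how to render itself (and its subtree) to HTML.
    block_type: str
    html: str = ""
    children: list["Block"] = field(default_factory=list)

    def render_html(self) -> str:
        inner = self.html + "".join(c.render_html() for c in self.children)
        return f'<div class="{self.block_type}">{inner}</div>'

page = Block("page", children=[
    Block("section_header", "<h1>Results</h1>"),
    Block("table", "<table><tr><td>4.41</td></tr></table>"),
])
print(page.render_html())
```

Markdown or JSON output would then be separate render passes over the same tree.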
]]></description><pubDate>Wed, 05 Feb 2025 23:01:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=42956619</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=42956619</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42956619</guid></item><item><title><![CDATA[New comment by vikp in "Parsing PDFs (and more) in Elixir using Rust"]]></title><description><![CDATA[
<p>It's on the list to build - been focusing on quality pretty heavily lately.</p>
]]></description><pubDate>Thu, 30 Jan 2025 00:08:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=42873120</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=42873120</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42873120</guid></item><item><title><![CDATA[New comment by vikp in "Parsing PDFs (and more) in Elixir using Rust"]]></title><description><![CDATA[
<p>Hey, I'm the author of marker - thanks for sharing.  Most of the processing time is model inference right now. I've been retraining some models lately onto new architectures to improve speed (layout, tables, LaTeX OCR).<p>We recently integrated gemini flash (via the --use_llm flag), which maybe moves us towards the "hybrid system" you mentioned.  Hoping to add support for other APIs soon, but focusing on improving quality/speed now.<p>Happy to chat if anyone wants to talk about the difficulties of parsing PDFs, or has feedback - email in profile.</p>
]]></description><pubDate>Wed, 29 Jan 2025 22:51:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=42872335</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=42872335</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42872335</guid></item><item><title><![CDATA[New comment by vikp in "Ask HN: Who is hiring? (January 2025)"]]></title><description><![CDATA[
<p>Datalab | NYC | Full-time | Software Engineer and Head of Business Ops | $250k-$350k  + 1.5-3% equity | <a href="https://www.datalab.to" rel="nofollow">https://www.datalab.to</a><p>A significant % of useful data is locked away in tough-to-parse formats like PDFs.  We build tools to extract it, like <a href="https://github.com/VikParuchuri/surya">https://github.com/VikParuchuri/surya</a> (15k Github stars), and <a href="https://github.com/VikParuchuri/marker">https://github.com/VikParuchuri/marker</a> (19k stars).  We also run an inference API and product.<p>We do meaningful research (we’ve trained several SoTA models), ship product, and contribute to open source.  We’re hiring for 2 roles to help us scale:<p>Senior fullstack software engineer<p><pre><code>  - work across our open source repos, inference api, and frontend product
  - interact with our user community
  - you’ll need to be pragmatic and embrace “boring” technology.  Our stack is fastapi, pytorch, htmx, postgres, and redis.  We deploy to Render and do inference with serverless GPUs.
  - requires having built something impressive, ideally an open source project
</code></pre>
Head of business operations<p><pre><code>  - first non-technical hire
  - work across multiple areas, including finance, hiring, and sales
  - you’ll need to be extremely organized and able to get a lot done
  - requires experience leading operations at an early-stage company
</code></pre>
Email careers@datalab.to if you’re interested - include a link to something you’ve built if possible.  You can also read more here - <a href="https://datalab-to.notion.site/careers-jan" rel="nofollow">https://datalab-to.notion.site/careers-jan</a> .</p>
]]></description><pubDate>Thu, 02 Jan 2025 17:55:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=42576773</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=42576773</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42576773</guid></item><item><title><![CDATA[New comment by vikp in "Show HN: LLM-aided OCR – Correcting Tesseract OCR errors with LLMs"]]></title><description><![CDATA[
<p>Hi, I'm the author of surya (<a href="https://github.com/VikParuchuri/surya">https://github.com/VikParuchuri/surya</a>) - working on improving speed and accuracy now.  Happy to collaborate if you have specific page types it's not working on.  For modern/clean documents it benchmarks very similarly to Google Cloud, and I'm now focused on supporting older documents better.</p>
]]></description><pubDate>Fri, 09 Aug 2024 19:05:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=41204485</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=41204485</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41204485</guid></item><item><title><![CDATA[New comment by vikp in "Ask HN: What are you using to parse PDFs for RAG?"]]></title><description><![CDATA[
<p>Hi, I'm the author of marker - <a href="https://github.com/VikParuchuri/marker">https://github.com/VikParuchuri/marker</a> - from my testing, marker handles almost all the issues you mentioned.  The biggest issue (that I'm working on fixing right now) is formatting tables properly.</p>
]]></description><pubDate>Tue, 30 Jul 2024 23:05:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=41114991</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=41114991</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41114991</guid></item><item><title><![CDATA[New comment by vikp in "Ask HN: What are you using to parse PDFs for RAG?"]]></title><description><![CDATA[
<p>Working on improving tables soon (I'm the author of marker)</p>
]]></description><pubDate>Tue, 30 Jul 2024 22:58:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=41114955</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=41114955</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41114955</guid></item><item><title><![CDATA[New comment by vikp in "Show HN: Tarsier – Vision utilities for web interaction agents"]]></title><description><![CDATA[
<p>This isn't specifically tuned for tables (more for general pdf to markdown), but it's worked for some people with similar use-cases - <a href="https://github.com/VikParuchuri/marker">https://github.com/VikParuchuri/marker</a></p>
]]></description><pubDate>Wed, 15 May 2024 18:46:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=40370795</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=40370795</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40370795</guid></item><item><title><![CDATA[New comment by vikp in "I built an online PDF management platform using open-source software"]]></title><description><![CDATA[
<p>For PDF to markdown, I recently released V2 of my tool marker - <a href="https://github.com/vikparuchuri/marker">https://github.com/vikparuchuri/marker</a></p>
]]></description><pubDate>Mon, 13 May 2024 02:45:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=40339360</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=40339360</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40339360</guid></item><item><title><![CDATA[New comment by vikp in "Show HN: Beyond text splitting – improved file parsing for LLMs"]]></title><description><![CDATA[
<p>It should be possible to call a GPL library in a separate process (surya can batch process from the CLI) and avoid GPL obligations for your own code - ocrmypdf does this with ghostscript.</p>
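A minimal sketch of that separate-process pattern (the JSON-emitting child process here is a stand-in, not surya's actual CLI - check its docs for the real interface):

```python
import json
import subprocess
import sys

def run_tool_subprocess(cmd):
    # Run the (GPL-licensed) tool as its own process and parse its stdout.
    # Only a process boundary is crossed - the same pattern ocrmypdf uses
    # with ghostscript.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)

# Stand-in for a real OCR CLI: a child process that prints JSON to stdout.
demo_cmd = [sys.executable, "-c",
            "import json; print(json.dumps({'text': 'hello world'}))"]
print(run_tool_subprocess(demo_cmd))
```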
]]></description><pubDate>Thu, 11 Apr 2024 15:49:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=40003523</link><dc:creator>vikp</dc:creator><comments>https://news.ycombinator.com/item?id=40003523</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40003523</guid></item></channel></rss>