Hacker News: ritvikpandey21

New comment by ritvikpandey21 in "[dead]"

ritvikpandey21 — Thu, 23 Apr 2026 17:01:54 +0000

We've been building table extraction at Pulse and evaluated four benchmarks: OmniDocBench, SCORE-Bench, ParseBench, and RD-TableBench. None of them fully reflect the enterprise document workflows we've encountered in production.

TEDS (OmniDocBench) penalizes HTML formatting differences that don't affect the actual table, so the same 3x3 grid scores differently depending on whether headers use <th> vs <td>, and the benchmark only covers English and Chinese plus a small mixed category.
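To illustrate the formatting point, here is a minimal sketch that parses two HTML encodings of the same table into a plain grid. It assumes the difference in question is header cells written as <th> versus <td>. The GridParser and to_grid names are hypothetical, and this is not TEDS or any benchmark's scoring code. It ignores rowspan and colspan.

```python
from html.parser import HTMLParser


class GridParser(HTMLParser):
    """Collect cell text row by row, treating <th> and <td> identically."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("th", "td"):
            self.rows[-1].append("")      # start a new cell in the current row
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def to_grid(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = GridParser()
    parser.feed(html)
    return parser.rows


with_th = "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>"
with_td = "<table><tr><td>a</td><td>b</td></tr><tr><td>1</td><td>2</td></tr></table>"

# The markup strings differ, but the parsed grids are the same.
print(with_th == with_td)                    # False
print(to_grid(with_th) == to_grid(with_td))  # True
```

A metric computed on the parsed grid would not see the tag difference, while one computed on the HTML tree can.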

SCORE-Bench's spatial tolerance parameter can mask real failures, because if you drop a header row and shift all data up by one with delta=1, the benchmark reports high accuracy even though the column labels are gone.
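A toy sketch of how a positional tolerance can hide a dropped header row. This is not SCORE-Bench's actual metric: tolerant_cell_recall is a hypothetical function that counts a ground-truth cell as found if the same text appears within delta rows and delta columns in the prediction.

```python
def tolerant_cell_recall(truth, pred, delta):
    """Fraction of ground-truth cells whose text appears in the prediction
    within `delta` rows and `delta` columns of the true position."""
    hits = 0
    total = 0
    for r, row in enumerate(truth):
        for c, text in enumerate(row):
            total += 1
            found = any(
                pred[pr][pc] == text
                for pr in range(max(0, r - delta), min(len(pred), r + delta + 1))
                for pc in range(max(0, c - delta), min(len(pred[pr]), c + delta + 1))
            )
            hits += found
    return hits / total


# Ground truth: one header row plus nine data rows with unique cell texts.
truth = [["Name", "Qty", "Price"]] + [
    [f"r{i}c{j}" for j in range(3)] for i in range(9)
]
# Prediction: header dropped, so every data row sits one row too high.
pred = truth[1:]

strict = tolerant_cell_recall(truth, pred, delta=0)
lenient = tolerant_cell_recall(truth, pred, delta=1)
print(strict, lenient)
```

In this toy setup the strict score is 0 while the delta=1 score is 27 of 30 cells, even though all three column labels are missing from the prediction.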

ParseBench generates its ground truth with frontier VLMs (Claude Opus for tables), which introduces hallucination risk, and its TableRecordMatch metric treats tables as unordered bags of key-value records, so it doesn't penalize column transposition or row reordering. The table set is also 503 pages, English-only, with over half from a single source.
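A rough sketch of why a bag-of-records view cannot see reordering. This is not ParseBench's TableRecordMatch implementation, only an illustration of the idea as described above. The as_record_bag helper is hypothetical.

```python
from collections import Counter


def as_record_bag(header, rows):
    """Represent a table as an unordered multiset of records, where each
    record is an unordered set of (column name, value) pairs."""
    return Counter(frozenset(zip(header, row)) for row in rows)


header = ["name", "qty"]
rows = [["apple", "3"], ["pear", "7"]]

# Same content with the rows in reverse order.
rows_reordered = rows[::-1]

# Same content with the two columns swapped.
header_swapped = ["qty", "name"]
rows_swapped = [["3", "apple"], ["7", "pear"]]

original = as_record_bag(header, rows)
print(original == as_record_bag(header, rows_reordered))        # True
print(original == as_record_bag(header_swapped, rows_swapped))  # True
print((header, rows) == (header_swapped, rows_swapped))         # False
```

All three tables collapse to the same bag, so a score built on that representation alone gives them identical marks even though the grids differ.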

RD-TableBench linearizes tables into 1D sequences, losing horizontal vs vertical adjacency. The RD-TableBench ground truth audit is what concerned us most. We went through all 1,000 ground truth files against the source images, and the errors consisted of scrambled text and wrong structure, garbled OCR on CJK and Arabic, and buffer artifacts where random digit sequences got appended to real numeric values. Dozens of ground truth files are byte-for-byte identical to one provider's output, and in a subset of the error cases the ground truth and that provider share the exact same error (same wrong word order in headers, same watermark text pulled into cells, same garbled CJK characters) while independent providers don't produce those errors.
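Two small sketches of the points above, both under stated assumptions. The first assumes a plain row-major flattening with no row delimiters, which may not match RD-TableBench's actual linearization, and shows how two different grid shapes become indistinguishable. The second shows one way a byte-for-byte identity check could be run over file contents. The linearize and byte_identical names and the sample data are hypothetical.

```python
import hashlib


def linearize(grid):
    """Flatten a grid row by row into a 1D sequence of cell texts."""
    return [cell for row in grid for cell in row]


wide = [["a", "b", "c"], ["d", "e", "f"]]    # 2 rows x 3 columns
tall = [["a", "b"], ["c", "d"], ["e", "f"]]  # 3 rows x 2 columns

# Different grids, identical 1D sequences: "c" is right of "b" in one
# table and below "a" in the other, and the sequence cannot tell.
print(wide == tall)                        # False
print(linearize(wide) == linearize(tall))  # True


def byte_identical(truth_files, provider_files):
    """Return sorted names whose ground-truth bytes equal the provider's bytes.

    Both arguments map a file name to that file's raw bytes."""
    return sorted(
        name
        for name, data in truth_files.items()
        if name in provider_files
        and hashlib.sha256(data).digest()
        == hashlib.sha256(provider_files[name]).digest()
    )


# Made-up sample contents, for illustration only.
truth_files = {"t1.html": b"<table>x</table>", "t2.html": b"<table>y</table>"}
provider_files = {"t1.html": b"<table>x</table>", "t2.html": b"<table>z</table>"}
print(byte_identical(truth_files, provider_files))  # ['t1.html']
```

An identity check like this only flags exact copies. Shared errors in files that are not byte-identical still need a manual comparison against the source images.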

This also motivated us to build PulseBench-Tab, a benchmark of 1,820 human-annotated tables across 9 languages and 4 scripts, with graph-based evaluation via T-LAG that operates on the parsed grid rather than the DOM tree, and fully open ground truth, scoring code, and provider outputs. Arabic and Korean both show 75+ point spreads across providers, and everything is available on HuggingFace and GitHub.

PulseBench-Tab: Open-source, multilingual benchmark for table extraction

ritvikpandey21 — Wed, 22 Apr 2026 17:40:58 +0000

Article URL: https://www.runpulse.com/blog/pulsebench-tab

Comments URL: https://news.ycombinator.com/item?id=47866776

Points: 5

# Comments: 1

New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"

ritvikpandey21 — Thu, 18 Dec 2025 19:54:45 +0000

Results look pretty good (with the exception of one very faint page) - check it out here! https://platform.runpulse.com/dashboard/extractions/public/f...

New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"

ritvikpandey21 — Thu, 18 Dec 2025 18:23:04 +0000

thanks! we benchmark against all the major players (azure doc intelligence, aws textract, google doc ai, frontier llms, etc). we have some public news coming out soon on this front, but we have a very rigorous dataset using both public and synthetic data focusing on the hardest problems in the space (handwriting, tables, etc).

New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"

ritvikpandey21 — Thu, 18 Dec 2025 18:20:52 +0000

yeah models are definitely improving, but we've found even the latest ones still hallucinate and infer text rather than doing pure transcription. we carry out very rigorous benchmarks against all of the frontier models. we think the differentiation is in accuracy on truly messy docs (nested tables, degraded scans, handwriting) and being able to deploy on-prem/vpc for regulated industries.

New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"

ritvikpandey21 — Thu, 18 Dec 2025 18:20:24 +0000

New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"

ritvikpandey21 — Thu, 18 Dec 2025 18:18:19 +0000

thanks for the flag! have pointed this out and will be pushing an update here shortly

New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"

ritvikpandey21 — Thu, 18 Dec 2025 17:28:40 +0000

we disagree! we've found llms by themselves aren't enough and suffer from pretty big failure modes like hallucination and inferring text rather than pure transcription. we wrote a blog about this [1]. the right approach so far seems to be a hybrid workflow that uses very specific parts of the language model architecture.

[1] https://www.runpulse.com/blog/why-llms-suck-at-ocr

New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"

ritvikpandey21 — Thu, 18 Dec 2025 17:26:01 +0000

thanks! appreciate the kind words

New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"

ritvikpandey21 — Thu, 18 Dec 2025 17:25:50 +0000

our team has tested docling pretty extensively, works well for simpler text-heavy docs without complex layouts, but the moment you introduce tables or multi-column stuff it doesn't maintain layout well.

New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"

ritvikpandey21 — Thu, 18 Dec 2025 17:23:37 +0000

we're more focused on the core extraction layer itself rather than workflow tooling. we train our own vision models for layout detection, ocr, and table parsing from scratch. the key thing for us is determinism and auditability, so outputs are reproducible run over run, which matters a lot for regulated enterprises.

New comment by ritvikpandey21 in "[dead]"

ritvikpandey21 — Mon, 20 Oct 2025 19:51:54 +0000

DeepSeek AI just released DeepSeek-OCR, a new open-source model that aims to rethink text extraction through what it calls Context Optical Compression. The launch quickly caught attention on X and GitHub, with many celebrating another big step in open document AI.

At Pulse, we were curious how it performs on the kinds of messy, high-density documents that power real business workflows. So we ran DeepSeek-OCR through our standard evaluation suite: multi-page PDFs, handwritten forms, nested tables, and scanned statements. The results were promising in theory but inconsistent in practice.

New comment by ritvikpandey21 in "AInertia – The Adoption Problem Outside the Bubble"

ritvikpandey21 — Thu, 04 Sep 2025 14:34:29 +0000

interesting read

New comment by ritvikpandey21 in "[dead]"

ritvikpandey21 — Tue, 12 Aug 2025 15:36:24 +0000

We processed hundreds of millions of pages and found that a single accuracy metric is misleading. A model that's 98% accurate on 1,000 pages with 200 data elements each still produces 4,000 incorrect values. The real killers are broken reading order in multi-column layouts, shifted table columns, and lost cross-page context that silently corrupt datasets without throwing errors.
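The arithmetic behind the 4,000 figure, as a short sketch. The expected_errors helper is hypothetical and uses integer percentages to avoid floating-point rounding.

```python
def expected_errors(pages, elements_per_page, accuracy_percent):
    """Number of incorrect values at a given element-level accuracy."""
    total_values = pages * elements_per_page
    return total_values * (100 - accuracy_percent) // 100


# 1,000 pages x 200 elements = 200,000 values; 2% of those are wrong.
print(expected_errors(1_000, 200, 98))  # 4000
```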

New comment by ritvikpandey21 in "[dead]"

ritvikpandey21 — Tue, 24 Jun 2025 14:52:50 +0000

We evaluated ByteDance's Dolphin document parsing model on enterprise document processing tasks using standardized benchmarks and real-world document sets. Our testing dataset included 847 financial documents, 312 legal forms, and 156 academic research publications to assess performance across critical enterprise use cases.

New comment by ritvikpandey21 in "[dead]"

ritvikpandey21 — Tue, 27 May 2025 13:13:14 +0000

After processing nearly 500 million pages of enterprise documents, we've discovered that the biggest challenge in document AI isn't character recognition or table extraction. It's something far more fundamental: understanding how information flows across page breaks, column boundaries, and interrupted sections.

Why Semantic Understanding Breaks at Page Boundaries

ritvikpandey21 — Tue, 27 May 2025 13:13:13 +0000

Article URL: https://www.runpulse.com/blog/the-document-continuity-problem

Comments URL: https://news.ycombinator.com/item?id=44106650

Points: 2

# Comments: 1

Legacy OCR Tools Are Failing the Legal Industry: Here's Why

ritvikpandey21 — Mon, 17 Mar 2025 16:07:48 +0000

Article URL: https://www.runpulse.com/blog/legacy-ocr-tools-are-failing-the-legal-industry-heres-why

Comments URL: https://news.ycombinator.com/item?id=43389976

Points: 2

# Comments: 0

New comment by ritvikpandey21 in "Launch HN: Sift Dev (YC W25) – AI-Powered Datadog Alternative"

ritvikpandey21 — Wed, 12 Mar 2025 03:06:26 +0000

curious how LLM hallucinations will work on logging info - gonna be a hard problem to solve

New comment by ritvikpandey21 in "Mistral OCR"

ritvikpandey21 — Fri, 07 Mar 2025 00:11:43 +0000

as builders in this space, we decided to put it to the test on complex nested tables, pie charts, etc. to see if the same VLM hallucination issues persist, and to what degree. while results were promising, we found several critical failure modes across two document domains.

check out our blog post here! https://www.runpulse.com/blog/beyond-the-hype-real-world-tes...