<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: ritvikpandey21</title><link>https://news.ycombinator.com/user?id=ritvikpandey21</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 22 Jun 2026 02:26:58 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=ritvikpandey21" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by ritvikpandey21 in "[dead]"]]></title><description><![CDATA[
<p>We've been building table extraction at Pulse and evaluated four benchmarks: OmniDocBench, SCORE-Bench, ParseBench, and RD-TableBench. None of them fully reflects the enterprise document workflows we've encountered in production.<p>TEDS (OmniDocBench) penalizes HTML formatting differences that don't affect the actual table, so the same 3x3 grid scores differently depending on whether headers use &lt;thead&gt; vs &lt;tr&gt;, and the benchmark only covers English and Chinese plus a small mixed category.<p>SCORE-Bench's spatial tolerance parameter can mask real failures, because if you drop a header row and shift all data up by one with delta=1, the benchmark reports high accuracy even though the column labels are gone.<p>ParseBench generates its ground truth with frontier VLMs (Claude Opus for tables), which introduces hallucination risk, and its TableRecordMatch metric treats tables as unordered bags of key-value records, so it doesn't penalize column transposition or row reordering. The table set is also 503 pages, English-only, with over half from a single source.<p>RD-TableBench linearizes tables into 1D sequences, losing horizontal vs vertical adjacency.
The RD-TableBench ground truth audit is what concerned us most. We checked all 1,000 ground truth files against the source images, and the errors consisted of scrambled text and wrong structure, garbled OCR on CJK and Arabic, and buffer artifacts where random digit sequences got appended to real numeric values. Dozens of ground truth files are byte-for-byte identical to one provider's output, and in a subset of the error cases the ground truth and that provider share the exact same error (same wrong word order in headers, same watermark text pulled into cells, same garbled CJK characters) while independent providers don't produce those errors.<p>This also motivated us to build PulseBench-Tab, a benchmark of 1,820 human-annotated tables across 9 languages and 4 scripts, with graph-based evaluation via T-LAG that operates on the parsed grid rather than the DOM tree, and fully open ground truth, scoring code, and provider outputs. Arabic and Korean both show 75+ point spreads across providers, and everything is available on HuggingFace and GitHub.</p>
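<p>The SCORE-Bench failure described above can be reproduced with a toy scorer. This is a minimal sketch, not SCORE-Bench's actual metric: the function match_rate, the sample table, and the rule "a cell matches if the same text appears in the same column within delta rows" are all assumptions made for illustration.</p>

```python
def match_rate(gt, pred, delta):
    """Toy metric (hypothetical): fraction of ground-truth cells whose text
    appears in the same column of pred within +/- delta rows."""
    total = 0
    matched = 0
    for r, row in enumerate(gt):
        for c, text in enumerate(row):
            total += 1
            # Look for the same text in column c, up to delta rows away.
            for pr in range(r - delta, r + delta + 1):
                in_bounds = 0 <= pr < len(pred) and c < len(pred[pr])
                if in_bounds and pred[pr][c] == text:
                    matched += 1
                    break
    return matched / total


# Ground truth: one header row plus three data rows (made-up values).
gt = [
    ["Region", "Q1", "Q2"],
    ["North", "10", "20"],
    ["South", "30", "40"],
    ["West", "50", "60"],
]

# Prediction that dropped the header row, shifting all data up by one.
pred = gt[1:]

strict = match_rate(gt, pred, delta=0)
tolerant = match_rate(gt, pred, delta=1)
print(strict, tolerant)
```

<p>With delta=0 the shifted prediction matches nothing, while with delta=1 every data cell matches and only the header row counts as missing. In this toy model a table with N data rows scores N/(N+1), so longer tables hide the lost column labels almost completely.</p>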
]]></description><pubDate>Thu, 23 Apr 2026 17:01:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=47878223</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=47878223</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47878223</guid></item><item><title><![CDATA[PulseBench-Tab: Open-source, multilingual benchmark for table extraction]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.runpulse.com/blog/pulsebench-tab">https://www.runpulse.com/blog/pulsebench-tab</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47866776">https://news.ycombinator.com/item?id=47866776</a></p>
<p>Points: 5</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 22 Apr 2026 17:40:58 +0000</pubDate><link>https://www.runpulse.com/blog/pulsebench-tab</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=47866776</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47866776</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"]]></title><description><![CDATA[
<p>Results look pretty good (with the exception of one very faint page) - check it out here! <a href="https://platform.runpulse.com/dashboard/extractions/public/f51c25bf-3c2a-4176-9a6d-f8381e82ae09">https://platform.runpulse.com/dashboard/extractions/public/f...</a></p>
]]></description><pubDate>Thu, 18 Dec 2025 19:54:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=46317749</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=46317749</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46317749</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"]]></title><description><![CDATA[
<p>thanks! we benchmark against all the major players (azure doc intelligence, aws textract, google doc ai, frontier llms, etc). we have some public news coming out soon on this front, but we have a very rigorous dataset using both public and synthetic data focusing on the hardest problems in the space (handwriting, tables, etc).</p>
]]></description><pubDate>Thu, 18 Dec 2025 18:23:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=46316481</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=46316481</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46316481</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"]]></title><description><![CDATA[
<p>yeah models are definitely improving, but we've found even the latest ones still hallucinate and infer text rather than doing pure transcription. we carry out very rigorous benchmarks against all of the frontier models. we think the differentiation is in accuracy on truly messy docs (nested tables, degraded scans, handwriting) and being able to deploy on-prem/vpc for regulated industries.</p>
]]></description><pubDate>Thu, 18 Dec 2025 18:20:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=46316449</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=46316449</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46316449</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"]]></title><description><![CDATA[
<p>yeah models are definitely improving, but we've found even the latest ones still hallucinate and infer text rather than doing pure transcription. we carry out very rigorous benchmarks against all of the frontier models. we think the differentiation is in accuracy on truly messy docs (nested tables, degraded scans, handwriting) and being able to deploy on-prem/vpc for regulated industries.</p>
]]></description><pubDate>Thu, 18 Dec 2025 18:20:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=46316439</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=46316439</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46316439</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"]]></title><description><![CDATA[
<p>thanks for the flag! have pointed this out and will be pushing an update here shortly</p>
]]></description><pubDate>Thu, 18 Dec 2025 18:18:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=46316404</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=46316404</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46316404</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"]]></title><description><![CDATA[
<p>we disagree! we've found llms by themselves aren't enough and suffer from pretty big failure modes like hallucination and inferring text rather than pure transcription. we wrote a blog post about this [1]. the right approach so far seems to be a hybrid workflow that uses very specific parts of the language model architecture.<p>[1] <a href="https://www.runpulse.com/blog/why-llms-suck-at-ocr">https://www.runpulse.com/blog/why-llms-suck-at-ocr</a></p>
]]></description><pubDate>Thu, 18 Dec 2025 17:28:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=46315769</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=46315769</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46315769</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"]]></title><description><![CDATA[
<p>thanks! appreciate the kind words</p>
]]></description><pubDate>Thu, 18 Dec 2025 17:26:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=46315730</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=46315730</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46315730</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"]]></title><description><![CDATA[
<p>our team has tested docling pretty extensively. it works well for simpler text-heavy docs without complex layouts, but the moment you introduce tables or multi-column stuff it doesn't maintain layout well.</p>
]]></description><pubDate>Thu, 18 Dec 2025 17:25:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=46315728</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=46315728</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46315728</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction"]]></title><description><![CDATA[
<p>we're more focused on the core extraction layer itself rather than workflow tooling. we train our own vision models for layout detection, ocr, and table parsing from scratch. the key thing for us is determinism and auditability, so outputs are reproducible run over run, which matters a lot for regulated enterprises.</p>
]]></description><pubDate>Thu, 18 Dec 2025 17:23:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=46315704</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=46315704</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46315704</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "[dead]"]]></title><description><![CDATA[
<p>DeepSeek AI just released DeepSeek-OCR, a new open-source model that aims to rethink text extraction through what it calls Context Optical Compression. The launch quickly caught attention on X and GitHub, with many celebrating another big step in open document AI.<p>At Pulse, we were curious how it performs on the kinds of messy, high-density documents that power real business workflows. So we ran DeepSeek-OCR through our standard evaluation suite: multi-page PDFs, handwritten forms, nested tables, and scanned statements. The results were promising in theory but inconsistent in practice.</p>
]]></description><pubDate>Mon, 20 Oct 2025 19:51:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=45648426</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=45648426</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45648426</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "AInertia – The Adoption Problem Outside the Bubble"]]></title><description><![CDATA[
<p>interesting read</p>
]]></description><pubDate>Thu, 04 Sep 2025 14:34:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=45127745</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=45127745</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45127745</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "[dead]"]]></title><description><![CDATA[
<p>We processed hundreds of millions of pages and found that a single accuracy metric is misleading. A model that's 98% accurate on 1,000 pages with 200 data elements each still produces 4,000 incorrect values. The real killers are broken reading order in multi-column layouts, shifted table columns, and lost cross-page context that silently corrupt datasets without throwing errors.</p>
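<p>The arithmetic behind that figure, as a quick sketch using only the numbers quoted above (the variable names are illustrative):</p>

```python
pages = 1_000
elements_per_page = 200
accuracy_pct = 98  # headline accuracy, in percent

# Total extracted values across the whole batch.
total_elements = pages * elements_per_page

# Integer arithmetic avoids floating-point rounding in the error count.
incorrect_values = total_elements * (100 - accuracy_pct) // 100

print(total_elements, incorrect_values)
```

<p>A 2% error rate spread over 200,000 values leaves 4,000 wrong values, an average of four per page.</p>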
]]></description><pubDate>Tue, 12 Aug 2025 15:36:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=44877687</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=44877687</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44877687</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "[dead]"]]></title><description><![CDATA[
<p>We evaluated ByteDance's Dolphin document parsing model on enterprise document processing tasks using standardized benchmarks and real-world document sets. Our testing dataset included 847 financial documents, 312 legal forms, and 156 academic research publications to assess performance across critical enterprise use cases.</p>
]]></description><pubDate>Tue, 24 Jun 2025 14:52:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=44366915</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=44366915</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44366915</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "[dead]"]]></title><description><![CDATA[
<p>After processing nearly 500 million pages of enterprise documents, we've discovered that the biggest challenge in document AI isn't character recognition or table extraction. It's something far more fundamental: understanding how information flows across page breaks, column boundaries, and interrupted sections.</p>
]]></description><pubDate>Tue, 27 May 2025 13:13:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=44106651</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=44106651</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44106651</guid></item><item><title><![CDATA[Why Semantic Understanding Breaks at Page Boundaries]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.runpulse.com/blog/the-document-continuity-problem">https://www.runpulse.com/blog/the-document-continuity-problem</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44106650">https://news.ycombinator.com/item?id=44106650</a></p>
<p>Points: 2</p>
<p># Comments: 1</p>
]]></description><pubDate>Tue, 27 May 2025 13:13:13 +0000</pubDate><link>https://www.runpulse.com/blog/the-document-continuity-problem</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=44106650</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44106650</guid></item><item><title><![CDATA[Legacy OCR Tools Are Failing the Legal Industry: Here's Why]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.runpulse.com/blog/legacy-ocr-tools-are-failing-the-legal-industry-heres-why">https://www.runpulse.com/blog/legacy-ocr-tools-are-failing-the-legal-industry-heres-why</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43389976">https://news.ycombinator.com/item?id=43389976</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Mon, 17 Mar 2025 16:07:48 +0000</pubDate><link>https://www.runpulse.com/blog/legacy-ocr-tools-are-failing-the-legal-industry-heres-why</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=43389976</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43389976</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "Launch HN: Sift Dev (YC W25) – AI-Powered Datadog Alternative"]]></title><description><![CDATA[
<p>curious how LLM hallucinations will work on logging info - gonna be a hard problem to solve</p>
]]></description><pubDate>Wed, 12 Mar 2025 03:06:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=43339568</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=43339568</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43339568</guid></item><item><title><![CDATA[New comment by ritvikpandey21 in "Mistral OCR"]]></title><description><![CDATA[
<p>as builders in this space, we decided to put it to the test on complex nested tables, pie charts, etc. to see if the same VLM hallucination issues persist, and to what degree. while results were promising, we found several critical failure modes across two document domains.<p>check out our blog post here! <a href="https://www.runpulse.com/blog/beyond-the-hype-real-world-tests-of-mistrals-ocr">https://www.runpulse.com/blog/beyond-the-hype-real-world-tes...</a></p>
]]></description><pubDate>Fri, 07 Mar 2025 00:11:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=43286351</link><dc:creator>ritvikpandey21</dc:creator><comments>https://news.ycombinator.com/item?id=43286351</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43286351</guid></item></channel></rss>