<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: EarlyOom</title><link>https://news.ycombinator.com/user?id=EarlyOom</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 19 Apr 2026 20:20:18 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=EarlyOom" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by EarlyOom in "How we solved multi-modal tool-calling in MCP agents – VLM Run MCP"]]></title><description><![CDATA[
<p>Shocking how poorly frontier models perform on simple visual tasks. Best-in-domain tool calling will become the norm.</p>
]]></description><pubDate>Wed, 02 Jul 2025 19:27:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=44447832</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=44447832</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44447832</guid></item><item><title><![CDATA[New comment by EarlyOom in "Ask HN: Who is hiring? (March 2025)"]]></title><description><![CDATA[
<p>VLM Run | Member of Technical Staff, ML Systems | Full-time | Hybrid Bay Area, CA | <a href="https://vlm.run" rel="nofollow">https://vlm.run</a> | 150k-220k / yr + Equity<p>VLM Run is a first-of-its-kind API dedicated to running Vision Language Models on Documents, Images, and Video. We’re building a stack from the bottom-up for ‘Visual’ applications of language models that we believe will make up > 90% of inference needs in the next 5 years.<p>Hybrid from Bay Area, CA<p>Looking for experience in any of the following:<p>* ML Domains: Vision Language Models, LLMs, Temporal/Video Models<p>* Model Training, Evaluation, and Versioning platforms: WnB, Huggingface<p>* Infra: Python, Pytorch, Pydantic, CUDA, Torch.compile<p>* Devops: Github CI, Docker, Conda, API Billing and Monitoring<p><a href="https://vlm-run.notion.site/vlm-run-hiring-25q1" rel="nofollow">https://vlm-run.notion.site/vlm-run-hiring-25q1</a></p>
]]></description><pubDate>Mon, 03 Mar 2025 23:02:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=43247886</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43247886</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43247886</guid></item><item><title><![CDATA[New comment by EarlyOom in "Replace OCR with Vision Language Models"]]></title><description><![CDATA[
<p>This is the main focus of VLM Run, and of typed extraction more generally. If you provide proper type constraints (e.g. with Pydantic), you can dramatically reduce the surface area for hallucination. Beyond that, there's fine-tuning on your own dataset (we're working on this) to push accuracy past what you get from an unspecialized frontier model.</p>
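<p>To make that concrete, here's a minimal sketch with plain Pydantic v2 (the Invoice model and its fields are made up for illustration, not one of our published schemas): the JSON schema derived from the model is what constrains decoding, and validation rejects anything that drifts.</p>
<pre><code>from datetime import date
from pydantic import BaseModel, Field, ValidationError

class LineItem(BaseModel):
    description: str
    quantity: int = Field(ge=0)
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    issue_date: date                               # must parse as a real date, not free text
    currency: str = Field(pattern=r"^[A-Z]{3}$")   # ISO 4217 code, e.g. "USD"
    items: list[LineItem]
    total: float

# The schema you hand to the model to constrain its output.
print(Invoice.model_json_schema())

# Stand-in for the raw JSON a VLM returned for an invoice image.
raw = (
    '{"invoice_number": "INV-42", "issue_date": "2025-02-26", "currency": "USD",'
    ' "items": [{"description": "Widget", "quantity": 3, "unit_price": 9.99}], "total": 29.97}'
)

try:
    invoice = Invoice.model_validate_json(raw)     # output that drifts from the schema fails loudly here
    print(invoice.total)
except ValidationError as err:
    print(err)
</code></pre>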
]]></description><pubDate>Wed, 26 Feb 2025 22:28:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=43188999</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43188999</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43188999</guid></item><item><title><![CDATA[New comment by EarlyOom in "Replace OCR with Vision Language Models"]]></title><description><![CDATA[
<p>You can try out some of our schemas with Ollama if you want: <a href="https://github.com/vlm-run/vlmrun-hub">https://github.com/vlm-run/vlmrun-hub</a> (instructions in the README)</p>
]]></description><pubDate>Wed, 26 Feb 2025 21:32:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=43188474</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43188474</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43188474</guid></item><item><title><![CDATA[New comment by EarlyOom in "Replace OCR with Vision Language Models"]]></title><description><![CDATA[
<p>VLMs are able to take context into account when filling in fields, following either a global or field-specific prompt. This is great for things like unlabeled axes, or checking a legend for the unit that should be appended to a number. You also catch lots of really simple errors with type hints (e.g. dates, addresses, country codes).</p>
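<p>A rough sketch of what those field-level hints look like in a schema (plain Pydantic v2; the model and field names are hypothetical): the per-field description acts as the field-specific prompt and travels inside the generated JSON schema, while the type annotations catch the simple stuff.</p>
<pre><code>from datetime import date
from pydantic import BaseModel, Field

class ChartReading(BaseModel):
    # Field descriptions act as field-specific prompts: they are embedded in the
    # JSON schema the model sees, so hints like "check the legend" ride along.
    series: str = Field(description="Series name; if the axis is unlabeled, infer it from the legend or title")
    value: float = Field(description="Numeric value with the unit from the legend applied")
    recorded_on: date                              # malformed dates fail to parse instead of slipping through
    country: str = Field(pattern=r"^[A-Z]{2}$",
                         description="ISO 3166-1 alpha-2 country code")

# Both the description and the pattern show up in the schema handed to the VLM.
print(ChartReading.model_json_schema()["properties"]["country"])
</code></pre>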
]]></description><pubDate>Wed, 26 Feb 2025 21:21:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=43188372</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43188372</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43188372</guid></item><item><title><![CDATA[New comment by EarlyOom in "Replace OCR with Vision Language Models"]]></title><description><![CDATA[
<p>You can! It works with Ollama: <a href="https://github.com/vlm-run/vlmrun-hub">https://github.com/vlm-run/vlmrun-hub</a><p>At the end of the day it's just schemas. You can decide for yourself if it's worth upgrading to a larger, more expensive model.</p>
]]></description><pubDate>Wed, 26 Feb 2025 21:19:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=43188349</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43188349</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43188349</guid></item><item><title><![CDATA[New comment by EarlyOom in "Replace OCR with Vision Language Models"]]></title><description><![CDATA[
<p>We can do bounding boxes too :) We just call it visual grounding: <a href="https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/04_visual_grounding.ipynb">https://github.com/vlm-run/vlmrun-cookbook/blob/main/noteboo...</a></p>
]]></description><pubDate>Wed, 26 Feb 2025 21:19:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=43188344</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43188344</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43188344</guid></item><item><title><![CDATA[New comment by EarlyOom in "Replace OCR with Vision Language Models"]]></title><description><![CDATA[
<p>We convert to a JSON schema, but it would be trivial to convert this to YAML. There are some minor differences (e.g. in the tokens required to output JSON vs. YAML), which is why we've opted for this strategy.</p>
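<p>If you want to eyeball the token difference yourself, here's a quick sketch (assumes the tiktoken and PyYAML packages; the toy model is made up): serialize the same record both ways and count tokens.</p>
<pre><code>import json
import tiktoken          # pip install tiktoken
import yaml              # pip install pyyaml
from pydantic import BaseModel

class Receipt(BaseModel):
    merchant: str
    total: float
    currency: str

record = Receipt(merchant="Acme", total=12.50, currency="USD").model_dump()

enc = tiktoken.get_encoding("cl100k_base")
as_json = json.dumps(record)
as_yaml = yaml.safe_dump(record)

# Same content, different serialization; the token counts (and thus output cost) differ slightly.
print(len(enc.encode(as_json)), "tokens as JSON")
print(len(enc.encode(as_yaml)), "tokens as YAML")
</code></pre>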
]]></description><pubDate>Wed, 26 Feb 2025 21:18:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=43188337</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43188337</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43188337</guid></item><item><title><![CDATA[Replace OCR with Vision Language Models]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/01_schema_showcase.ipynb">https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/01_schema_showcase.ipynb</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43187209">https://news.ycombinator.com/item?id=43187209</a></p>
<p>Points: 292</p>
<p># Comments: 125</p>
]]></description><pubDate>Wed, 26 Feb 2025 19:29:37 +0000</pubDate><link>https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/01_schema_showcase.ipynb</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43187209</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43187209</guid></item><item><title><![CDATA[New comment by EarlyOom in "Show HN: Benchmarking VLMs vs. Traditional OCR"]]></title><description><![CDATA[
<p>OCR seems to be mostly solved for 'normal' text laid out according to Latin alphabet norms (left to right, normal spacing, etc.), but I'd love to see more adversarial examples. We've seen lots of regressions around faxed or scanned documents where the text boxes may be slightly rotated (e.g. <a href="https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-at-once-to-readable-orientation/" rel="nofollow">https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-...</a>), not to mention handwriting and poorly scanned docs. Then there's contextually dependent information, like X-axis labels that are only implicit from a legend somewhere, so it's not clear what the numbers refer to even with the bounding boxes. This is where VLMs really shine: they can extract the text and then use similar examples from the page to map it into the output values when the bounding box doesn't provide this for free.</p>
]]></description><pubDate>Fri, 21 Feb 2025 21:38:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=43133356</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43133356</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43133356</guid></item><item><title><![CDATA[Show HN: Visually parse an entire YouTube video frame by frame]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/03_case_study_tv_news.ipynb">https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/03_case_study_tv_news.ipynb</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43133264">https://news.ycombinator.com/item?id=43133264</a></p>
<p>Points: 5</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 21 Feb 2025 21:30:13 +0000</pubDate><link>https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/03_case_study_tv_news.ipynb</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43133264</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43133264</guid></item><item><title><![CDATA[Ask HN: What are folks using to train/fine-tune Vision Language Models]]></title><description><![CDATA[

<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43133162">https://news.ycombinator.com/item?id=43133162</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 21 Feb 2025 21:22:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=43133162</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43133162</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43133162</guid></item><item><title><![CDATA[A Node.js SDK for calling Vision Language Models]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/vlm-run/vlmrun-node-sdk">https://github.com/vlm-run/vlmrun-node-sdk</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43120375">https://news.ycombinator.com/item?id=43120375</a></p>
<p>Points: 6</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 20 Feb 2025 21:22:03 +0000</pubDate><link>https://github.com/vlm-run/vlmrun-node-sdk</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43120375</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43120375</guid></item><item><title><![CDATA[New comment by EarlyOom in "Run structured extraction on documents/images locally with Ollama and Pydantic"]]></title><description><![CDATA[
<p>Would love to chat! Reach out at scott@vlm.run</p>
]]></description><pubDate>Thu, 20 Feb 2025 20:28:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=43119750</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43119750</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43119750</guid></item><item><title><![CDATA[New comment by EarlyOom in "Run structured extraction on documents/images locally with Ollama and Pydantic"]]></title><description><![CDATA[
<p>That's one of our main focuses, yes: <a href="https://docs.vlm.run/api-reference/v1/fine-tuning/post-finetuning-create#create-finetuning-job" rel="nofollow">https://docs.vlm.run/api-reference/v1/fine-tuning/post-finet...</a></p>
]]></description><pubDate>Thu, 20 Feb 2025 06:25:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=43111724</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43111724</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43111724</guid></item><item><title><![CDATA[New comment by EarlyOom in "Run structured extraction on documents/images locally with Ollama and Pydantic"]]></title><description><![CDATA[
<p>We put together an open-source collection of Pydantic schemas for a variety of document categories (W-2 filings, invoices, etc.), including instructions for how to get structured JSON responses from any visual input with the model of your choosing. Run everything locally.</p>
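<p>The local flow boils down to roughly this (a minimal sketch, assuming the ollama Python package with structured-output support and a vision-capable model already pulled; the model name, file path, and field names are illustrative, not the exact hub code):</p>
<pre><code># pip install ollama pydantic; requires a local Ollama server and a vision-capable model.
from ollama import chat
from pydantic import BaseModel

class W2(BaseModel):
    employer_name: str
    employee_name: str
    wages: float
    tax_year: int

response = chat(
    model="llama3.2-vision",                       # example model name
    messages=[{
        "role": "user",
        "content": "Extract the W-2 fields as JSON matching the schema.",
        "images": ["w2_sample.png"],               # local path to the scanned form
    }],
    format=W2.model_json_schema(),                 # constrain output to the schema
)

w2 = W2.model_validate_json(response.message.content)
print(w2)
</code></pre>
<p>Swap the schema for any of the ones in the hub and point it at your own documents.</p>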
]]></description><pubDate>Thu, 20 Feb 2025 01:54:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=43110174</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43110174</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43110174</guid></item><item><title><![CDATA[Run structured extraction on documents/images locally with Ollama and Pydantic]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/vlm-run/vlmrun-hub">https://github.com/vlm-run/vlmrun-hub</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43110173">https://news.ycombinator.com/item?id=43110173</a></p>
<p>Points: 170</p>
<p># Comments: 29</p>
]]></description><pubDate>Thu, 20 Feb 2025 01:54:10 +0000</pubDate><link>https://github.com/vlm-run/vlmrun-hub</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43110173</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43110173</guid></item><item><title><![CDATA[New comment by EarlyOom in "Ask HN: Who is hiring? (February 2025)"]]></title><description><![CDATA[
<p>VLM Run | Member of Technical Staff, ML Systems, Developer Relations | Full-time | Bay Area, CA | <a href="https://vlm.run" rel="nofollow">https://vlm.run</a> | 150k-220k / yr + Equity<p>VLM Run is a first-of-its-kind API dedicated to running Vision Language Models on Documents, Images, and Video. We’re building a stack from the bottom-up for ‘Visual’ applications of language models that we believe will make up > 90% of inference needs in the next 5 years.<p>Hybrid from Bay Area, CA<p>Looking for experience in any of the following:
* ML Domains: Vision Language Models, LLMs, Temporal/Video Models
* Model Training, Evaluation, and Versioning platforms: WnB, Huggingface
* Infra: Python, Pytorch, Pydantic, CUDA, Torch.compile
* Devops: Github CI, Docker, Conda, API Billing and Monitoring<p><a href="https://vlm-run.notion.site/vlm-run-hiring-25q1" rel="nofollow">https://vlm-run.notion.site/vlm-run-hiring-25q1</a></p>
]]></description><pubDate>Mon, 03 Feb 2025 20:32:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=42922656</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=42922656</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42922656</guid></item><item><title><![CDATA[Show HN: Vlm Run, Extract JSON from images, videos and documents in a simple API]]></title><description><![CDATA[
<p>Hey HN,<p>We’ve been building out an API for ‘Visual ETL’ that we call vlm.run. We’ve been working with foundation models (GPT4o, Gemini) for a few months and kept running into failure modes like:<p>- Hallucinations: even the best foundation models continue to hallucinate outputs for complex visual inputs, even when adhering to a schema.<p>- Rate limits: frontier models like GPT4o are still too expensive or rate limited for high volume visual data. Our API is designed for production workloads which means speed, stability, monitoring and, if needed, private deployments.<p>- Off the shelf schemas: Defining a schema takes trial and error to get right. We’ve put together a taxonomy for common visual tasks that are ready to go from day 1.<p>Some examples we’ve put together:<p>- Presentations: <a href="https://docs.vlm.run/guides/guide-pdf-presentations" rel="nofollow">https://docs.vlm.run/guides/guide-pdf-presentations</a><p>- TV News: <a href="https://docs.vlm.run/guides/guide-tv-news" rel="nofollow">https://docs.vlm.run/guides/guide-tv-news</a><p>Sign up for an API key and try us out on a 2 week free trial. Check out our docs at <a href="https://docs.vlm.run/what-is-vlm-1" rel="nofollow">https://docs.vlm.run/what-is-vlm-1</a> and reach out if you have questions!</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=41238373">https://news.ycombinator.com/item?id=41238373</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 13 Aug 2024 18:53:46 +0000</pubDate><link>https://vlm.run/</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=41238373</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41238373</guid></item><item><title><![CDATA[New comment by EarlyOom in "Launch HN: Trellis (YC W24) – AI-powered workflows for unstructured data"]]></title><description><![CDATA[
<p>Curious how this compares to platforms like <a href="https://unstructured.io/" rel="nofollow">https://unstructured.io/</a></p>
]]></description><pubDate>Tue, 13 Aug 2024 18:29:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=41238066</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=41238066</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41238066</guid></item></channel></rss>