<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: EarlyOom</title><link>https://news.ycombinator.com/user?id=EarlyOom</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 19 Apr 2026 20:20:18 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=EarlyOom" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by EarlyOom in "How we solved multi-modal tool-calling in MCP agents – VLM Run MCP"]]></title><description><![CDATA[
<p>Shocking how poorly frontier models perform on simple visual tasks. Best-in-domain tool calling will become the norm.</p>
]]></description><pubDate>Wed, 02 Jul 2025 19:27:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=44447832</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=44447832</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44447832</guid></item><item><title><![CDATA[New comment by EarlyOom in "Ask HN: Who is hiring? (March 2025)"]]></title><description><![CDATA[
<p>VLM Run | Member of Technical Staff, ML Systems | Full-time | Hybrid Bay Area, CA | <a href="https://vlm.run" rel="nofollow">https://vlm.run</a> | 150k-220k / yr + Equity<p>VLM Run is a first-of-its-kind API dedicated to running Vision Language Models on Documents, Images, and Video. We’re building a stack from the bottom-up for ‘Visual’ applications of language models that we believe will make up > 90% of inference needs in the next 5 years.<p>Hybrid from Bay Area, CA<p>Looking for experience in any of the following:<p>* ML Domains: Vision Language Models, LLMs, Temporal/Video Models<p>* Model Training, Evaluation, and Versioning platforms: WnB, Huggingface<p>* Infra: Python, Pytorch, Pydantic, CUDA, Torch.compile<p>* Devops: Github CI, Docker, Conda, API Billing and Monitoring<p><a href="https://vlm-run.notion.site/vlm-run-hiring-25q1" rel="nofollow">https://vlm-run.notion.site/vlm-run-hiring-25q1</a></p>
]]></description><pubDate>Mon, 03 Mar 2025 23:02:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=43247886</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43247886</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43247886</guid></item><item><title><![CDATA[New comment by EarlyOom in "Replace OCR with Vision Language Models"]]></title><description><![CDATA[
<p>This is the main focus of VLM Run, and of typed extraction more generally. If you provide proper type constraints (e.g. with Pydantic), you can dramatically reduce the surface area for hallucination. Beyond that, there's fine-tuning on your own dataset (we're working on this) to push accuracy past what you get from an unspecialized frontier model.</p>
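<p>To make that concrete, here's a minimal sketch with plain Pydantic v2 (the Invoice model and its fields are made up for illustration, not one of our published schemas): the JSON schema derived from the model is what constrains decoding, and validation rejects anything that drifts.</p>
<pre><code>from datetime import date
from pydantic import BaseModel, Field, ValidationError

class LineItem(BaseModel):
    description: str
    quantity: int = Field(ge=0)
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    issue_date: date                               # must parse as a real date, not free text
    currency: str = Field(pattern=r"^[A-Z]{3}$")   # ISO 4217 code, e.g. "USD"
    items: list[LineItem]
    total: float

# The schema you hand to the model to constrain its output.
print(Invoice.model_json_schema())

# Stand-in for the raw JSON a VLM returned for an invoice image.
raw = (
    '{"invoice_number": "INV-42", "issue_date": "2025-02-26", "currency": "USD",'
    ' "items": [{"description": "Widget", "quantity": 3, "unit_price": 9.99}], "total": 29.97}'
)

try:
    invoice = Invoice.model_validate_json(raw)     # output that drifts from the schema fails loudly here
    print(invoice.total)
except ValidationError as err:
    print(err)
</code></pre>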
]]></description><pubDate>Wed, 26 Feb 2025 22:28:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=43188999</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43188999</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43188999</guid></item><item><title><![CDATA[New comment by EarlyOom in "Replace OCR with Vision Language Models"]]></title><description><![CDATA[
<p>You can try out some of our schemas with Ollama if you want: <a href="https://github.com/vlm-run/vlmrun-hub">https://github.com/vlm-run/vlmrun-hub</a> (instructions in the README)</p>
]]></description><pubDate>Wed, 26 Feb 2025 21:32:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=43188474</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43188474</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43188474</guid></item><item><title><![CDATA[New comment by EarlyOom in "Replace OCR with Vision Language Models"]]></title><description><![CDATA[
<p>VLMs are able to take context into account when filling in fields, following either a global or field-specific prompt. This is great for things like unlabeled axes, or checking a legend for the unit that should be appended to a number. You also catch lots of really simple errors with type hints (e.g. dates, addresses, country codes).</p>
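<p>A rough sketch of what those field-level hints look like in a schema (plain Pydantic v2; the model and field names are hypothetical): the per-field description acts as the field-specific prompt and travels inside the generated JSON schema, while the type annotations catch the simple stuff.</p>
<pre><code>from datetime import date
from pydantic import BaseModel, Field

class ChartReading(BaseModel):
    # Field descriptions act as field-specific prompts: they are embedded in the
    # JSON schema the model sees, so hints like "check the legend" ride along.
    series: str = Field(description="Series name; if the axis is unlabeled, infer it from the legend or title")
    value: float = Field(description="Numeric value with the unit from the legend applied")
    recorded_on: date                              # malformed dates fail to parse instead of slipping through
    country: str = Field(pattern=r"^[A-Z]{2}$",
                         description="ISO 3166-1 alpha-2 country code")

# Both the description and the pattern show up in the schema handed to the VLM.
print(ChartReading.model_json_schema()["properties"]["country"])
</code></pre>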
]]></description><pubDate>Wed, 26 Feb 2025 21:21:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=43188372</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43188372</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43188372</guid></item><item><title><![CDATA[New comment by EarlyOom in "Replace OCR with Vision Language Models"]]></title><description><![CDATA[
<p>You can! It works with Ollama: <a href="https://github.com/vlm-run/vlmrun-hub">https://github.com/vlm-run/vlmrun-hub</a><p>At the end of the day it's just schemas. You can decide for yourself if it's worth upgrading to a larger, more expensive model.</p>
]]></description><pubDate>Wed, 26 Feb 2025 21:19:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=43188349</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43188349</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43188349</guid></item><item><title><![CDATA[New comment by EarlyOom in "Replace OCR with Vision Language Models"]]></title><description><![CDATA[
<p>We can do bounding boxes too :) We just call it visual grounding: <a href="https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/04_visual_grounding.ipynb">https://github.com/vlm-run/vlmrun-cookbook/blob/main/noteboo...</a></p>
]]></description><pubDate>Wed, 26 Feb 2025 21:19:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=43188344</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43188344</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43188344</guid></item><item><title><![CDATA[New comment by EarlyOom in "Replace OCR with Vision Language Models"]]></title><description><![CDATA[
<p>We convert to a JSON schema, but it would be trivial to convert this to YAML. There are some minor differences (e.g. in the tokens required to output JSON vs. YAML), which is why we've opted for this strategy.</p>
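<p>If you want to eyeball the token difference yourself, here's a quick sketch (assumes the tiktoken and PyYAML packages; the toy model is made up): serialize the same record both ways and count tokens.</p>
<pre><code>import json
import tiktoken          # pip install tiktoken
import yaml              # pip install pyyaml
from pydantic import BaseModel

class Receipt(BaseModel):
    merchant: str
    total: float
    currency: str

record = Receipt(merchant="Acme", total=12.50, currency="USD").model_dump()

enc = tiktoken.get_encoding("cl100k_base")
as_json = json.dumps(record)
as_yaml = yaml.safe_dump(record)

# Same content, different serialization; the token counts (and thus output cost) differ slightly.
print(len(enc.encode(as_json)), "tokens as JSON")
print(len(enc.encode(as_yaml)), "tokens as YAML")
</code></pre>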
]]></description><pubDate>Wed, 26 Feb 2025 21:18:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=43188337</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43188337</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43188337</guid></item><item><title><![CDATA[Replace OCR with Vision Language Models]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/01_schema_showcase.ipynb">https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/01_schema_showcase.ipynb</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43187209">https://news.ycombinator.com/item?id=43187209</a></p>
<p>Points: 292</p>
<p># Comments: 125</p>
]]></description><pubDate>Wed, 26 Feb 2025 19:29:37 +0000</pubDate><link>https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/01_schema_showcase.ipynb</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43187209</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43187209</guid></item><item><title><![CDATA[New comment by EarlyOom in "Show HN: Benchmarking VLMs vs. Traditional OCR"]]></title><description><![CDATA[
<p>OCR seems to be mostly solved for 'normal' text laid out according to Latin alphabet norms (left to right, normal spacing, etc.), but I'd love to see more adversarial examples. We've seen lots of regressions around faxed or scanned documents where the text boxes may be slightly rotated (e.g. <a href="https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-at-once-to-readable-orientation/" rel="nofollow">https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-...</a>), not to mention handwriting and poorly scanned docs. Then there's contextually dependent information, like X-axis labels that are only implicit from a legend somewhere, so it's not clear what the numbers refer to even with the bounding boxes. This is where VLMs really shine: they can extract the text and then use similar examples from the page to map it into the output values when the bounding box doesn't provide this for free.</p>
]]></description><pubDate>Fri, 21 Feb 2025 21:38:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=43133356</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43133356</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43133356</guid></item><item><title><![CDATA[Show HN: Visually parse an entire YouTube video frame by frame]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/03_case_study_tv_news.ipynb">https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/03_case_study_tv_news.ipynb</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43133264">https://news.ycombinator.com/item?id=43133264</a></p>
<p>Points: 5</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 21 Feb 2025 21:30:13 +0000</pubDate><link>https://github.com/vlm-run/vlmrun-cookbook/blob/main/notebooks/03_case_study_tv_news.ipynb</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43133264</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43133264</guid></item><item><title><![CDATA[Ask HN: What are folks using to train/fine-tune Vision Language Models]]></title><description><![CDATA[

<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43133162">https://news.ycombinator.com/item?id=43133162</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 21 Feb 2025 21:22:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=43133162</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43133162</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43133162</guid></item><item><title><![CDATA[A Node.js SDK for calling Vision Language Models]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/vlm-run/vlmrun-node-sdk">https://github.com/vlm-run/vlmrun-node-sdk</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43120375">https://news.ycombinator.com/item?id=43120375</a></p>
<p>Points: 6</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 20 Feb 2025 21:22:03 +0000</pubDate><link>https://github.com/vlm-run/vlmrun-node-sdk</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43120375</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43120375</guid></item><item><title><![CDATA[New comment by EarlyOom in "Run structured extraction on documents/images locally with Ollama and Pydantic"]]></title><description><![CDATA[
<p>Would love to chat! Reach out at scott@vlm.run</p>
]]></description><pubDate>Thu, 20 Feb 2025 20:28:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=43119750</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43119750</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43119750</guid></item><item><title><![CDATA[New comment by EarlyOom in "Run structured extraction on documents/images locally with Ollama and Pydantic"]]></title><description><![CDATA[
<p>That's one of our main focuses, yes: <a href="https://docs.vlm.run/api-reference/v1/fine-tuning/post-finetuning-create#create-finetuning-job" rel="nofollow">https://docs.vlm.run/api-reference/v1/fine-tuning/post-finet...</a></p>
]]></description><pubDate>Thu, 20 Feb 2025 06:25:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=43111724</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43111724</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43111724</guid></item><item><title><![CDATA[New comment by EarlyOom in "Run structured extraction on documents/images locally with Ollama and Pydantic"]]></title><description><![CDATA[
<p>We put together an open-source collection of Pydantic schemas for a variety of document categories (W-2 filings, invoices, etc.), including instructions for how to get structured JSON responses from any visual input with the model of your choosing. Run everything locally.</p>
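<p>The local flow boils down to roughly this (a minimal sketch, assuming the ollama Python package with structured-output support and a vision-capable model already pulled; the model name, file path, and field names are illustrative, not the exact hub code):</p>
<pre><code># pip install ollama pydantic; requires a local Ollama server and a vision-capable model.
from ollama import chat
from pydantic import BaseModel

class W2(BaseModel):
    employer_name: str
    employee_name: str
    wages: float
    tax_year: int

response = chat(
    model="llama3.2-vision",                       # example model name
    messages=[{
        "role": "user",
        "content": "Extract the W-2 fields as JSON matching the schema.",
        "images": ["w2_sample.png"],               # local path to the scanned form
    }],
    format=W2.model_json_schema(),                 # constrain output to the schema
)

w2 = W2.model_validate_json(response.message.content)
print(w2)
</code></pre>
<p>Swap the schema for any of the ones in the hub and point it at your own documents.</p>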
]]></description><pubDate>Thu, 20 Feb 2025 01:54:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=43110174</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43110174</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43110174</guid></item><item><title><![CDATA[Run structured extraction on documents/images locally with Ollama and Pydantic]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/vlm-run/vlmrun-hub">https://github.com/vlm-run/vlmrun-hub</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43110173">https://news.ycombinator.com/item?id=43110173</a></p>
<p>Points: 170</p>
<p># Comments: 29</p>
]]></description><pubDate>Thu, 20 Feb 2025 01:54:10 +0000</pubDate><link>https://github.com/vlm-run/vlmrun-hub</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=43110173</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43110173</guid></item><item><title><![CDATA[New comment by EarlyOom in "Ask HN: Who is hiring? (February 2025)"]]></title><description><![CDATA[
<p>VLM Run | Member of Technical Staff, ML Systems, Developer Relations | Full-time | Bay Area, CA | <a href="https://vlm.run" rel="nofollow">https://vlm.run</a> | 150k-220k / yr + Equity<p>VLM Run is a first-of-its-kind API dedicated to running Vision Language Models on Documents, Images, and Video. We’re building a stack from the bottom-up for ‘Visual’ applications of language models that we believe will make up > 90% of inference needs in the next 5 years.<p>Hybrid from Bay Area, CA<p>Looking for experience in any of the following:
* ML Domains: Vision Language Models, LLMs, Temporal/Video Models
* Model Training, Evaluation, and Versioning platforms: WnB, Huggingface
* Infra: Python, Pytorch, Pydantic, CUDA, Torch.compile
* Devops: Github CI, Docker, Conda, API Billing and Monitoring<p><a href="https://vlm-run.notion.site/vlm-run-hiring-25q1" rel="nofollow">https://vlm-run.notion.site/vlm-run-hiring-25q1</a></p>
]]></description><pubDate>Mon, 03 Feb 2025 20:32:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=42922656</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=42922656</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42922656</guid></item><item><title><![CDATA[Show HN: Vlm Run, Extract JSON from images, videos and documents in a simple API]]></title><description><![CDATA[
<p>Hey HN,<p>We’ve been building out an API for ‘Visual ETL’ that we call vlm.run. We’ve been working with foundation models (GPT4o, Gemini) for a few months and kept running into failure modes like:<p>- Hallucinations: even the best foundation models continue to hallucinate outputs for complex visual inputs, even when adhering to a schema.<p>- Rate limits: frontier models like GPT4o are still too expensive or rate limited for high volume visual data. Our API is designed for production workloads which means speed, stability, monitoring and, if needed, private deployments.<p>- Off the shelf schemas: Defining a schema takes trial and error to get right. We’ve put together a taxonomy for common visual tasks that are ready to go from day 1.<p>Some examples we’ve put together:<p>- Presentations: <a href="https://docs.vlm.run/guides/guide-pdf-presentations" rel="nofollow">https://docs.vlm.run/guides/guide-pdf-presentations</a><p>- TV News: <a href="https://docs.vlm.run/guides/guide-tv-news" rel="nofollow">https://docs.vlm.run/guides/guide-tv-news</a><p>Sign up for an API key and try us out on a 2 week free trial. Check out our docs at <a href="https://docs.vlm.run/what-is-vlm-1" rel="nofollow">https://docs.vlm.run/what-is-vlm-1</a> and reach out if you have questions!</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=41238373">https://news.ycombinator.com/item?id=41238373</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 13 Aug 2024 18:53:46 +0000</pubDate><link>https://vlm.run/</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=41238373</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41238373</guid></item><item><title><![CDATA[New comment by EarlyOom in "Launch HN: Trellis (YC W24) – AI-powered workflows for unstructured data"]]></title><description><![CDATA[
<p>Curious how this compares to platforms like <a href="https://unstructured.io/" rel="nofollow">https://unstructured.io/</a></p>
]]></description><pubDate>Tue, 13 Aug 2024 18:29:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=41238066</link><dc:creator>EarlyOom</dc:creator><comments>https://news.ycombinator.com/item?id=41238066</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41238066</guid></item></channel></rss>