Hacker News: Show HN

Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

zambelli — Tue, 19 May 2026 12:23:07 +0000

Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments.

I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.

What it does:

- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware

- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it

- Ships with an eval harness and interactive dashboard so you can reproduce every number

I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's a 40% failure rate. No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier.

Demo video: https://youtu.be/MzRgJoJAXGc (side-by-side: same model, same task, with and without Forge guardrails)

The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:

- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.

- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.

- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.

I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).

The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.

One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.

Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.

Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.

How to try it:

- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.

- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest model and I'd love more eyes on it.

- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers based on pre v0.6.0 code.

Background: prior ML publication in unsupervised learning (83 citations). This paper accepted to ACM CAIS '26 - presenting May 26-29.

Repo: https://github.com/antoinezambelli/forge

Paper: https://www.caisconf.org/program/2026/demos/forge-agentic-re... https://github.com/antoinezambelli/forge/blob/main/docs/forg...

Dashboard: https://github.com/antoinezambelli/forge/docs/results/dashbo...

Comments URL: https://news.ycombinator.com/item?id=48192383

Points: 410

# Comments: 158

Show HN: Gaussian Splat of a Strawberry

danybittel — Tue, 19 May 2026 10:38:47 +0000

The Setup:

https://i.imgur.com/o0hgybh.jpeg

https://i.imgur.com/mcNiomp.jpeg

https://i.imgur.com/vIjw6pc.jpeg

https://i.imgur.com/nzOwmSC.jpeg

Comments URL: https://news.ycombinator.com/item?id=48191602

Points: 494

# Comments: 190

Show HN: Files.md – Open-source alternative to Obsidian

zakirullin — Mon, 18 May 2026 13:33:33 +0000

Article URL: https://github.com/zakirullin/files.md

Comments URL: https://news.ycombinator.com/item?id=48179677

Points: 697

# Comments: 339

Show HN: Auto-identity-remove – Automated data broker opt-out runner for macOS

stephenlthorn — Mon, 18 May 2026 11:32:42 +0000

Article URL: https://github.com/stephenlthorn/auto-identity-remove

Comments URL: https://news.ycombinator.com/item?id=48178184

Points: 323

# Comments: 134

Show HN: Semble – Code search for agents that uses 98% fewer tokens than grep

Bibabomas — Sun, 17 May 2026 15:37:07 +0000

Hey HN! We (Stephan and Thomas) recently open-sourced Semble. We kept running into the same problem while using Claude Code on large codebases: when the agent can't find something directly, it falls back to grep, reading full files or launching subagents. This uses a lot of tokens, and often still misses the relevant code. There are existing tools for this, but they were either too slow to index on demand, needed API keys, or had poor retrieval quality.

Semble is our solution for this. It combines static Model2Vec embeddings (using our latest static model: potion-code-16M) with BM25, fused via RRF and reranked with code-aware signals. Everything runs on CPU since there's no transformers involved. On our benchmark of ~1250 query/document pairs across 63 repos and 19 languages, it uses 98% fewer tokens than grep+read and reaches 99% of the retrieval quality of a 137M-parameter code-trained transformer, while being ~200x faster.

Main features:

- Token-efficient: 98% fewer tokens than grep+read

- Fast: ~250ms to index a typical repo on our benchmark, ~1.5ms per query on CPU (very large repos may take longer)

- Accurate: 0.854 NDCG@10, 99% of the best transformer setup we tested

- MCP server: drop-in for Claude Code, Cursor, Codex, OpenCode

- Zero config: no API keys, no GPU, no external services

Install in Claude Code with: claude mcp add semble -s user -- uvx --from "semble[mcp]" semble

Or check our README for other installation instructions, benchmarks, and methodology:

Semble: https://github.com/MinishLab/semble

Benchmarks: https://github.com/MinishLab/semble/tree/main/benchmarks

Model: https://huggingface.co/minishlab/potion-code-16M

Let us know if you have any feedback or questions!

Comments URL: https://news.ycombinator.com/item?id=48169874

Points: 439

# Comments: 147

Show HN: Rocksky – Music scrobbling and discovery on the AT Protocol

tsiry — Sat, 16 May 2026 17:00:41 +0000

Article URL: https://tangled.org/rocksky.app/rocksky

Comments URL: https://news.ycombinator.com/item?id=48161881

Points: 117

# Comments: 44

Show HN: Burn, baby, burn (those tokens)

dtnewman — Fri, 15 May 2026 17:20:26 +0000

Article URL: https://github.com/dtnewman/burn-baby-burn

Comments URL: https://news.ycombinator.com/item?id=48151287

Points: 134

# Comments: 30

Show HN: Find the best local LLM for your hardware, ranked by benchmarks

andyyyy64 — Fri, 15 May 2026 09:19:24 +0000

Article URL: https://github.com/Andyyyy64/whichllm

Comments URL: https://news.ycombinator.com/item?id=48146369

Points: 283

# Comments: 68

Show HN: Watch a neural net learn to play Snake

c1b — Thu, 14 May 2026 15:35:14 +0000

In browser PPO training demo, made possible by tinygrad: TinyJit -> WebGPU kernels.

Requires WebGPU.

Comments URL: https://news.ycombinator.com/item?id=48136981

Points: 203

# Comments: 47

Show HN: Running the second public ODoH relay

rdme — Thu, 14 May 2026 10:44:50 +0000

Every privacy-focused DNS service requires an account: NextDNS, Cloudflare for Families, Apple's iCloud Private Relay (paid, iOS-only). The protocol that doesn’t require one - ODoH - had basically one well-known public relay operator (Frank Denis on Fastly Compute, default in dnscrypt-proxy). I built a second one and the client to talk to it.

Comments URL: https://news.ycombinator.com/item?id=48133561

Points: 125

# Comments: 42

Show HN: Number Gacha, a gacha game distilled to its essence

babel16 — Wed, 13 May 2026 15:39:41 +0000

Number Gacha is a half-parody, half-real gacha game where you roll, unwrap, and battle numbers. Play on Desktop for the best experience!

Comments URL: https://news.ycombinator.com/item?id=48123359

Points: 242

# Comments: 122

Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

HenryNdubuaku — Tue, 12 May 2026 18:03:11 +0000

Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.

We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted investigations that led to an observation: agentic experiences are built upon tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.

Simple Attention Networks: the entire model is just attention and gating, no MLPs anywhere. Needle is an experimental run for single-shot function calling for consumer devices (phones, watches, glasses...).

Training: - Pretrained on 200B tokens across 16 TPU v6e (27 hours) - Post-trained on 2B tokens of synthesized function-calling data (45 minutes) - Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)

You can test it right now and finetune on your Mac/PC: https://github.com/cactus-compute/needle

The full writeup on the architecture is here: https://github.com/cactus-compute/needle/blob/main/docs/simp...

We found that the "no FFN" finding generalizes beyond function calling to any task where the model has access to external structured knowledge (RAG, tool use, retrieval-augmented generation). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input. Experimental results to published.

While it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, LFM2.5-350M on single-shot function calling, those models have more scope/capacity and excel in conversational settings. We encourage you to test on your own tools via the playground and finetune accordingly.

This is part of our broader work on Cactus (https://github.com/cactus-compute/cactus), an inference engine built from scratch for mobile, wearables and custom hardware. We wrote about Cactus here previously: https://news.ycombinator.com/item?id=44524544

Everything is MIT licensed. Weights: https://huggingface.co/Cactus-Compute/needle GitHub: https://github.com/cactus-compute/needle

Comments URL: https://news.ycombinator.com/item?id=48111896

Points: 774

# Comments: 211

Show HN: Statewright – Visual state machines that make AI agents reliable

azurewraith — Tue, 12 May 2026 14:24:55 +0000

Agentic problem solving in its current state is very brittle. I fell in love with it, but it creates as many problems as it solves.

I'm Ben Cochran, I spent 20+ years in the trenches with full-stack Engineering, DevOps, high performance computing & ML with stints at NVIDIA, AMD and various other organizations most recently as a Distinguished Engineer.

For agents to work reliably you either need massive parameter counts or massive context windows to keep the solution spaces workable. Most people are brute forcing reliability with bigger models and longer prompts.

What if I made the problem smaller instead of making the model bigger?

I took a different approach by using smaller models: models in the 13-20B parameter range and set them to task solving real SWE-bench problems. I constrained the tool and solution spaces using formal state machines. Each state in the machine defines which tools the model can access, how many iterations it gets and what transitions are valid. A planning state gets read-only tools. An implementation state gets edit tools (scoped to prevent mega edits) and write friendly bash tools. The testing state gets bash but only for testing commands. The model cannot physically skip steps or use the wrong tool at the wrong time. It is enforced via protocol, not via prompts.

The results were more promising than I would have expected. Across multiple model families irrespective of age (qwen-coder, gpt-oss, gemma4) and the improvements were consistent above the 13B parameter inflection point. Below that, models can navigate the state machine but can't retain enough context to produce accurate edits. More on the research bit: https://statewright.ai/research

Surprisingly this yielded improvements in frontier models as well. Haiku and Sonnet start to punch above their weight and Opus solves more reliably with fewer tokens and death spirals. Fine tuning did not yield these kinds of functional improvements for me. The takeaway it seems is that context window utilization matters more than raw context size - a tightly scoped working context at each step outperforms a model given carte blanche over everything. Constraining LLMs which are non-idempotent by using deterministic code is a pattern that nobody is currently talking about.

So, I built Statewright. Its core is a Rust engine that evaluates state machine definitions: states, transitions, guards and tool restrictions. Its orchestration doesn't use an LLM, just enforces the state machine. On top of that is a plugin layer that integrates with Claude Code (and soon Codex, Cursor and others) via MCP. When you activate a workflow, hooks enforce the guardrails per state automatically. The model sees 5 tools available instead of dozens, gets clear instructions for the current phase and transitions when conditions are met. Importantly it tells the model when it's attempting to do something that isn't in scope, incorrect or when it needs to try something else after getting stuck.

You can use your agent via MCP to build a state machine for you to solve a problem in your current context. The visual editor at statewright.ai lets you tweak these workflows in a graph view... You can clearly see the failure paths, the retry loops and the approval gates. State machines aren't DAGs; they loop and retry, which is what agentic work actually needs.

Statewright is currently live with a free tier, try it out in Claude Code by running the following:

/plugin marketplace add statewright/statewright

/plugin install statewright

/reload-plugins

Then "start the bugfix workflow" or /statewright start bugfix. You'll need to paste your API key when prompted. The latest versions of Claude may complain -- paste the API key again and say you really mean it, Claude is just being cautious here.

Feedback is welcome on the workflow editor, the plugin experience, and tell me what workflows you'd want to build first. Agents are suggestions, states are laws.

Comments URL: https://news.ycombinator.com/item?id=48108778

Points: 126

# Comments: 56

Show HN: OpenGravity – A zero-install, BYOK vanilla JS clone of Antigravity

ab613 — Mon, 11 May 2026 20:23:25 +0000

Hi. I’m a high school student studying for my GCSEs. I was using Google Antigravity heavily for my side projects, but I kept hitting the usage limits, and getting random "agent terminated" errors. So I decided to try build my own version of the IDE. I love the UI, so I copied it as accurately as possible, and then hooked up some logic into it, including the INCREDIBLY finicky webcontainer api.

I tried to keep it super lightweight, no build steps, or dependencies, and now that its open source, I'm hoping people can build things on top of it that arent possible with closed source tools, like complex custom agent workflows.

Some screenshots: - https://github.com/ab-613/OpenGravity/blob/main/examples/scr... - https://github.com/ab-613/OpenGravity/blob/main/examples/htm...

What it's made from:

- Pure Vanilla JS: no react, vue, or build step. Built entirely in plain HTML/CSS/JS to keep it super lightweight.

- WebContainer API and xterm.js: Instead of faking a terminal, I (after much pain) hooked up the WebContainer API so the AI agent has a real, in browser linux environment to run shell commands, install dependencies, and edit local files.

- BYOK (Bring Your Own Key): API key ALWAYS stays in localStorage.

Whats currently happening:

- It works, but it's an alpha. The AI can proactively start projects going properly and edit files, but because I built this over a few days before my exams, a lot of the UI dropdowns and buttons are currently just hardcoded placeholders.

- I’m open sourcing it early because I think the foundation of a Vanilla JS + WebContainer IDE is really strong, and I'd love to see where the community takes it while I'm doing my exams.

- Live demo: https://opengravity.pages.dev (Zoom out to 80% if not full screen. It will prompt for a gemini api key on load). Start by uploading a folder, then you can fiddle with the terminal and agent, and see how it goes!

Would love to hear feedback on the code, the WebContainer integration, or how to improve the agent loop!

Comments URL: https://news.ycombinator.com/item?id=48100192

Points: 106

# Comments: 30

Show HN: TikTok but for scientific papers

ciwrl — Mon, 11 May 2026 16:05:50 +0000

Article URL: https://andreaturchet.github.io/website/index.html

Comments URL: https://news.ycombinator.com/item?id=48096842

Points: 200

# Comments: 80

Show HN: An index of indie web/blog indexes

rocketpastsix — Sun, 10 May 2026 12:50:31 +0000

I saw a comment here about how there are so many indexes of indie sites, blogs, etc but there wasn't an index of all the indexes. So I built it. It doesn't require a log in, just go browse! I've curated about 30 or so, but there is a submission form if there are ones I am missing.

Also happy to take UI improvements because I am not great in that area!

Comments URL: https://news.ycombinator.com/item?id=48083580

Points: 154

# Comments: 39

Show HN: Building a web server in assembly to give my life (a lack of) meaning

imtomt — Sun, 10 May 2026 03:01:44 +0000

This is ymawky, a static file web server for MacOS written entirely in ARM64 assembly. It supports GET, PUT, DELETE, HEAD, and OPTIONS requests, and supports Range: bytes=X-Y headers (which allows scrubbing for video streaming). It decodes percent-encoded URLs, strictly enforces docroot, serves custom error pages for any HTTP error response, supports directory listing, and has (some) mitigations against slowloris-like attacks.

I’ve also written a more detailed writeup here: https://imtomt.github.io/ymawky/

Comments URL: https://news.ycombinator.com/item?id=48080587

Points: 430

# Comments: 228

Show HN: Rust but Lisp

thatxliner — Sat, 09 May 2026 21:46:27 +0000

Article URL: https://github.com/ThatXliner/rust-but-lisp

Comments URL: https://news.ycombinator.com/item?id=48078575

Points: 218

# Comments: 74

Show HN: I made a Clojure-like language in Go, boots in 7ms

marcingas — Sat, 09 May 2026 17:52:13 +0000

Let-go is a Clojure-like language (~90% compatible with JVM Clojure) written in pure Go. It ships as a ~10MB static binary and cold boots in ~7ms - that's about 50x faster than JVM and 3x faster than Babashka. It has decent throughput on algorithmic workloads - within ballpark of the GraalVM-backed sci.

I started this project in 2021 as an elaborate practical joke: I wanted to have an excuse for writing Clojure while pretending to write Go.

Jokes aside, it turned out to be pretty decent: it feels like real Clojure, it has an nREPL server (supported in Calva, CIDER, etc.), it's easily embeddable in your Go programs (funcs, structs and channels cross the boundary without fuss). It's good for writing CLIs, web servers, data processing scripts and even doing some systems programming - I used it to write a deamonless container runtime. Oh, and it runs on Plan9.

Under the hood there is a fairly simple compiler and a stack VM, both handcrafted specifically for running Clojure-like code. The compiler can work in AOT mode producing portable bytecode blobs and standalone binaries (runtime+bytecode).

This is not a drop-in replacement for Clojure in general - it does not load JARs, it does not have all Java APIs and it most probably won't run your exiting Clojure projects without modifications. At least not at the moment.