<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: devonkelley</title><link>https://news.ycombinator.com/user?id=devonkelley</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 12 Apr 2026 08:51:05 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=devonkelley" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by devonkelley in "Minions: Stripe’s one-shot, end-to-end coding agents"]]></title><description><![CDATA[
<p>The add/remove churn is the thing nobody talks about enough. We track net lines changed vs. gross lines changed as a ratio, and it's genuinely horrifying sometimes. An agent will touch 400 lines to net 30 lines of actual progress. And the worst part is that the removed code was often fine; the agent just decided to refactor something it didn't need to touch while it was in there. Shorter, tighter loops with clearer scope boundaries have helped us more than any orchestration improvement.</p>
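<p>The ratio described above can be sketched in a few lines; the helper name and the per-file tuple shape here are illustrative, not the commenter's actual tooling:</p>

```python
def churn_ratio(changes):
    """changes: iterable of (added, removed) line counts per file.

    Returns net lines changed / gross lines touched. Low values mean
    the agent shuffled a lot of code for little actual progress.
    """
    added = sum(a for a, _ in changes)
    removed = sum(r for _, r in changes)
    gross = added + removed        # every line the agent touched
    net = abs(added - removed)     # lines of real movement
    return net / gross if gross else 1.0

# "Touch 400 lines to net 30": e.g. 215 added, 185 removed in one file.
print(churn_ratio([(215, 185)]))  # 0.075
```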
]]></description><pubDate>Mon, 23 Feb 2026 19:29:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=47127489</link><dc:creator>devonkelley</dc:creator><comments>https://news.ycombinator.com/item?id=47127489</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47127489</guid></item><item><title><![CDATA[New comment by devonkelley in "Claws are now a new layer on top of LLM agents"]]></title><description><![CDATA[
<p>The approval-link pattern for gating dangerous actions is something I keep coming back to as well; it's far more robust than trying to teach the agent what's "safe" vs. not. How do you handle the case where the agent needs the result of the gated action to continue its chain? Does it block and wait, or does it park the whole task? The suspend/resume problem is where most of these setups get messy in my experience.</p>
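<p>One way to sidestep the messy suspend/resume is to park the task state keyed by the approval id instead of blocking a worker. A minimal sketch, with all names (the parked-task store, the example URL) hypothetical:</p>

```python
import uuid

PARKED = {}  # approval_id -> (task_state, gated_action)

def request_approval(task_state, action):
    """Park the task and return an approval link for a human to click."""
    approval_id = str(uuid.uuid4())
    PARKED[approval_id] = (task_state, action)
    return f"https://example.invalid/approve/{approval_id}"

def on_approval(approval_id):
    """Approval link clicked: run the gated action, resume the task."""
    task_state, action = PARKED.pop(approval_id)
    task_state["pending_result"] = action()  # feed result back into the chain
    return task_state

# The agent parks instead of blocking, and picks up where it left off.
link = request_approval({"step": 3}, lambda: "deleted 12 files")
state = on_approval(link.rsplit("/", 1)[-1])
print(state["pending_result"])  # deleted 12 files
```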
]]></description><pubDate>Mon, 23 Feb 2026 19:26:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=47127450</link><dc:creator>devonkelley</dc:creator><comments>https://news.ycombinator.com/item?id=47127450</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47127450</guid></item><item><title><![CDATA[New comment by devonkelley in "AGENTS.md outperforms skills in our agent evals"]]></title><description><![CDATA[
<p>I made an account years ago, never posted, and decided I wanted to be more active in the community.<p>The green accounts are probably because I sent my post to some friends and users directly when I made it. Is that illegal on HN? I legit don't know how things work here. I was excited about my launch post.<p>Anyway, I'm not a fucking bot, my company is real, the commenters on my post are real, and if it's a crime for me to rapid-fire post and/or have friends comment on my Show HN, good to know.</p>
]]></description><pubDate>Fri, 30 Jan 2026 05:45:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=46820951</link><dc:creator>devonkelley</dc:creator><comments>https://news.ycombinator.com/item?id=46820951</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46820951</guid></item><item><title><![CDATA[New comment by devonkelley in "Show HN: Kalibr – Autonomous Routing for AI Agents"]]></title><description><![CDATA[
<p>@tomhow: emailing to verify that I am real and that the comments on my post are not bots; I can verify whatever you need me to. It's annoying to be flagged when I'm just new here and trying to be part of the community.</p>
]]></description><pubDate>Fri, 30 Jan 2026 05:34:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=46820886</link><dc:creator>devonkelley</dc:creator><comments>https://news.ycombinator.com/item?id=46820886</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46820886</guid></item><item><title><![CDATA[New comment by devonkelley in "Show HN: Kalibr – Autonomous Routing for AI Agents"]]></title><description><![CDATA[
<p>These are real people dude. I know some of them. Some are users or friends who came to comment on our post. They aren't bots and neither am I. Just new to HN.</p>
]]></description><pubDate>Fri, 30 Jan 2026 05:24:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=46820835</link><dc:creator>devonkelley</dc:creator><comments>https://news.ycombinator.com/item?id=46820835</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46820835</guid></item><item><title><![CDATA[New comment by devonkelley in "AGENTS.md outperforms skills in our agent evals"]]></title><description><![CDATA[
<p>Dude, I am not an AI. Real human, just started on HN.</p>
]]></description><pubDate>Fri, 30 Jan 2026 05:14:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=46820782</link><dc:creator>devonkelley</dc:creator><comments>https://news.ycombinator.com/item?id=46820782</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46820782</guid></item><item><title><![CDATA[New comment by devonkelley in "Moltworker: a self-hosted personal AI agent, minus the minis"]]></title><description><![CDATA[
<p>The prompt injection concerns are valid, but I think there's a more fundamental issue: agents are non-deterministic systems that fail in ways that are hard to predict or debug.<p>Security is one failure mode. But "agent did something subtly wrong that didn't trigger any errors" is another. And unlike a hacked system where you notice something's off, a flaky agent just... occasionally does the wrong thing. Sometimes it works. Sometimes it doesn't. Figuring out which case you're in requires building the same observability infrastructure you'd use for any unreliable distributed system.<p>The people running these connected to their email or filesystem aren't just accepting prompt injection risk. They're accepting that their system will randomly succeed or fail at tasks depending on model performance that day, and they may not notice the failures until later.</p>
]]></description><pubDate>Fri, 30 Jan 2026 04:08:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=46820417</link><dc:creator>devonkelley</dc:creator><comments>https://news.ycombinator.com/item?id=46820417</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46820417</guid></item><item><title><![CDATA[New comment by devonkelley in "Claude Code daily benchmarks for degradation tracking"]]></title><description><![CDATA[
<p>Running agents in production, I've stopped trying to figure out <i>why</i> things degrade. The answer changes weekly.<p>Model drift, provider load, API changes, tool failures - it doesn't matter. What matters is that yesterday's 95% success rate is today's 70%, and by the time you notice, debug, and ship a fix, something else has shifted.<p>The real question isn't "is the model degraded?" It's "what should my agent do right now given current conditions?"<p>We ended up building systems that canary multiple execution paths continuously and route traffic based on what's actually working. When Claude degrades, traffic shifts to the backup path automatically. No alerts, no dashboards, no incident.<p>Treating this as a measurement problem assumes humans will act on the data. At scale, that assumption breaks.</p>
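<p>The canary-and-shift behavior described above is roughly this (a toy sketch, not the commenter's production system):</p>

```python
import random

def pick_path(stats, canary_rate=0.05, rng=random.random):
    """stats: {path: (successes, attempts)}, attempts > 0 for each path.

    Send most traffic to the best-performing path, but keep a small
    canary fraction flowing to alternatives so recovery (or further
    degradation) is detected without any human watching a dashboard.
    """
    rates = {p: s / n for p, (s, n) in stats.items()}
    best = max(rates, key=rates.get)
    others = [p for p in rates if p != best]
    if others and rng() < canary_rate:
        return random.choice(others)  # canary traffic
    return best
```

<p>When yesterday's 95% path drops to 70% and an alternative holds steady, the argmax flips and traffic shifts on the very next call.</p>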
]]></description><pubDate>Fri, 30 Jan 2026 03:58:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=46820375</link><dc:creator>devonkelley</dc:creator><comments>https://news.ycombinator.com/item?id=46820375</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46820375</guid></item><item><title><![CDATA[New comment by devonkelley in "Show HN: Kalibr – Autonomous Routing for AI Agents"]]></title><description><![CDATA[
<p>Wow, yes. You nailed the framing. Autonomous control plane is the perfect way to describe Kalibr.<p>Defining success: We don't normalize it. Teams define their own outcome signals (latency, cost, user ratings, task completion, etc). You don't need perfect attribution to beat static configs; even noisy signals surface real patterns when aggregated correctly.<p>Oscillation: Thompson Sampling. Instead of greedily chasing the current best path, we maintain uncertainty estimates and explore proportionally. Sparse or noisy outcomes widen confidence intervals, which naturally dampens switching. Wilson scoring handles the low-sample edge cases without the wild swings you'd get from raw percentages.<p>Confidence/regret: Explicit in the routing math. Every path carries uncertainty that decays with evidence. The system minimizes cumulative regret over time rather than optimizing point-in-time decisions.<p>The gap we're closing is exactly what you mentioned. Self-correcting instead of babysat.</p>
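<p>For the curious, the Thompson Sampling step described above reduces to a few lines with Beta posteriors. This is a generic sketch of the technique, not Kalibr's actual implementation:</p>

```python
import random

def thompson_pick(stats, rng=random):
    """stats: {path: (successes, failures)} per execution path.

    Sample a plausible success rate from each path's Beta posterior and
    take the argmax. Well-observed paths sample near their true rate;
    sparse paths sample widely, so exploration (and the dampened
    switching the comment mentions) falls out for free.
    """
    draws = {
        path: rng.betavariate(s + 1, f + 1)  # Beta(1, 1) uniform prior
        for path, (s, f) in stats.items()
    }
    return max(draws, key=draws.get)
```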
]]></description><pubDate>Tue, 27 Jan 2026 17:22:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=46783116</link><dc:creator>devonkelley</dc:creator><comments>https://news.ycombinator.com/item?id=46783116</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46783116</guid></item><item><title><![CDATA[Show HN: Kalibr – Autonomous Routing for AI Agents]]></title><description><![CDATA[
<p>Hey HN, we’re Devon and Alex from Kalibr (<a href="https://kalibr.systems" rel="nofollow">https://kalibr.systems</a>).<p>Kalibr is an autonomous routing system for AI agents. It replaces human debugging with an outcome-driven learning loop. On every agent run, it decides which execution path to use based on what is actually working in production.<p>An execution path is a full strategy, not just a model: model + tools + parameters.<p>Most agents hardcode one path. When that path degrades or fails, a human has to notice, debug, change configs, and redeploy. Even then, the fix often doesn’t stick because models and tools keep changing.<p>I got tired of being the reliability layer for my own agents. Kalibr replaces that.<p>With Kalibr, you register multiple paths for a task. You define what success means. After each run, your code reports the outcome. Kalibr captures telemetry on every run, learns from outcomes, and routes traffic to the path that’s working best while continuously canarying your alternative paths. When one path degrades or fails, traffic shifts immediately. No alerts, no dashboards and no incident response.<p>How is this different from other routers or observability tools?<p>Most routers choose between models using static rules or offline benchmarks. Observability tools show traces and metrics but still require humans to act. Kalibr is outcome-aware and autonomous. It learns directly from production success and changes runtime behavior automatically. It answers not “what happened?” but “what should my agent do next?”<p>We’re not a proxy. Calls go directly to OpenAI, Anthropic, or Google. We’re not a retry loop. Failed paths are routed away from, not retried blindly. Success rate always dominates; cost and latency only matter when success rates are close.<p>Python and TypeScript SDKs. Works with LangChain, CrewAI, and the OpenAI Agents SDK. Decision latency is ~50ms. 
If Kalibr is unavailable, the Router falls back to your first path.<p>Think of it as if/else logic for agents that rewrites itself based on real production outcomes.<p>We’ve been running this with design partners and would love feedback. Always curious how others are handling agent reliability in production.<p>GitHub: <a href="https://github.com/kalibr-ai/kalibr-sdk-python" rel="nofollow">https://github.com/kalibr-ai/kalibr-sdk-python</a><p>Docs & benchmarks: <a href="https://kalibr.systems/docs" rel="nofollow">https://kalibr.systems/docs</a></p>
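<p>The "if/else logic that rewrites itself" framing can be illustrated with a toy router. To be clear, this is NOT the real kalibr-sdk-python API, just the shape of the contract the post describes: register paths, choose one per run, report the outcome.</p>

```python
class Router:
    """Toy outcome-driven router; path names below are made up."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.stats = {p: [0, 0] for p in paths}  # successes, attempts

    def choose(self):
        # Before any evidence exists, fall back to the first registered
        # path (mirroring the fallback behavior described above).
        scored = [(s / n, p) for p, (s, n) in self.stats.items() if n]
        return max(scored)[1] if scored else self.paths[0]

    def report(self, path, success):
        s, n = self.stats[path]
        self.stats[path] = [s + success, n + 1]

router = Router(["gpt+search", "claude+tools"])
path = router.choose()            # "gpt+search": no data yet
router.report(path, success=True)
```

<p>Each outcome report rewrites the next run's branch choice, which is the whole loop in miniature.</p>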
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46782579">https://news.ycombinator.com/item?id=46782579</a></p>
<p>Points: 3</p>
<p># Comments: 13</p>
]]></description><pubDate>Tue, 27 Jan 2026 16:51:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=46782579</link><dc:creator>devonkelley</dc:creator><comments>https://news.ycombinator.com/item?id=46782579</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46782579</guid></item></channel></rss>