Hacker News: chirdeeps

New comment by chirdeeps in "Show HN: Output.ai - OSS framework we extracted from 500+ production AI agents"

chirdeeps — Sat, 11 Apr 2026 10:13:56 +0000

Extracting a framework from production experience is the right way to build one. The failure modes you encode are the ones that actually matter.

Curious what the three failure modes were that caused the most incidents before you extracted the framework? Those tend to reveal the assumptions baked into the design.

New comment by chirdeeps in "Google open-sources experimental agent orchestration testbed Scion"

chirdeeps — Sat, 11 Apr 2026 10:08:15 +0000

The failure mode most underrepresented in agent testbeds is cascading failure: what happens when individually correct agents interact in ways that produce collectively incorrect outcomes. Most testing focuses on individual agent behaviour.

Does the testbed have a model for multi-agent state conflicts? For example, can you simulate two agents concurrently modifying the same resource and observe the resolution behaviour?

New comment by chirdeeps in "Ask HN: How are teams productionizing AI agents today?"

chirdeeps — Sat, 14 Mar 2026 13:47:14 +0000

The biggest thing that surprised us: the constraint shifts from intelligence to reliability the moment agents start modifying shared systems. In a PoC, a wrong action is a failed experiment. In production, it's a corrupted customer record, a duplicated invoice, or a deployment that can't be unwound.

The specific properties you need before going live that most frameworks don't give you out of the box:

1. Idempotency: can every agent action be safely retried without duplicating side effects?

2. Rollback semantics: if a multi-step workflow fails, what unwinds?

3. Authority boundaries: what can each agent do without human approval, and what requires sign-off?

4. An authoritative action history: when something goes wrong, can you reconstruct exactly what happened and why, without stitching together logs from five different systems?

Most teams discover these requirements after the first production incident. The teams that define them in an execution layer before going live have a much smoother transition. Keen to check out what you will be sharing in the session.
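To make properties 1 and 4 concrete, here is a minimal sketch of an idempotency-key wrapper that also keeps a single action history. Everything in it (`IdempotentExecutor`, the key format, the `create_invoice` action) is a hypothetical illustration, and the in-memory dict stands in for what would need to be a durable store in practice.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class IdempotentExecutor:
    """Runs each action at most once per idempotency key (hypothetical helper)."""

    results: Dict[str, Any] = field(default_factory=dict)   # key -> stored result
    history: List[dict] = field(default_factory=list)       # single action history

    def run(self, key: str, agent: str, action: Callable[[], Any]) -> Any:
        if key in self.results:
            # Retry of an action that already ran: return the stored result
            # instead of repeating the side effect.
            self.history.append({"key": key, "agent": agent, "status": "replayed"})
            return self.results[key]
        result = action()
        self.results[key] = result
        self.history.append({"key": key, "agent": agent, "status": "executed"})
        return result


# Assumed example action with a visible side effect.
invoices: List[str] = []


def create_invoice() -> str:
    invoices.append("INV-1")
    return "INV-1"


executor = IdempotentExecutor()
first = executor.run("order-42:create-invoice", "billing-agent", create_invoice)
retry = executor.run("order-42:create-invoice", "billing-agent", create_invoice)
```

The retry returns the stored result and the invoice is created only once, while the history records both the execution and the replay.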

New comment by chirdeeps in "Ask HN: How are you testing AI agents before shipping to production?"

chirdeeps — Sat, 14 Mar 2026 00:14:59 +0000

This thread is an incredible resource for adversarial security testing, but I'd love to pull on the "Cascade failures" (#5) thread from the original post, because that's what actually takes down production systems most often.

We spend so much time testing whether the model will break, and almost no time testing whether the workflow can recover when the model inevitably does break. If an agent is executing a 4-step sequence and fails on step 3, how do you test what happens next? Does it orphan the data from steps 1 and 2? Does it infinitely retry and duplicate records?

The biggest gap in agent testing right now is that we test agents like they are stateless functions, when in reality they are long-running stateful processes. You can't just test the prompt; you have to test the system's idempotency. If you can't safely kill an agent mid-task and restart it without corrupting your database, the system isn't production-ready, regardless of how robust your prompt injection firewall is.

Please do share the framework; I'm curious where we miss the point. The surface is ever expanding post-OpenClaw.
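As a rough illustration of the kill-and-restart test described above, the sketch below runs a 4-step workflow, simulates a crash on step 3, restarts it, and checks that no step's side effect is duplicated. The names (`run_workflow`, `SimulatedCrash`, the `completed` set) are assumptions for the example, and the in-memory set stands in for a durable checkpoint store.

```python
from typing import Callable, Dict, List, Set


class SimulatedCrash(Exception):
    """Stands in for the agent process being killed mid-task."""


def run_workflow(steps: Dict[str, Callable[[], None]], completed: Set[str]) -> None:
    # `completed` plays the role of a durable checkpoint store.
    # Note: in a real system the step and the checkpoint write are not atomic,
    # so each step would still need to be idempotent on its own.
    for name, step in steps.items():
        if name in completed:
            continue  # finished in an earlier attempt, skip on restart
        step()
        completed.add(name)


side_effects: List[str] = []
crash_once = {"armed": True}


def make_step(name: str) -> Callable[[], None]:
    def step() -> None:
        if name == "step3" and crash_once["armed"]:
            crash_once["armed"] = False
            raise SimulatedCrash(name)
        side_effects.append(name)  # the observable side effect of the step
    return step


steps = {f"step{i}": make_step(f"step{i}") for i in range(1, 5)}
completed: Set[str] = set()

try:
    run_workflow(steps, completed)
except SimulatedCrash:
    pass  # the "process" died on step 3

run_workflow(steps, completed)  # restart from the checkpoint
```

The assertion worth making in a test like this is on the side effects, not on the return value: each step should appear exactly once after the restart.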

New comment by chirdeeps in "Ask HN: How are you monitoring AI agents in production?"

chirdeeps — Sat, 14 Mar 2026 00:08:08 +0000

The distinction between reversible and irreversible actions mentioned here is crucial, but there's an organizational layer to this problem that most monitoring tools miss entirely.

When you scale past a single team, you inevitably end up with a fragmented stack. Team A builds a support bot in LangGraph, Team B builds a research agent in CrewAI, and Team C writes raw Python against the Anthropic API. If you rely on framework-level monitoring or prompt-level guardrails, your audit trail is completely fractured. You can't confidently tell a compliance officer what your synthetic workforce is doing.

We realized that observability and governance cannot live inside the agent framework. They have to live in an independent execution layer that sits between the agents and your business systems. The agent proposes an intent, but the execution layer acts as the system of record (verifying authority, checking budget, and logging the action) before the API call is ever allowed to hit your database.
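A minimal sketch of that propose-then-execute flow, under stated assumptions: `Intent`, `ExecutionLayer`, the permission table, and the per-agent budget are all hypothetical names for illustration, not any framework's API. The point is only that the check and the audit entry happen in one place, before the call runs, regardless of which framework produced the intent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Intent:
    agent: str
    action: str
    cost: float


@dataclass
class ExecutionLayer:
    permissions: Dict[str, Set[str]]   # agent -> actions it may perform
    budgets: Dict[str, float]          # agent -> remaining budget
    audit_log: List[dict] = field(default_factory=list)

    def submit(self, intent: Intent, call: Callable[[], None]) -> bool:
        allowed = intent.action in self.permissions.get(intent.agent, set())
        affordable = intent.cost <= self.budgets.get(intent.agent, 0.0)
        approved = allowed and affordable
        # Log the decision before anything touches the business system.
        self.audit_log.append(
            {"agent": intent.agent, "action": intent.action,
             "cost": intent.cost, "approved": approved}
        )
        if not approved:
            return False
        self.budgets[intent.agent] -= intent.cost
        call()
        return True


layer = ExecutionLayer(
    permissions={"support-bot": {"refund"}},
    budgets={"support-bot": 100.0},
)
calls: List[str] = []

ok = layer.submit(Intent("support-bot", "refund", 60.0), lambda: calls.append("refund-1"))
over_budget = layer.submit(Intent("support-bot", "refund", 60.0), lambda: calls.append("refund-2"))
unauthorized = layer.submit(Intent("research-agent", "refund", 1.0), lambda: calls.append("refund-3"))
```

Only the first intent reaches the business call; the other two are rejected but still leave an audit entry, which is the part a compliance review would read.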

New comment by chirdeeps in "How are people debugging multi-agent AI workflows in production?"

chirdeeps — Sat, 14 Mar 2026 00:05:01 +0000

OpenTelemetry and standard observability stacks are great for seeing the latency and token counts of individual LLM calls, but they break down when you try to debug the coordination between agents.

The hardest failure mode we've had to debug isn't a single agent hallucinating; it's Agent A correctly doing its job, but passing slightly malformed state to Agent B, which then confidently executes a destructive action based on that bad state. By the time you see the error, the root cause is three steps up the chain.

Tracing doesn't solve this because it just shows you the execution path, not the authority boundary. What you actually need is a way to enforce contracts between agents: an execution layer that says "Agent B cannot accept this payload from Agent A unless it meets X criteria, and if it fails, roll back Agent A's last action." Until we treat multi-agent systems as concurrent state machines rather than just chained API calls, debugging them is going to remain a nightmare.
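One way such a contract check could look, as a sketch only: `hand_off`, `ContractViolation`, and `refund_contract` are hypothetical names, and the "X criteria" here are an assumed example (a non-empty customer ID and a positive amount). The rollback is modelled as undoing Agent A's last staged write.

```python
from typing import Any, Callable, Dict, List


class ContractViolation(Exception):
    """Raised when a payload fails the contract between two agents."""


def hand_off(
    payload: Dict[str, Any],
    contract: Callable[[Dict[str, Any]], bool],
    consumer: Callable[[Dict[str, Any]], None],
    rollback_producer: Callable[[], None],
) -> None:
    # Validate before the consumer ever sees the payload.
    if not contract(payload):
        rollback_producer()  # undo the producer's last action
        raise ContractViolation(f"payload rejected: {payload!r}")
    consumer(payload)


def refund_contract(payload: Dict[str, Any]) -> bool:
    # Assumed example criteria, not taken from any real system.
    return (
        isinstance(payload.get("customer_id"), str)
        and payload["customer_id"] != ""
        and isinstance(payload.get("amount"), (int, float))
        and payload["amount"] > 0
    )


staged: List[Dict[str, Any]] = []    # Agent A's writes
executed: List[Dict[str, Any]] = []  # Agent B's (destructive) actions


def agent_a_stage(payload: Dict[str, Any]) -> Dict[str, Any]:
    staged.append(payload)
    return payload


bad = agent_a_stage({"customer_id": "", "amount": 25})
rejected = False
try:
    hand_off(bad, refund_contract, executed.append, lambda: staged.pop())
except ContractViolation:
    rejected = True

good = agent_a_stage({"customer_id": "c-17", "amount": 25})
hand_off(good, refund_contract, executed.append, lambda: staged.pop())
```

The malformed payload is stopped at the boundary and Agent A's staged write is removed, so the failure surfaces at the hand-off rather than three steps later.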