Hacker News: oliver_dr

New comment by oliver_dr in "Show HN: Armalo AI – The Infrastructure for Agent Networks"

oliver_dr — Wed, 11 Mar 2026 23:23:26 +0000

The framing of "benchmarks measure capability, we measure reliability" resonates. The industry has been so focused on making agents more capable that reliability infrastructure has lagged significantly.

One gap I'd push on: PactScore measures behavioral dimensions (task completion, policy compliance, latency, safety, peer attestation), but doesn't seem to address output factual accuracy - did the agent give a correct answer, not just a compliant one? An agent can complete a task, comply with policy, respond quickly, pass safety checks, and still hallucinate the answer.

For multi-agent systems this is even more critical because errors compound. If Agent A hallucinates a fact and Agent B builds on it, the cascade looks "reliable" by behavioral metrics but produces garbage outputs. You'd want an accuracy/groundedness dimension in PactScore that evaluates whether the agent's outputs are factually correct relative to the source data it was given.

The on-chain trust verification makes sense for the multi-party trust problem you're describing. Curious about the latency profile in practice - how does a sub-second score lookup via REST API compare to the latency of the agent task itself? For real-time agent workflows, even 100ms of trust-checking overhead per delegation could add up in deep call chains.

New comment by oliver_dr in "Show HN: Vigil – Zero-dependency safety guardrails for AI agent tool calls"

oliver_dr — Wed, 11 Mar 2026 22:34:42 +0000

Nice approach. The "don't trust an LLM to guard another LLM" principle is sound for tool-call safety specifically, where the threat model is well-defined (destructive commands, SSRF, path traversal, etc.) and pattern matching gives you deterministic guarantees.

Where this gets interesting is at the boundary between tool-call safety and output quality. Vigil solves the "agent tries to rm -rf /" problem, but there's a whole class of failures where the agent makes safe tool calls that produce wrong results - querying the right database but misinterpreting results, calling an API correctly but hallucinating the summary it returns to the user, following the workflow but giving a factually incorrect final answer.

For that layer, deterministic rules don't scale because you'd need rules for every possible factual claim. That's where LLM-based evaluation actually makes sense - not as the guard for tool execution, but as the quality check on the final output. Think of it as two layers:

1. Vigil-style - deterministic, sub-millisecond checks on tool calls and actions (is this safe to execute?)

2. Semantic evaluation - LLM-based scoring on the output (is this correct, complete, and grounded in the provided context?)

The combination of both is what production agent systems actually need. We've been building the second layer at DeepRails (evaluate + auto-remediate when quality checks fail), and something like Vigil would complement it well as the first layer.

One thought on false positives: have you considered a "soft block" mode where flagged-but-borderline calls get routed through human approval rather than hard-blocked? For long-running agent tasks, a hard block with no fallback can leave workflows in broken states.

New comment by oliver_dr in "Tell HN: Crosstalk when using Ollama with cloud DeepSeek models?"

oliver_dr — Wed, 11 Mar 2026 22:33:46 +0000

This is almost certainly a server-side session isolation bug rather than an LLM hallucination - the model is returning a response to someone else's prompt. The DeepSeek cloud endpoints have had documented issues with request routing under load.

That said, it highlights something important: whether the failure mode is hallucination, crosstalk, or retrieval contamination, production systems need output validation that doesn't just check "did the model generate text?" but "is this response actually relevant and correct for this specific query?"

For anyone running cloud-hosted models in production (especially for anything beyond hobby use), a few practical safeguards:

1. Semantic coherence check - verify the response is topically related to your prompt. A simple embedding similarity between prompt and response catches gross crosstalk like medical answers to coding questions.

2. Instruction adherence scoring - evaluate whether the output actually follows the system prompt and user instructions. This catches both hallucinations and routing errors.

3. Ground truth verification for RAG - if you're passing context documents, verify the response is grounded in those documents and not pulling from cached state of another request.

This is exactly the class of failure that makes self-hosted models attractive for anything with real stakes. When you don't control the inference server, you inherit all its bugs. If you must use cloud inference, wrapping it with an output evaluation layer is the minimum.

New comment by oliver_dr in "Ask HN: What are you using to mitigate prompt injection?"

oliver_dr — Wed, 11 Mar 2026 22:32:49 +0000

We've been dealing with this at multiple layers. Here's what actually works in production:

Input-side (preventing injection):

- Strict input sanitization with role-boundary enforcement in the system prompt. Sounds basic, but most people skip it.

- Separate "user content" from "system instructions" at the API level. Don't concatenate untrusted input into your system prompt. Use the dedicated `user` role in the messages array.

- For tool-calling agents, validate that tool arguments match expected schemas before execution. An LLM-as-judge approach for tool call safety is expensive but effective for high-stakes actions.

Output-side (catching when injection succeeds):

This is the part most people underinvest in. Even with perfect input filtering, you still need output guardrails:

- Run the LLM output through evaluation metrics that score for factual correctness, instruction adherence, and safety before it reaches the user.

- For RAG systems specifically, verify that the generated answer is actually grounded in the retrieved context, not fabricated or influenced by injected instructions.

The "defense in depth" framing matters here. Input filtering alone has a ceiling because adversarial prompts evolve faster than regex rules. Output evaluation catches the failures that slip through. We use DeepRails' Defend API for this layer - it scores outputs on correctness, completeness, and safety, then auto-remediates failures before they reach end users. But the principle applies regardless of tooling: treat output verification as a first-class concern, not an afterthought.

Simon Willison's work on dual-LLM patterns is also worth reading if you haven't: https://simonwillison.net/2023/Apr/25/dual-llm-pattern/