<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: tonyww</title><link>https://news.ycombinator.com/user?id=tonyww</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 03 May 2026 10:48:38 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=tonyww" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by tonyww in "Show HN: Browser Harness – Gives LLM freedom to complete any browser task"]]></title><description><![CDATA[
<p>Browser use is a token hog</p>
]]></description><pubDate>Sun, 26 Apr 2026 01:21:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47906391</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=47906391</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47906391</guid></item><item><title><![CDATA[Show HN: A 3-line wrapper that enforces deterministic security for AI agents]]></title><description><![CDATA[
<p>If you are building AI agents with frameworks like browser-use, LangChain, or OpenClaw, you've likely hit the "blast radius" problem.<p>A misconfigured prompt or hallucination can cause an agent to navigate to a phishing domain, expose an API key, or confidently claim a task succeeded when it actually clicked a disabled button.<p>The standard fix right now is "LLM-as-a-judge": taking a screenshot after the fact and asking GPT-4, "Did this work and is it safe?" That introduces massive latency, burns tokens, and is fundamentally probabilistic.<p>We built predicate-secure to fix this.<p>It’s a drop-in Python wrapper that adds a deterministic "physics engine" to your agent's execution loop.<p>In 3 to 5 lines of code, without rewriting your agent, it enforces a complete three-phase loop:<p>Pre-execution authorization:<p>Before the agent's action hits the OS or browser, it is intercepted and evaluated against a local, fail-closed YAML policy (e.g., allow browser.click on button#checkout, deny fs.read on ~/.ssh/*).<p>Action execution:<p>The agent executes the raw Playwright/framework action.<p>Post-execution verification:<p>It mathematically diffs the "before" and "after" states (DOM or system) to prove the action succeeded.<p>To avoid the "LLM-as-a-judge" trap, the verification step itself is purely deterministic. We use a local, offline LLM (Qwen 2.5 7B Instruct) strictly to generate the verification predicates from the observed state changes (e.g., asserting url_contains('example.com') or element_exists('#success')), and the runtime then evaluates those predicates deterministically in milliseconds.<p>The DX looks like this:<pre>
from predicate_secure import SecureAgent
from browser_use import Agent

# 1. Your existing unverified agent
agent = Agent(task="Buy headphones on Amazon", llm=my_model)

# 2. Drop in the Predicate wrapper
secure_agent = SecureAgent(
    agent=agent,
    policy="policies/shopping.yaml",
    mode="strict",
)

# 3. Runs with full pre- and post-execution verification
secure_agent.run()
</pre><p>We have out-of-the-box adapters for browser-use, LangChain, PydanticAI, OpenClaw, and raw Playwright.<p>Because we know developers hate giving external SaaS tools access to their agent's context, the entire demo and verification loop runs 100% offline on your local machine (tested on Apple Silicon MPS and CUDA).<p>For enterprise/production fleets, the pre-execution gate can optionally be offloaded to our open-source Rust sidecar (predicate-authorityd) for <1ms policy evaluations.<p>The repo is open-source (MIT/Apache 2.0). We put together a complete, offline demo showing the wrapper blocking unauthorized navigation and verifying clicks locally using the Qwen 7B model.<p>Repo and demo: <a href="https://github.com/PredicateSystems/predicate-secure" rel="nofollow">https://github.com/PredicateSystems/predicate-secure</a><p>A second demo, for securing OpenClaw:<p><a href="https://github.com/PredicateSystems/predicate-claw" rel="nofollow">https://github.com/PredicateSystems/predicate-claw</a><p>Demo (GIF):<p><a href="https://github.com/PredicateSystems/predicate-claw/blob/main/examples/integration-demo/demo.gif" rel="nofollow">https://github.com/PredicateSystems/predicate-claw/blob/main...</a><p>I'd love to hear what the community thinks about deterministic verification vs. probabilistic LLM judges, or answer any questions about the architecture!</p>
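<p>To make the "mathematical diff" concrete, here is a minimal sketch of the idea; the snapshot shapes and field names are invented for illustration, not predicate-secure's actual API:</p>

```python
# Hypothetical sketch of the post-execution diff: compare "before" and
# "after" snapshots, then assert the changes the action should have caused.
# No model is consulted at evaluation time -- this is plain set/string logic.
def diff_states(before, after):
    return {
        "url_changed": before["url"] != after["url"],
        "appeared": after["elements"] - before["elements"],
        "disappeared": before["elements"] - after["elements"],
    }

before = {"url": "/product", "elements": {"#add-to-cart"}}
after  = {"url": "/product", "elements": {"#add-to-cart", "#success"}}

d = diff_states(before, after)
click_verified = "#success" in d["appeared"]  # deterministic, millisecond-scale
print(click_verified)  # True
```

<p>The generated predicates (url_contains, element_exists, ...) would be evaluated against exactly this kind of diffed state.</p>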
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47365599">https://news.ycombinator.com/item?id=47365599</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 13 Mar 2026 15:14:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=47365599</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=47365599</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47365599</guid></item><item><title><![CDATA[New comment by tonyww in "Show HN: Predicate-Claw – Run Time Assurance (RTA) for OpenClaw via Rust Sidecar"]]></title><description><![CDATA[
<p>AI agents currently operate on a flawed security model: they inherit the ambient permissions of the terminal they are spawned in. If an agent gets prompt-injected or hallucinates, a broad blast radius is guaranteed. I built predicate-claw to fix this. It’s a drop-in security plugin for OpenClaw, paired with a lightweight Rust daemon (predicate-authorityd). The core architecture is essentially defense-grade Run Time Assurance (RTA) applied to LLMs. Since you cannot formally verify a non-deterministic black box, you have to physically decouple the "brain" from the "actuators" and drop a hard-coded, deterministic gatekeeper in the middle. How it works:<p>The Interceptor: We hook into OpenClaw's before_tool_call execution loop. The LLM has no idea the security layer exists.
<p>The Sidecar Gate: The tool request is routed to the local Rust daemon, which evaluates the intent against a deterministic YAML policy (e.g., blocking rm -rf, allowing fs.read only in ./src). It fails closed by default.
<p>The TUI: The daemon ships with a terminal UI to monitor all agent requests, allows, and denies in real time.
I built this in Rust to get strict memory safety with <1ms of latency overhead. It compiles to a static binary and drops into existing projects with zero friction.<p>Link to GitHub Repo: <a href="https://github.com/PredicateSystems/predicate-claw" rel="nofollow">https://github.com/PredicateSystems/predicate-claw</a><p>Demo (GIF): <a href="https://github.com/PredicateSystems/predicate-claw/blob/main/examples/integration-demo/demo.gif" rel="nofollow">https://github.com/PredicateSystems/predicate-claw/blob/main...</a><p>We already use deterministic post-execution verification for our web agents (DOM snapshot diffing, strictly avoiding the 'LLM-as-judge' trap). Next on the roadmap is bringing that same verifiable state-hashing to the OS level. I’d love to hear your thoughts on the architecture and how you're currently handling local agent sandboxing. Note: If you aren't using OpenClaw, our core engine also supports Python frameworks like LangChain and browser-use in 3 lines of code.<p>You can read the full architecture and see our enterprise fleet management here: <a href="https://predicatesystems.ai/docs/vault" rel="nofollow">https://predicatesystems.ai/docs/vault</a></p>
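<p>For readers who want the gist of the gate without reading the Rust: a minimal fail-closed sketch in Python. The policy dict, rule tuples, and the before_tool_call hook shape are illustrative, not predicate-claw's real schema:</p>

```python
from fnmatch import fnmatch

# Policy as it might look after parsing the YAML: explicit allows and denies,
# everything unmatched is denied (fail-closed).
POLICY = {
    "allow": [("fs.read", "./src/*")],
    "deny":  [("shell.exec", "rm -rf*")],
}

def authorize(tool, target, policy=POLICY):
    # Deny rules win outright.
    for t, pat in policy["deny"]:
        if tool == t and fnmatch(target, pat):
            return False
    # Otherwise only an explicit allow passes; the default is deny.
    return any(tool == t and fnmatch(target, pat)
               for t, pat in policy["allow"])

def before_tool_call(tool, target):
    # Interceptor hook: the LLM never sees this layer.
    if not authorize(tool, target):
        raise PermissionError(f"blocked: {tool} {target}")
```

<p>The real gate lives in the Rust sidecar for latency and memory-safety reasons; the fail-closed ordering (deny, then explicit allow, then default-deny) is the part that matters.</p>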
]]></description><pubDate>Mon, 02 Mar 2026 16:07:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=47219800</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=47219800</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47219800</guid></item><item><title><![CDATA[Show HN: Predicate-Claw – Run Time Assurance (RTA) for OpenClaw via Rust Sidecar]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/PredicateSystems/predicate-claw">https://github.com/PredicateSystems/predicate-claw</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47219794">https://news.ycombinator.com/item?id=47219794</a></p>
<p>Points: 2</p>
<p># Comments: 1</p>
]]></description><pubDate>Mon, 02 Mar 2026 16:06:52 +0000</pubDate><link>https://github.com/PredicateSystems/predicate-claw</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=47219794</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47219794</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>Yeah, that’s a pretty good analogy.<p>The main difference is that the “tests” are predicates over live browser state and are often proposed alongside the plan on the fly, not written upfront by a developer. But conceptually it’s very close: make the expected outcome explicit, try an action, verify, and only move forward if the condition actually holds.</p>
]]></description><pubDate>Wed, 28 Jan 2026 22:13:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=46802326</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46802326</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46802326</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>Absolutely agree on the compounding-error point - that’s exactly what pushed us toward verification.<p>On “verification wrong”: we try hard to keep predicates grounded and re-evaluated, not “check a cached handle”. Assertions do re-snapshot / re-query during each retry, and we scope them to signals that should change (URL, existence/state of an element, text/value).<p>If the page is flaky or stale, the assertion simply won’t prove the condition within the retry window, and we fail with artifacts such as video frames (if ffmpeg is available) rather than claiming success.<p>There are still edge cases (virtualized DOM, optimistic UI, async updates), but even there the goal is the same: make the failure explicit and debuggable with artifacts and time-travel traces, not silently drift.</p>
]]></description><pubDate>Wed, 28 Jan 2026 19:39:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=46800487</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46800487</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46800487</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>It’s mostly the former: there’s a small set of generic checks/primitives, and we choose which ones to apply per step.<p>The binding between “task/step” and “what to verify” can come from either:<p>the user (explicit assertions), or
the planner/executor proposing a post-condition (e.g. “after clicking checkout, URL contains /checkout and a checkout button exists”).<p>But the verifier itself is not an AI; by design, it’s predicate-only.</p>
]]></description><pubDate>Wed, 28 Jan 2026 19:34:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=46800419</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46800419</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46800419</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>I’m absolutely not an AI. I dedicated this morning to technical discussion with the HN community on my post; I spent weeks building the technology behind it.</p>
]]></description><pubDate>Wed, 28 Jan 2026 19:26:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=46800324</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46800324</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46800324</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>Totally agree - hybrid approaches can work well, especially on messy pages. We’ve seen the same tradeoff.<p>On the verification side, though, dynamic pages are exactly why we scope assertions narrowly (specific predicates, bounded retries via the eventually() function) instead of diffing the whole page. If the expected condition can’t be proven within that window, we fail fast rather than guessing.</p>
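<p>A rough sketch of that bounded-retry idea - the eventually name echoes the function mentioned above, but this signature and the snapshot plumbing are assumptions:</p>

```python
import time

def eventually(predicate, snapshot_fn, timeout_s=10.0, poll_s=0.25):
    # Re-snapshot on every attempt so each check sees live state,
    # then fail fast once the window is exhausted.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate(snapshot_fn()):
            return True
        time.sleep(poll_s)
    return False

# Toy page whose state "settles" after a couple of polls.
states = iter([{"url": "/cart"}, {"url": "/cart"}, {"url": "/checkout"}])
ok = eventually(lambda s: "/checkout" in s["url"],
                lambda: next(states), timeout_s=2.0, poll_s=0.01)
print(ok)  # True
```

<p>The key property is that a timeout is an explicit failure, not a silent success.</p>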
]]></description><pubDate>Wed, 28 Jan 2026 19:21:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=46800257</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46800257</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46800257</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>Importance ranking is just a heuristic pass that scores/prioritizes elements (size, visibility, role, state) so the snapshot stays small and focused. It’s deterministic, not ML.<p>The verification layer absolutely still exists without it: assertions, predicates, retries, and artifacts all work locally. The API-backed ranking just improves pruning quality on very dense pages, but it’s not required for correctness.<p>You can set use_api = False in the SnapshotOptions to avoid using the API.</p>
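<p>To illustrate what a deterministic, non-ML scoring pass can look like (the weights and field names here are invented for the sketch, not the actual heuristic):</p>

```python
# Hypothetical scoring: larger, visible, interactive elements rank higher,
# so the pruned snapshot keeps what an agent is most likely to act on.
INTERACTIVE_ROLES = {"button", "link", "textbox"}

def score(el):
    s = el["width"] * el["height"] / 10_000        # size
    s += 5.0 if el["visible"] else -100.0          # hidden elements drop out
    s += 3.0 if el["role"] in INTERACTIVE_ROLES else 0.0
    s += 1.0 if el.get("enabled", True) else -2.0  # state
    return s

def prune(elements, keep=2):
    # Deterministic: the same input always yields the same ranking.
    return sorted(elements, key=score, reverse=True)[:keep]

elements = [
    {"role": "button", "width": 120, "height": 40, "visible": True},
    {"role": "div", "width": 800, "height": 600, "visible": True},
    {"role": "link", "width": 90, "height": 20, "visible": False},
]
top = prune(elements, keep=2)
```

<p>Because the scoring is a pure function of the element data, the ranking is reproducible run to run - unlike an ML ranker.</p>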
]]></description><pubDate>Wed, 28 Jan 2026 19:06:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=46800087</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46800087</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46800087</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>The WASM pass is fully deterministic: it’s just code running in the page to extract and prune post-rendered elements (roles, geometry, visibility, layout, etc.); no agent is involved in the Chrome extension.<p>The “deterministic overrides” aren’t generated by a verifier agent either; they’re runtime rules that kick in when assertions or ordinality constraints are explicit (e.g. “first result”). The verifier just checks outcomes - it doesn’t invent actions. AI agents are non-deterministic by nature, and that’s exactly what we don’t want to introduce into the verification layer (predicate-only).</p>
]]></description><pubDate>Wed, 28 Jan 2026 18:23:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=46799479</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46799479</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46799479</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>Thanks - that’s exactly our motivation. The key shift for us was moving from “did the agent probably do the right thing?” to “can we prove the state we expected actually holds?”<p>The property-based-testing analogy is a good one: once you make success explicit, failures become actionable instead of mysterious.</p>
]]></description><pubDate>Wed, 28 Jan 2026 16:42:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=46797777</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46797777</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46797777</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>The accessibility tree is definitely useful, and we do look at it. The issue we ran into is that it’s optimized for assistive consumption, not for action verification or layout reasoning on dynamic SPAs.<p>In practice we’ve seen cases where AX is incomplete, lags hydration, or doesn’t reflect overlays / grouping accurately. It does not support ordinality queries well. That’s why we anchor on post-rendered DOM + geometry and then verify outcomes explicitly, rather than relying on any single representation.</p>
]]></description><pubDate>Wed, 28 Jan 2026 16:37:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=46797675</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46797675</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46797675</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>A quick clarification on intent, since “browser automation” means different things to different people:<p>This isn’t about making scripts smarter or replacing Playwright/Selenium. The problem I’m exploring is reliability: how to make agent-driven browser execution fail deterministically and explainably instead of half-working when layouts change.<p>Concretely, the agent doesn’t just “click and hope”. Each step is gated by explicit post-conditions, similar to how tests assert outcomes:<p>----
## Python Code Example:<p>ready = runtime.assert_(
    all_of(url_contains("checkout"), exists("role=button")),
    "checkout_ready",
    required=True
)<p>----<p>If the condition isn’t met, the run stops with artifacts instead of drifting forward. Vision models are optional fallbacks, not the primary control signal.<p>Happy to answer questions about the design tradeoffs or where this approach falls short.</p>
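<p>For anyone curious how predicate combinators like the ones above stay model-free, here is a toy reimplementation of those names; the real SDK's signatures and snapshot format may differ:</p>

```python
# Each helper returns a plain function over a page snapshot, so the
# verifier composes and evaluates them with no model in the loop.
def url_contains(fragment):
    return lambda page: fragment in page["url"]

def exists(query):
    return lambda page: query in page["queries"]

def all_of(*preds):
    return lambda page: all(p(page) for p in preds)

checkout_ready = all_of(url_contains("checkout"), exists("role=button"))
page = {"url": "https://shop.test/checkout", "queries": {"role=button"}}
print(checkout_ready(page))  # True
```

<p>Composition stays boolean all the way down, which is what makes the gate explainable: a failure names the exact predicate that didn’t hold.</p>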
]]></description><pubDate>Wed, 28 Jan 2026 02:58:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=46790507</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46790507</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46790507</guid></item><item><title><![CDATA[A verification layer for browser agents: Amazon case study]]></title><description><![CDATA[
<p>A common approach to automating Amazon shopping or similar complex websites is to reach for large cloud models (often vision-capable). I wanted to test the opposite premise: can a local ~3B-parameter LLM complete the flow using only structural page data (DOM) plus deterministic assertions?<p>This post summarizes four runs of the same task (search → first product → add to cart → checkout on Amazon). The key comparison is Demo 0 (cloud baseline) vs Demo 3 (local autonomy); Demos 1–2 are intermediate controls.<p>More technical detail (architecture, code excerpts, additional log snippets):<p><a href="https://www.sentienceapi.com/blog/verification-layer-amazon-case-study" rel="nofollow">https://www.sentienceapi.com/blog/verification-layer-amazon-...</a><p>Demo 0 vs Demo 3:<p>Demo 0 (cloud, GLM‑4.6 + structured snapshots)
  success: 1/1 run
  tokens:  19,956  (~43% reduction vs ~35k estimate)
  time:    ~60,000ms
  cost:    cloud API (varies)
  vision:  not required<p>Demo 3 (local, DeepSeek R1 planner + Qwen ~3B executor)
  success: 7/7 steps (re-run)
  tokens:  11,114
  time:    405,740ms
  cost:    $0.00 incremental (local inference)
  vision:  not required<p>Latency note: the local stack is slower end-to-end here largely because inference runs on local hardware (Mac Studio with M4); the cloud baseline benefits from hosted inference, but has per-token API cost.<p>Architecture<p>This worked because we changed the control plane and added a verification loop.<p>1) Constrain what the model sees (DOM pruning).
We don’t feed the entire DOM or screenshots. We collect raw elements, then run a WASM pass to produce a compact “semantic snapshot” (roles/text/geometry) and prune the rest (often on the order of ~95% of nodes).<p>2) Split reasoning from acting (planner vs executor).<p>Planner (reasoning): DeepSeek R1 (local) generates step intent + what must be true afterward.
Executor (action): Qwen ~3B (local) selects concrete DOM actions like CLICK(id) / TYPE(text).
3) Gate every step with Jest‑style verification.
After each action, we assert state changes (URL changed, element exists/doesn’t exist, modal/drawer appeared). If a required assertion fails, the step fails with artifacts and bounded retries.<p>Minimal shape:<p>ok = await runtime.check(
    exists("role=textbox"),
    label="search_box_visible",
    required=True,
).eventually(timeout_s=10.0, poll_s=0.25, max_snapshot_attempts=3)<p>What changed between “agents that look smart” and agents that work
Two examples from the logs:<p>Deterministic override to enforce “first result” intent: “Executor decision … [override] first_product_link -> CLICK(1022)”<p>Drawer handling that verifies and forces the correct branch: “result: PASS | add_to_cart_verified_after_drawer”<p>The important point is that these are not post‑hoc analytics. They are inline gates: the system either proves it made progress or it stops and recovers.<p>Takeaway
If you’re trying to make browser agents reliable, the highest‑leverage move isn’t a bigger model. It’s constraining the state space and making success/failure explicit with per-step assertions.<p>Reliability in agents comes from verification (assertions on structured snapshots), not just scaling model size.</p>
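<p>The plan → act → verify loop described above can be sketched end to end. Everything here is illustrative: the planner/executor are stubbed out (in the real system they are local LLMs), and only the gate must stay deterministic:</p>

```python
# Illustrative step loop: plan -> act -> verify. A step only "counts"
# if its post-condition provably holds against live browser state.
def run_step(step, browser):
    intent = step["intent"]        # proposed by the planner
    action = step["action"]        # e.g. ("CLICK", 1022), chosen by the executor
    browser.apply(action)
    if not step["post_condition"](browser.state()):
        raise AssertionError(f"step failed: {intent}")
    return True

class FakeBrowser:
    # Stand-in for the real runtime, just enough to exercise the gate.
    def __init__(self):
        self._state = {"url": "/results"}
    def apply(self, action):
        if action == ("CLICK", 1022):   # "first product link"
            self._state = {"url": "/product/123"}
    def state(self):
        return self._state

step = {
    "intent": "open first result",
    "action": ("CLICK", 1022),
    "post_condition": lambda s: "/product" in s["url"],
}
ok = run_step(step, FakeBrowser())
```

<p>A failed gate raises with the step's intent attached, which is what turns "half-working" runs into explicit, debuggable failures.</p>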
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46790127">https://news.ycombinator.com/item?id=46790127</a></p>
<p>Points: 56</p>
<p># Comments: 19</p>
]]></description><pubDate>Wed, 28 Jan 2026 02:08:14 +0000</pubDate><link>https://sentienceapi.com/blog/verification-layer-amazon-case-study</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46790127</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46790127</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>Sorry for the misunderstanding - I intended to post it as a news/engineering article, which is why I didn't include *Show HN* in the title.</p>
]]></description><pubDate>Thu, 22 Jan 2026 15:20:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=46720388</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46720388</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46720388</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>Good question. On the surface, it does look very similar to a traditional scraper/script, but there's a subtle difference in where the logic lives and how failures are handled.<p>A traditional scraper/script hard-codes selectors and control flow up front. When the layout changes, it usually breaks at an arbitrary line and you debug it manually.<p>In this setup, the agent chooses actions at *runtime* from a bounded action space, and the system uses built-in predicates (e.g. url_changes, drawer_appeared, etc.) to verify the outcomes. When it fails, it fails at a specific semantic assertion with artifacts, not at a missing selector.<p>So it’s less “replace scripts” and more “apply test-style verification and recovery to AI-driven decisions instead of static code.”</p>
]]></description><pubDate>Thu, 22 Jan 2026 15:19:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=46720376</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46720376</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46720376</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>Yes, the repo is publicly available: <a href="https://github.com/SentienceAPI/sentience-sdk-playground" rel="nofollow">https://github.com/SentienceAPI/sentience-sdk-playground</a>
You can clone it and set up the dependencies (including a Sentience API key), then run main.py in the planner_executor_local folder.</p>
]]></description><pubDate>Thu, 22 Jan 2026 15:15:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=46720305</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46720305</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46720305</guid></item><item><title><![CDATA[New comment by tonyww in "A verification layer for browser agents: Amazon case study"]]></title><description><![CDATA[
<p>One clarification, since a few comments from coworkers/friends are circling this: Amazon isn’t the point here.<p>We used it because it’s a dynamic, hostile UI, but the design goal is a site-agnostic control plane. That’s why the runtime avoids selectors and screenshots and instead operates on pruned semantic snapshots plus verification gates.<p>If the layout changes, the system doesn’t “half-work”: it fails deterministically with artifacts. That’s the behavior we’re optimizing for.</p>
]]></description><pubDate>Wed, 21 Jan 2026 15:03:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=46706653</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46706653</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46706653</guid></item><item><title><![CDATA[A verification layer for browser agents: Amazon case study]]></title><description><![CDATA[
<p>A common approach to automating Amazon shopping or similar complex websites is to reach for large cloud models (often vision-capable). I wanted to test the opposite premise: can a local ~3B-parameter LLM complete the flow using only structural page data (DOM) plus deterministic assertions?<p>This post summarizes four runs of the same task (search → first product → add to cart → checkout on Amazon). The key comparison is Demo 0 (cloud baseline) vs Demo 3 (local autonomy); Demos 1–2 are intermediate controls.<p>More technical detail (architecture, code excerpts, additional log snippets):<p><a href="https://www.sentienceapi.com/blog/verification-layer-amazon-case-study" rel="nofollow">https://www.sentienceapi.com/blog/verification-layer-amazon-...</a><p>Demo 0 vs Demo 3:<p>Demo 0 (cloud, GLM‑4.6 + structured snapshots)
  success: 1/1 run
  tokens:  19,956  (~43% reduction vs ~35k estimate)
  time:    ~60,000ms
  cost:    cloud API (varies)
  vision:  not required<p>Demo 3 (local, DeepSeek R1 planner + Qwen ~3B executor)
  success: 7/7 steps (re-run)
  tokens:  11,114
  time:    405,740ms
  cost:    $0.00 incremental (local inference)
  vision:  not required<p>Latency note: the local stack is slower end-to-end here largely because inference runs on local hardware (Mac Studio with M4); the cloud baseline benefits from hosted inference, but has per-token API cost.<p>Architecture<p>This worked because we changed the control plane and added a verification loop.<p>1) Constrain what the model sees (DOM pruning).
We don’t feed the entire DOM or screenshots. We collect raw elements, then run a WASM pass to produce a compact “semantic snapshot” (roles/text/geometry) and prune the rest (often on the order of ~95% of nodes).<p>2) Split reasoning from acting (planner vs executor).<p>Planner (reasoning): DeepSeek R1 (local) generates step intent + what must be true afterward.
Executor (action): Qwen ~3B (local) selects concrete DOM actions like CLICK(id) / TYPE(text).
3) Gate every step with Jest‑style verification.
After each action, we assert state changes (URL changed, element exists/doesn’t exist, modal/drawer appeared). If a required assertion fails, the step fails with artifacts and bounded retries.<p>Minimal shape:<p>ok = await runtime.check(
    exists("role=textbox"),
    label="search_box_visible",
    required=True,
).eventually(timeout_s=10.0, poll_s=0.25, max_snapshot_attempts=3)<p>What changed between “agents that look smart” and agents that work
Two examples from the logs:<p>Deterministic override to enforce “first result” intent: “Executor decision … [override] first_product_link -> CLICK(1022)”<p>Drawer handling that verifies and forces the correct branch: “result: PASS | add_to_cart_verified_after_drawer”<p>The important point is that these are not post‑hoc analytics. They are inline gates: the system either proves it made progress or it stops and recovers.<p>Takeaway
If you’re trying to make browser agents reliable, the highest‑leverage move isn’t a bigger model. It’s constraining the state space and making success/failure explicit with per-step assertions.<p>Reliability in agents comes from verification (assertions on structured snapshots), not just scaling model size.</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46706564">https://news.ycombinator.com/item?id=46706564</a></p>
<p>Points: 28</p>
<p># Comments: 8</p>
]]></description><pubDate>Wed, 21 Jan 2026 14:56:32 +0000</pubDate><link>https://www.sentienceapi.com/blog/verification-layer-amazon-case-study</link><dc:creator>tonyww</dc:creator><comments>https://news.ycombinator.com/item?id=46706564</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46706564</guid></item></channel></rss>