<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: aluzzardi</title><link>https://news.ycombinator.com/user?id=aluzzardi</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 07 May 2026 08:29:11 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=aluzzardi" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by aluzzardi in "The agent harness belongs outside the sandbox"]]></title><description><![CDATA[
<p>Thank you, appreciate it!<p>Regarding scoping: in our case, the agent loop runs the same way our API server does (i.e., it’s a multi-tenant service running in a container somewhere), and we solve scoping the same way.<p>To put it another way: whether it’s the API receiving “GET /memories/id” or the LLM requesting “Read(/memories/id)”, we do pretty much the same thing (check authN/authZ, scope the db request, etc.).<p>Basically, the LLM is just another API client using a slightly different format for inputs and outputs, but sharing the same permission layer.</p>
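<p>A minimal sketch of that "one shared permission layer" idea: both the HTTP route and the LLM tool call funnel through the same authorization and scoping code. All names here (authorize, fetch_memory, the user dict shape) are illustrative, not the author's actual code.</p>

```python
# Hypothetical sketch: API handler and LLM tool handler share one
# authN/authZ choke point. Nothing here is the real implementation.

class NotAuthorized(Exception):
    pass

def authorize(user, action, resource_id):
    # Single permission check shared by every caller.
    if resource_id not in user["memory_ids"]:
        raise NotAuthorized(f"{user['id']} cannot {action} {resource_id}")

def fetch_memory(user, memory_id):
    authorize(user, "read", memory_id)
    # A real DB query would also be scoped to user["tenant"] here.
    return {"id": memory_id, "tenant": user["tenant"]}

# HTTP path: GET /memories/<id>
def api_get_memory(user, memory_id):
    return fetch_memory(user, memory_id)

# LLM path: the model emitted Read(/memories/<id>)
def llm_tool_read(user, path):
    memory_id = path.rsplit("/", 1)[-1]
    return fetch_memory(user, memory_id)  # same checks, same scoping

user = {"id": "u1", "tenant": "t1", "memory_ids": {"m1"}}
assert api_get_memory(user, "m1") == llm_tool_read(user, "/memories/m1")
```

Both entry points are thin wrappers; the interesting part is that neither can bypass `authorize`.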
]]></description><pubDate>Sun, 03 May 2026 07:28:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=47994351</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47994351</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47994351</guid></item><item><title><![CDATA[New comment by aluzzardi in "The agent harness belongs outside the sandbox"]]></title><description><![CDATA[
<p>Author here.<p>I should have made it clearer that the article is about building agents and harnesses (not about running third-party agents).<p>> I barely trust the harness more than the LLM<p>Since we built it, I trust it just as much as I trust our API server :)<p>The latter gets untrusted inputs from the internet, while the former gets untrusted inputs from the LLM.</p>
]]></description><pubDate>Sun, 03 May 2026 01:05:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=47992217</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47992217</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47992217</guid></item><item><title><![CDATA[New comment by aluzzardi in "The agent harness belongs outside the sandbox"]]></title><description><![CDATA[
<p>Author here.<p>I’m worried about the same thing (models tuned for specific harnesses).<p>We actually work around that by respecting the “contract”. For instance, our harness’s Bash signature is exactly the same as Claude’s. We do our sandboxing stuff and respond using the same format.<p>In the “eyes” of the model there’s no difference between what Claude does and what we do (even though the implementation is completely different).<p>We basically use Claude’s tools as an API contract.</p>
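<p>Concretely, "using Claude's tools as the API contract" could look something like this: expose a tool whose name and input schema match what the model already expects, while the implementation routes elsewhere. The schema fields shown are an approximation of Claude Code's Bash tool, not an official spec, and the sandbox call is a placeholder.</p>

```python
# Sketch: keep the model-facing contract (tool name + schema) identical
# to Claude's Bash tool, but swap the backend for a remote sandbox.

BASH_TOOL = {
    "name": "Bash",
    "description": "Run a shell command",
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {"type": "string"},
            "timeout": {"type": "number"},
        },
        "required": ["command"],
    },
}

def run_in_remote_sandbox(command):
    # Placeholder: forward to a sandbox service instead of local exec.
    return {"stdout": f"(sandboxed) ran: {command}", "exit_code": 0}

def handle_tool_call(name, tool_input):
    # The model sees the familiar contract; we control the execution.
    if name == "Bash":
        return run_in_remote_sandbox(tool_input["command"])
    raise ValueError(f"unknown tool: {name}")

result = handle_tool_call("Bash", {"command": "ls"})
assert result["exit_code"] == 0
```

The model can't observe where the command ran; it only sees the same response format it was trained against.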
]]></description><pubDate>Sun, 03 May 2026 00:57:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=47992169</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47992169</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47992169</guid></item><item><title><![CDATA[New comment by aluzzardi in "The agent harness belongs outside the sandbox"]]></title><description><![CDATA[
<p>Author here.<p>This is an interesting and novel field, so I’m not pretending I know the answers, but this is what worked for us :)<p>At the end of the day, and oversimplifying things: why would I want to spawn a for loop that calls an API (the LLM) in its own dedicated sandbox/computer?<p>When the model wants to run a command, it’ll tell you so. It doesn’t need to be a local exec: you can run it anywhere, and the model won’t know the difference.<p>The agent loop itself doesn’t need sandboxing. In many cases, most tool calls don’t require sandboxing either. For the tools that do require a computer, you can route those requests there when needed, rather than running the whole harness in that sandbox.<p>To me, running the agent loop in the sandbox itself feels like saying “you should run your API in your DB container because it’ll talk to it at some point”.</p>
]]></description><pubDate>Sun, 03 May 2026 00:52:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=47992136</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47992136</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47992136</guid></item><item><title><![CDATA[New comment by aluzzardi in "The agent harness belongs outside the sandbox"]]></title><description><![CDATA[
<p>Author here.<p>I think the confusion is that “agent” is used for two very different things:<p>- building an agent<p>- an “agent” product/runtime (Claude Code, etc)<p>In the first case, the model never executes anything. It just outputs something like “call this API”. Your code is the one doing it, with whatever validation you want. There’s no need for a sandbox there because there’s no arbitrary execution.</p>
]]></description><pubDate>Sun, 03 May 2026 00:16:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=47991891</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47991891</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47991891</guid></item><item><title><![CDATA[New comment by aluzzardi in "The agent harness belongs outside the sandbox"]]></title><description><![CDATA[
<p>Author here.<p>In my opinion, the main driver here is how fast models have evolved in the past 12 months. It makes the architecture of everything around them obsolete, very fast.<p>We went from using models as a building block, wrapping them in heavy workflow code, to now models being smart enough to drive their own workflows and planning.</p>
]]></description><pubDate>Sat, 02 May 2026 23:47:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=47991730</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47991730</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47991730</guid></item><item><title><![CDATA[New comment by aluzzardi in "The agent harness belongs outside the sandbox"]]></title><description><![CDATA[
<p>Author here. Because of parallelism and non-determinism.<p>This problem is quite common and not limited to memories. For instance, Claude Code will block write attempts and steer the agent to perform a read first (because the file might have been modified in the meantime by the user or another agent).<p>Same principle here: rather than trying to deterministically “merge” concurrent writes, you fail the last write and let the agent read again and try another write.</p>
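<p>The fail-the-stale-write strategy above is essentially optimistic concurrency control. A minimal sketch, with a hypothetical version-tagged store (not the actual implementation):</p>

```python
# Sketch of optimistic concurrency: a write must carry the version it
# read; if the stored version has moved on, the write is rejected and
# the agent is expected to re-read and retry.

class StaleWriteError(Exception):
    pass

store = {"notes.md": {"content": "v1 text", "version": 1}}

def read(path):
    entry = store[path]
    return entry["content"], entry["version"]

def write(path, content, expected_version):
    entry = store[path]
    if entry["version"] != expected_version:
        # Fail the last writer; no deterministic merge attempted.
        raise StaleWriteError(f"{path} changed since version {expected_version}")
    store[path] = {"content": content, "version": expected_version + 1}

_, v = read("notes.md")
write("notes.md", "agent A edit", v)        # succeeds, bumps to version 2
try:
    write("notes.md", "agent B edit", v)    # stale version -> rejected
except StaleWriteError:
    _, v = read("notes.md")                 # re-read, then retry
    write("notes.md", "agent B edit", v)

assert store["notes.md"]["version"] == 3
```

The retry path is exactly the "read again, try another write" loop the comment describes, just made mechanical.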
]]></description><pubDate>Sat, 02 May 2026 22:40:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=47991312</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47991312</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47991312</guid></item><item><title><![CDATA[New comment by aluzzardi in "The agent harness belongs outside the sandbox"]]></title><description><![CDATA[
<p>Author here. My definition is: you take an agent, remove the model, and you’re left with the harness.<p>Tools, memories, sandboxing, steering, etc.</p>
]]></description><pubDate>Sat, 02 May 2026 22:32:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=47991256</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47991256</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47991256</guid></item><item><title><![CDATA[New comment by aluzzardi in "The agent harness belongs outside the sandbox"]]></title><description><![CDATA[
<p>Author here. Depending on how it’s designed, the harness itself doesn’t need any sandboxing.<p>At the end of the day, it’s a “simple” loop that calls an external API (LLM) and receives requests to execute stuff on its behalf.<p>It’s not the agent running bash commands: you (the harness author) are, and you’re in full control of where and how those commands get executed.<p>In the article’s case, bash commands are forwarded to a sandbox, nothing ever runs on the harness itself (it physically can’t, local execution is not even implemented in the harness).</p>
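<p>The "simple loop" described above can be sketched in a few lines. Here `call_llm` and `sandbox_exec` are stand-ins for a real model API and a sandbox RPC; the point is that local execution never appears anywhere in the harness.</p>

```python
# Sketch of a harness loop that lives outside the sandbox: it calls the
# LLM API, and when the model requests a Bash command, forwards it to a
# remote sandbox. There is no local exec path to exploit.

def call_llm(messages):
    # Stand-in for an LLM API call; returns either text or a tool request.
    return {"type": "tool_use", "name": "Bash", "input": {"command": "ls"}}

def sandbox_exec(command):
    # Stand-in for an RPC to a remote sandbox; nothing runs locally.
    return {"stdout": f"ran in sandbox: {command}", "exit_code": 0}

def agent_loop(messages, max_turns=1):
    for _ in range(max_turns):
        reply = call_llm(messages)
        if reply["type"] == "tool_use" and reply["name"] == "Bash":
            # The harness, not the model, decides where this executes.
            result = sandbox_exec(reply["input"]["command"])
            messages.append({"role": "tool", "content": result["stdout"]})
        else:
            return reply
    return messages

out = agent_loop([{"role": "user", "content": "list files"}])
```

From the model's point of view it "ran a command"; from the harness's point of view it issued one scoped RPC.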
]]></description><pubDate>Sat, 02 May 2026 22:19:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=47991161</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47991161</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47991161</guid></item><item><title><![CDATA[We Built Our AI Agent]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.mendral.com/blog/how-we-built-our-ai-agent">https://www.mendral.com/blog/how-we-built-our-ai-agent</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47352272">https://news.ycombinator.com/item?id=47352272</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 12 Mar 2026 15:33:13 +0000</pubDate><link>https://www.mendral.com/blog/how-we-built-our-ai-agent</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47352272</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47352272</guid></item><item><title><![CDATA[Our Agent's Most Important Job Is Deciding Not to Think]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.mendral.com/blog/agent-orchestration-model-hierarchy">https://www.mendral.com/blog/agent-orchestration-model-hierarchy</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47276519">https://news.ycombinator.com/item?id=47276519</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 06 Mar 2026 15:52:01 +0000</pubDate><link>https://www.mendral.com/blog/agent-orchestration-model-hierarchy</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47276519</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47276519</guid></item><item><title><![CDATA[New comment by aluzzardi in "We gave terabytes of CI logs to an LLM"]]></title><description><![CDATA[
<p>It started with Sonnet 4.0 as a single agent, and now it’s a mix of Opus 4.6 and Haiku 4.5 agents.<p>Opus plans the investigation and orchestrates the searches.<p>Haiku is the one actually querying ClickHouse and returning the relevant bits.</p>
]]></description><pubDate>Sat, 28 Feb 2026 04:25:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=47190408</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47190408</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47190408</guid></item><item><title><![CDATA[New comment by aluzzardi in "We gave terabytes of CI logs to an LLM"]]></title><description><![CDATA[
<p>> it's not magic and you need to make the job of the agent easier by giving it good instructions, tools, and environments.<p>This. We had much better success letting the agent pull context rather than trying to push what we thought was relevant.<p>Turns out it's exactly like a human: if you push the wrong context, it'll influence them to follow the wrong pattern.</p>
]]></description><pubDate>Fri, 27 Feb 2026 23:07:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=47187196</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47187196</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47187196</guid></item><item><title><![CDATA[New comment by aluzzardi in "We gave terabytes of CI logs to an LLM"]]></title><description><![CDATA[
<p>There are 2 layers of compression:<p>- ZSTD (actual data compression)<p>- De-duplication (i.e. what you're saying)<p>Although AFAIK it's not "just point to it" but rather storing sorted data and being able to say "the next 2M rows have the same PR Title"</p>
]]></description><pubDate>Fri, 27 Feb 2026 19:33:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=47184537</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47184537</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47184537</guid></item><item><title><![CDATA[New comment by aluzzardi in "We gave terabytes of CI logs to an LLM"]]></title><description><![CDATA[
<p>Mendral co-founder and post author here.<p>I agree with your statement and explained in a few other comments how we're doing this.<p>tl;dr:<p>- Something happens that needs investigating<p>- Main (Opus) agent makes a focused plan and spawns sub-agents (Haiku)<p>- They use ClickHouse queries to grab only relevant pieces of logs and return summaries/patterns<p>This is what you would do manually: you're not going to read through 10 TB of logs when something happens; you make a plan, open a few tabs, and start doing narrow, focused searches.</p>
]]></description><pubDate>Fri, 27 Feb 2026 18:48:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=47184029</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47184029</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47184029</guid></item><item><title><![CDATA[New comment by aluzzardi in "We gave terabytes of CI logs to an LLM"]]></title><description><![CDATA[
<p>From our experience running this, we're seeing patterns like these:<p>- Opus agent wakes up when we detect an incident (e.g. CI broke on main)<p>- It looks at the big picture (e.g. which job broke) and makes a plan to investigate<p>- It dispatches narrowly focused tasks to Haiku sub-agents (e.g. "extract the failing log patterns from commit XXX on job YYY ...")<p>- Sub-agents use the equivalent of "tail", "grep", etc. (via SQL) on a very narrow subset of logs (as directed by Opus) and return only relevant data (so they can interpret INFO logs as actually being the problem)<p>- Parent Opus agent correlates between sub-agents and can decide to spawn more sub-agents to continue the investigation<p>It's no different than what I would do as a human, really. If there are terabytes of logs, I'm not going to read all of them: I'll make a plan, open a bunch of tabs, and surface the interesting bits.</p>
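<p>The fan-out pattern above can be sketched as a planner that dispatches narrow log-research tasks to cheap sub-agents, each running one scoped query and returning a short summary. Everything here is hypothetical scaffolding: the query helper, table/column names, and task shape are illustrative, and real model calls are elided.</p>

```python
# Sketch of the orchestration pattern: planner fans out narrow tasks,
# sub-agents query a small slice of logs and return only summaries.

def clickhouse_query(sql):
    # Stand-in for a real ClickHouse client call.
    return [{"ts": "12:00", "line": "ERROR: connection refused"}]

def haiku_subagent(task):
    # Sub-agent: one narrow, bounded query, then summarize.
    rows = clickhouse_query(
        f"SELECT ts, line FROM ci_logs WHERE job = '{task['job']}' "
        f"AND commit = '{task['commit']}' LIMIT 1000"
    )
    return f"{task['job']}@{task['commit']}: {len(rows)} notable lines, e.g. {rows[0]['line']}"

def opus_orchestrator(incident):
    # Planner: look at the big picture, fan out focused tasks, then
    # correlate the summaries (correlation elided in this sketch).
    tasks = [{"job": j, "commit": incident["commit"]} for j in incident["failed_jobs"]]
    return [haiku_subagent(t) for t in tasks]

summaries = opus_orchestrator({"commit": "abc123", "failed_jobs": ["build", "test"]})
assert len(summaries) == 2
```

The key property is that only the short summaries ever enter the parent agent's context, never the raw logs.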
]]></description><pubDate>Fri, 27 Feb 2026 18:26:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=47183745</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47183745</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47183745</guid></item><item><title><![CDATA[New comment by aluzzardi in "We gave terabytes of CI logs to an LLM"]]></title><description><![CDATA[
<p>> My experience with LLM generated SQL in OLTP and OLAP platforms has been a mixed bag<p>Models are evolving <i>fast</i>. If your experience is older than a few months, I encourage you to try again.<p>I mean this with the best intentions: it's seriously mind-boggling. We started doing this with Sonnet 4.0 and the relevance was okay at best. Then in September we shifted to Sonnet 4.5 and it's been night and day.<p>Every single model released since then (Opus 4.5, 4.6) has meaningfully improved the quality of results.</p>
]]></description><pubDate>Fri, 27 Feb 2026 18:11:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=47183550</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47183550</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47183550</guid></item><item><title><![CDATA[New comment by aluzzardi in "We gave terabytes of CI logs to an LLM"]]></title><description><![CDATA[
<p>We've actually started to gather metrics this week to write that exact post :) Coming soon!</p>
]]></description><pubDate>Fri, 27 Feb 2026 18:01:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=47183441</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47183441</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47183441</guid></item><item><title><![CDATA[New comment by aluzzardi in "We gave terabytes of CI logs to an LLM"]]></title><description><![CDATA[
<p>Mendral co-founder here and author of the post.<p>This is an interesting approach. I definitely agree with the problem statement: if the LLM has to filter by error/fatal because of context window constraints, it will miss crucial information.<p>We took a different approach: we have a main agent (Opus 4.6) dispatching "log research" jobs to sub-agents (Haiku 4.5, which is fast and cheap). The sub-agent reads a whole bunch of logs and returns only the relevant parts to the parent agent.<p>This is exactly how coding agents (e.g. Claude Code) do it as well, except instead of having sub-agents use grep/read/tail, ours use plain SQL.</p>
]]></description><pubDate>Fri, 27 Feb 2026 17:55:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=47183345</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47183345</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47183345</guid></item><item><title><![CDATA[New comment by aluzzardi in "We gave terabytes of CI logs to an LLM"]]></title><description><![CDATA[
<p>Post author here.<p>Yes, it works really well.<p>1) The latest models are radically better at this. We noticed a massive improvement in quality starting with Sonnet 4.5.<p>2) The context issue is real. We solve it by using sub-agents that read through logs and return only the relevant bits to the parent agent’s context.</p>
]]></description><pubDate>Fri, 27 Feb 2026 16:36:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=47182522</link><dc:creator>aluzzardi</dc:creator><comments>https://news.ycombinator.com/item?id=47182522</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47182522</guid></item></channel></rss>