Hacker News: kerlenton

New comment by kerlenton in "Show HN: Marmot, context layer for agents and humans"

kerlenton — Mon, 29 Jun 2026 13:07:28 +0000

Makes sense. Three generic tools plus a summarize-first step is a nice way to sidestep the too-many-tools problem. But it looks like it moves the chokepoint rather than eliminating it: instead of "choose the right tool out of many", the problem becomes "formulate the right discover_data query from the summary".

In real use, does the model reliably drill down from the general summary, or does it often just stay at the summary level?

New comment by kerlenton in "Herdr: Agent multiplexer that lives in your terminal"

kerlenton — Mon, 29 Jun 2026 13:01:56 +0000

Multiplexing agents is clearly the first pain point, but the one I keep hitting right after it is visibility: with multiple agents running, it is hard to see what each one is doing, which programs it called, and where it is stuck until it completes or fails.

Does Herdr surface anything about what each agent is doing beyond its output? Or does it focus on orchestration for the time being?

New comment by kerlenton in "Show HN: Marmot, context layer for agents and humans"

kerlenton — Mon, 29 Jun 2026 12:37:15 +0000

The catalog approach makes sense for MCP as well. Something I would be interested in: once all of your services/APIs/DBs are exposed via one MCP server, the next chokepoint becomes the model selecting the correct tool. Past the first few dozen tools, agents pick the wrong tool (or none at all) more often than you would expect.

How does Marmot cope with that? Are all of the tools exposed flat, or is there a scoping/search step that lets an agent choose among only a few tools from the catalog?
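
To make "scoping/search step" concrete, here is a minimal sketch of one possible form: rank catalog entries by keyword overlap with the task and expose only the top few. This is not how Marmot works as far as the thread states; the catalog entries, `scope_tools`, and the stop-word list are all hypothetical, and a real system would more likely use embeddings or a proper search index.

```python
import re

# Tiny illustrative stop-word list (an assumption; tune for real use).
STOP_WORDS = {"a", "an", "the", "for", "to", "from", "by", "of"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS

def scope_tools(catalog, task, k=3):
    """Return up to k tool names whose name/description share words with the task."""
    task_words = tokenize(task)
    scored = []
    for tool in catalog:
        overlap = len(task_words & tokenize(tool["name"] + " " + tool["description"]))
        if overlap > 0:
            scored.append((overlap, tool["name"]))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]

# Hypothetical catalog entries, invented for illustration.
CATALOG = [
    {"name": "query_orders_db", "description": "Run a read-only SQL query against the orders database"},
    {"name": "send_slack_message", "description": "Post a message to a Slack channel"},
    {"name": "get_invoice", "description": "Fetch an invoice by id from the billing API"},
    {"name": "list_customers", "description": "List customers from the CRM"},
]

shortlist = scope_tools(CATALOG, "fetch the invoice for order 1234", k=2)
print(shortlist)
```

The agent would then see only the shortlisted tools instead of the whole catalog, which trades the "pick from dozens" problem for a dependence on the quality of the ranking.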

New comment by kerlenton in "Show HN: Ocarina – Automate and test MCP servers from YAML, no LLM"

kerlenton — Mon, 29 Jun 2026 12:32:43 +0000

Deterministic MCP server testing without any LLM in the loop is genuinely useful for CI. However, the issue I consistently run into sits one level above: the server can handle every scripted call just fine and still fail in practice, because the model never tries to use the tool or calls it with unexpected arguments. That is a behavioral issue that a YAML file cannot really capture, since you write the call yourself.

Is there any way to generate cases from real sessions, or is everything written by hand?
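
As a rough illustration of what "cases from real sessions" could mean, the sketch below pairs each recorded `tools/call` request with its response by JSON-RPC `id` and emits a simple case record. It assumes the session was captured as one JSON-RPC message per line and contains only client-to-server requests and their responses; the output case format and the name `cases_from_session` are hypothetical and are not Ocarina's YAML schema.

```python
import json

def cases_from_session(lines):
    """Turn recorded JSON-RPC lines into test cases, one per tools/call exchange."""
    pending = {}  # request id -> params of an unanswered tools/call
    cases = []
    for line in lines:
        msg = json.loads(line)
        if msg.get("method") == "tools/call" and "id" in msg:
            pending[msg["id"]] = msg["params"]
        elif "method" not in msg and msg.get("id") in pending:
            params = pending.pop(msg["id"])
            # A call can fail as a JSON-RPC error or as a tool result flagged isError.
            failed = "error" in msg or bool(msg.get("result", {}).get("isError", False))
            cases.append({
                "tool": params["name"],
                "arguments": params.get("arguments", {}),
                "expect_error": failed,
            })
    return cases

# Invented sample session: the model first calls the tool correctly, then with bad arguments.
SESSION = [
    '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}',
    '{"jsonrpc": "2.0", "id": 1, "result": {"tools": []}}',
    '{"jsonrpc": "2.0", "id": 2, "method": "tools/call", "params": {"name": "get_invoice", "arguments": {"invoice_id": "42"}}}',
    '{"jsonrpc": "2.0", "id": 2, "result": {"content": [], "isError": false}}',
    '{"jsonrpc": "2.0", "id": 3, "method": "tools/call", "params": {"name": "get_invoice", "arguments": {"id": 42}}}',
    '{"jsonrpc": "2.0", "id": 3, "error": {"code": -32602, "message": "Invalid params"}}',
]

cases = cases_from_session(SESSION)
print(json.dumps(cases, indent=2))
```

The second case is the interesting one: it records the unexpected arguments the model actually sent, which is exactly what a hand-written script tends to miss.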

New comment by kerlenton in "Show HN: I scanned 87 MCP servers for agent-authority hygiene – leaderboard"

kerlenton — Sun, 28 Jun 2026 15:27:04 +0000

A helpful place to start. One thing I would point out is that much of a server's authority only shows up at runtime rather than in the configuration or tool definition. What actually happens on the server (e.g., what data it requests, whether it sends sampling requests back to the client, etc.) may not be visible from the static manifest. Did you score these from the declared schema, and were you able to confirm those scores by actually running the servers and observing what they request? The gap between declared authority and observed authority is where much of the risk is likely to live.
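
The declared-versus-observed gap can be expressed as a plain set difference. The sketch below assumes both sides have already been reduced to comparable capability labels; the labels and the function name `authority_gap` are hypothetical, and extracting them from a manifest or a runtime trace is the hard part that this does not cover.

```python
def authority_gap(declared, observed):
    """Compare capability labels from the manifest with those seen at runtime."""
    return {
        # Exercised at runtime but never declared: the risky direction.
        "undeclared_but_observed": sorted(observed - declared),
        # Declared but never exercised during the observation window.
        "declared_but_unused": sorted(declared - observed),
    }

# Invented labels for illustration only.
declared = {"fs.read", "net.outbound"}
observed = {"fs.read", "fs.write", "sampling.request"}

gap = authority_gap(declared, observed)
print(gap)
```

An empty `undeclared_but_observed` list only shows that nothing extra was seen during the run, not that the server cannot do more.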

New comment by kerlenton in "Show HN: Orchid – Local-first record and replay for AI agent debugging"

kerlenton — Sun, 28 Jun 2026 15:18:40 +0000

The record-and-replay approach you describe is interesting. I came at this problem from a different angle, though: I built a transparent wiretap proxy that sits in the middle of the JSON-RPC traffic between the client and the MCP server, so record/replay happens on the wire rather than inside the agent. The advantage is that it works with clients you don’t own or can’t instrument (Claude Desktop, Cursor); the disadvantage is that it captures only the protocol, not the agent’s reasoning (which your method captures).

How do you deal with replay non-determinism? When I replay a captured call, I spin up a fresh server instance, but any server-side state, or the model choosing different arguments the second time around, makes it hard to reproduce the original input and output exactly. I’m curious how Orchid handles that in multi-step executions.
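
One way to at least detect this kind of divergence at the wire level is to key recorded responses by method plus canonicalized params, and to flag any replayed request that has no recorded match. This is a minimal sketch under that assumption, not the implementation of the proxy mentioned above or of Orchid; `Replayer` and `call_key` are hypothetical names.

```python
import json

def call_key(request):
    """Canonical key for a request: method plus params, with the id ignored."""
    return json.dumps(
        {"method": request["method"], "params": request.get("params", {})},
        sort_keys=True,
    )

class Replayer:
    def __init__(self, recorded_pairs):
        # Identical calls keep their recorded order, so a stateful tool
        # returns its first recorded answer first, then its second, and so on.
        self._responses = {}
        for request, response in recorded_pairs:
            self._responses.setdefault(call_key(request), []).append(response)
        self.divergences = []

    def replay(self, request):
        """Return the recorded response for this request, or None if it diverged."""
        queue = self._responses.get(call_key(request))
        if not queue:
            self.divergences.append(request)
            return None
        recorded = queue.pop(0)
        # Re-stamp the response with the id of the live request.
        return {**recorded, "id": request["id"]}

def make_call(req_id, arguments):
    """Build a tools/call request for a hypothetical 'counter' tool."""
    return {"jsonrpc": "2.0", "id": req_id, "method": "tools/call",
            "params": {"name": "counter", "arguments": arguments}}

recorded = [
    (make_call(1, {}), {"jsonrpc": "2.0", "id": 1, "result": {"value": 1}}),
    (make_call(2, {}), {"jsonrpc": "2.0", "id": 2, "result": {"value": 2}}),
]

replayer = Replayer(recorded)
first = replayer.replay(make_call(10, {}))
second = replayer.replay(make_call(11, {}))
diverged = replayer.replay(make_call(12, {"step": 5}))  # arguments never recorded
print(first, second, diverged, len(replayer.divergences))
```

This only makes divergence visible; it does not resolve it. What to do once the model picks different arguments (fail the replay, fall through to a live server, or match fuzzily) is still an open design choice.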