Hacker News: dacharyc

New comment by dacharyc in "Agent Reading Test"

dacharyc — Tue, 07 Apr 2026 13:54:16 +0000

I suspect this is a result of relevance-based retrieval. In my colleague's testing, they found that sometimes the content comes back out of order or not at all, depending on the implementation's interpretation of which chunks of content were "relevant" to the query that accompanied the fetch result. I was surprised to find out some agents do this - when I started down this rabbit hole, I assumed they either returned some number of characters in order or did some sort of summary.

So many different implementations out there!
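To illustrate the difference described above, here is a minimal sketch contrasting in-order truncation with relevance-based selection. It does not reproduce any particular agent's implementation: the chunking, the keyword-overlap scoring, and the names `in_order_truncate` and `relevance_select` are all assumptions made for illustration.

```python
import re

def in_order_truncate(chunks, budget):
    """Return the first `budget` chunks, preserving page order."""
    return chunks[:budget]

def relevance_select(chunks, query, budget):
    """Return the `budget` chunks that share the most words with the query.

    Hypothetical scoring: a count of chunk words that also appear in the
    query. Real platforms may use embeddings or a summarization model.
    """
    terms = set(re.findall(r"\w+", query.lower()))

    def score(chunk):
        return sum(1 for w in re.findall(r"\w+", chunk.lower()) if w in terms)

    # Rank chunk indexes by score, highest first; ties keep page order.
    ranked = sorted(range(len(chunks)), key=lambda i: score(chunks[i]), reverse=True)
    return [chunks[i] for i in ranked[:budget]]

page = [
    "Install the CLI with the package manager.",
    "Configure credentials in the settings file.",
    "Troubleshooting: reset credentials if login fails.",
    "Uninstall by removing the package.",
]
query = "reset credentials login"

truncated = in_order_truncate(page, 2)
selected = relevance_select(page, query, 2)
```

With this toy page, `truncated` holds the first two chunks in page order, while `selected` holds the troubleshooting chunk followed by the configuration chunk: out of order, with the install and uninstall chunks never returned at all.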

New comment by dacharyc in "Agent Reading Test"

dacharyc — Tue, 07 Apr 2026 13:49:28 +0000

Ah, good point - this was intended to be a bonus point for agents that do not use a working browser, to evaluate whether they understood and communicated that the content was missing. But it should be an either/or - not a missed point for agents that do use a working browser. Thanks for pointing this out, I'll update it!

New comment by dacharyc in "Agent Reading Test"

dacharyc — Tue, 07 Apr 2026 01:11:24 +0000

Yeah, my colleague and I have been seeing in testing how much of a problem this actually is in practice. It has been surprising, and a little dismaying, how much it hurts content retrieval and results in poor UX.

New comment by dacharyc in "Agent Reading Test"

dacharyc — Tue, 07 Apr 2026 01:09:22 +0000

Hah, I originally had some content on the site that Claude Code's summarization agent (presumably Haiku) thought was prompt injection, so it refused to give the content to the foreground agent I was working with. I had to remove that content from the site to work around it. Of course implementations will vary, and not all platforms have the same safety measures in place around this yet, so there's probably some interesting work to do there.

New comment by dacharyc in "Agent Reading Test"

dacharyc — Mon, 06 Apr 2026 21:29:42 +0000

Hah, that's actually what drove me to try to create this to begin with. I've been writing a lot about these issues, and someone said to me:

> It'd be nice to have a test harness: "Test my agent," to score them and give you benchmark score (like graphics cards, etc.).

> Agent XYZ: reads only X% of the content it accesses.

I synced up with a colleague of mine who is testing retrieval behaviors across platforms right now, and writing about them at: https://rhyannonjoy.github.io/agent-ecosystem-testing/

The info we have so far isn't consistent enough for a standardized benchmark, but it's on our radar to produce something like this in the future as we home in on how to assess this more consistently, or at least how to compare outputs in a more standardized way.

New comment by dacharyc in "Agent Reading Test"

dacharyc — Mon, 06 Apr 2026 21:26:02 +0000

Yeah, good call, we're on the same page about that. I designed this tool (agentreadingtest.com) to raise awareness of these issues in a more general way, so people can point agents at it and see how it performs for them. Separately, I maintain a related tool that can actually assess these issues in documentation sites: https://afdocs.dev/

My weighting system there counts the pages affected by SPA shells and caps the possible score at a "D" or "F" depending on the proportion of pages affected: https://afdocs.dev/interaction-diagnostics.html#spa-shells-i...

I've tried to weight things appropriately in assessing actual sites, but for the test here, I mostly wanted to let people see for themselves what types of failures can occur.
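A score cap of the kind described above could be sketched roughly as follows. This is not the afdocs implementation: the function name `cap_grade`, the letter-grade scale, and both thresholds (25% for a "D" cap, 50% for an "F" cap) are hypothetical values picked only to show the shape of the idea.

```python
GRADES = ["A", "B", "C", "D", "F"]  # ordered best to worst

def cap_grade(base_grade, affected_pages, total_pages,
              d_threshold=0.25, f_threshold=0.5):
    """Cap a site's grade based on the share of pages affected by SPA shells.

    The thresholds are hypothetical; the real tool's cutoffs are not
    stated in the comment this sketch accompanies.
    """
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")

    proportion = affected_pages / total_pages
    if proportion >= f_threshold:
        cap = "F"
    elif proportion >= d_threshold:
        cap = "D"
    else:
        return base_grade  # too few affected pages to trigger a cap

    # A cap only lowers a grade: keep whichever of the two is worse.
    return max(base_grade, cap, key=GRADES.index)

low = cap_grade("A", 10, 100)
mid = cap_grade("A", 30, 100)
high = cap_grade("A", 60, 100)
already_failing = cap_grade("F", 30, 100)
```

Under these assumed thresholds, a site that would otherwise earn an "A" stays at "A" with 10% of pages affected, drops to "D" at 30%, and drops to "F" at 60%. A site already at "F" is never raised by the cap.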

New comment by dacharyc in "Agent Reading Test"

dacharyc — Mon, 06 Apr 2026 21:21:40 +0000

Hey there - I'm the test author, and you've hit on one of the main points. For the summarization/relevance-based content return, this is a consideration for some of the agent platforms (although I've found others actually do better here than I expected!) - which is part of the point I'm trying to drive home to folks who aren't as familiar with these systems.

I chose to structure it this way intentionally because this is the finding. Most people are surprised that agents aren't 'seeing' everything that's there, and get frustrated when an agent says something isn't there when it clearly is. Raising awareness of this is one of the main points of the exercise, to me.

New comment by dacharyc in "Launch HN: Reploy (YC S20) – Instant fullstack staging environments for web apps"

dacharyc — Thu, 23 Jul 2020 15:25:22 +0000

Actually, according to my count, there are 9 other companies in this space that offer deploy environments as a service, in addition to 6 platforms that also offer this functionality as part of the platform.

I've been watching this space closely, as my org is one of the competitors (https://www.tugboat.qa/) that has been around for a few years. We started out as an internal tool in a development agency in 2012, and released this service as a product in 2016.

It's interesting to see all the new offerings in this space in the past 12 months. There's definitely an education problem, so it's probably good for all of us that so many new services are popping up that offer this functionality, therefore drawing attention to the fact that it's even an option. But the space is getting pretty saturated at this point, which makes it more difficult for newcomers.

New comment by dacharyc in "Show HN: On-demand staging environments for web apps"

dacharyc — Thu, 07 May 2020 21:25:50 +0000

There are a bunch of services that already do this in some permutation or another:

Tugboat: https://tugboat.qa/
Squash: https://www.squash.io/
Dockup: https://getdockup.com/
Release: https://www.releaseapp.io/
FeaturePeek: https://featurepeek.com/
Valist: https://www.valist.dev/

Some of them obfuscate the Docker aspect, or use different technologies (GCP, k8s), but this seems to be an area that is getting well-covered. And that doesn't consider similar functionality integrated in big platforms, like Netlify Deploy Previews, GitLab Review Apps, etc.