Hacker News: bisonbear

New comment by bisonbear in "Claude Opus 5"

bisonbear — Sat, 25 Jul 2026 01:11:50 +0000

Also working on a product to build tasks from your own work for testing coding agents. Main thing I would offer is to look carefully at the agent trajectories - they love to figure out ways to cheat. Additionally, consider what "winning" means. If just using test pass rate, consider that tests might not encode what good means in your repo. I have been having success using "equivalence with merged PR" as judged by an LLM as a signal.

New comment by bisonbear in "Why Software Factories Fail (or: harness engineering is not enough)"

bisonbear — Fri, 24 Jul 2026 03:02:37 +0000

To me, this comes down to verifiability. How do we measure the quality of what an agent is doing on our codebase rather than simply measuring task accomplishment?

> Verifying quality is orders of magnitude harder than "did the tests pass"

Agree that agentic grading is the future here. Cognition's Frontier Code is probably the best large public benchmark at this. You attribute agent quality issues to RLVR's binary pass/fail; however, I wouldn't be surprised if labs are already supplementing that with rubrics as rewards to train more 'tasteful' models like Fable.

What can a practitioner do? I think there's promise in turning the optimization machine to the harness itself - building out a representative dataset of tasks on your repo, grading agent quality on them across various configurations, and optimizing [AGENTS.md / SKILLS.md / workflow / model / harness / tools] on that signal. High quality grading is still very hard, but it's more tractable at smaller, repo-level scale, and you can afford slower, more expensive verification for each task. You only need it to be right about your codebase's standards.
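A minimal sketch of the comparison step, assuming each configuration has already been graded per task on a 0-1 scale. The function name, configuration names, and scores are illustrative placeholders, not real results:

```python
from statistics import mean


def rank_configs(scores: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Rank harness configurations by mean grade over the tasks they all share.

    `scores` maps a configuration name to {task_id: grade}. Only tasks graded
    under every configuration are compared, so a config is not rewarded for
    skipping hard tasks.
    """
    if not scores:
        raise ValueError("need at least one configuration")
    shared = set.intersection(*(set(per_task) for per_task in scores.values()))
    if not shared:
        raise ValueError("configurations share no graded tasks")
    ranked = [
        (name, mean(per_task[task] for task in sorted(shared)))
        for name, per_task in scores.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Placeholder grades for two hypothetical configurations.
example_scores = {
    "baseline": {"task-1": 0.5, "task-2": 0.75, "task-3": 0.25},
    "shorter-agents-md": {"task-1": 0.75, "task-2": 0.75, "task-3": 0.5},
}
ranking = rank_configs(example_scores)
```

Ranking by mean is the simplest possible signal; with few tasks the gap between two configurations can easily be noise, so it is only a starting point.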

> In fact, it's not hard to imagine that if a model could reliably tell good code from bad, it might have written the good version to begin with

Pushing back slightly - detecting slop and discriminating quality is easier than generating good code (which is why code review is so effective), and that's why grading is viable at repo eval scale even if it's much harder at RL scale.

Everyone is flying blind. For example, I am genuinely interested in trying HumanLayer, but would likely want some harder evidence (beyond anecdotes) that it's actually making my agents more effective before rolling out to an enterprise team.

I'm building this harness optimization loop @ https://stet.sh if curious

I ran Sonnet 5 vs. Opus 4.8 head to head on 24 tasks to see what's different

bisonbear — Wed, 15 Jul 2026 14:38:05 +0000

Article URL: https://www.stet.sh/blog/sonnet-5-vs-opus-4-8-reasoning-dial

Comments URL: https://news.ycombinator.com/item?id=48921537

Points: 1

# Comments: 0

New comment by bisonbear in "How are you measuring Claude Code and Codex performance?"

bisonbear — Wed, 08 Jul 2026 04:25:17 +0000

It depends on what you're measuring. I agree that model resourcefulness is useful, but if you're trying to simulate real user sessions, then Claude looking at upstream Git and fetching the answer directly is somewhat worthless.

In my case, I'm trying to measure how coding agents perform under realistic scenarios when implementing tasks, as a proxy for how agents perform when used by actual users for those same tasks, so it's important to ensure the agents are behaving realistically instead of "cheating" and looking up answers.
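One cheap way to catch that kind of lookup is to scan the shell commands an agent ran. This is a rough sketch, assuming the trajectory is available as a list of command strings; the function name and the pattern list are hypothetical and far from exhaustive:

```python
import re

# Illustrative patterns for commands that could reach the upstream answer.
SUSPICIOUS_PATTERNS = [
    r"\bgit\s+(fetch|pull|clone)\b",  # pulling upstream history
    r"\bgit\s+log\b.*--all\b",        # browsing refs beyond the task's base commit
    r"\bgh\s+pr\b",                   # reading pull requests via the GitHub CLI
    r"\b(curl|wget)\b",               # arbitrary network fetches
]


def flag_suspicious_commands(commands: list[str]) -> list[str]:
    """Return the commands that match any suspicious pattern, in order."""
    return [
        cmd
        for cmd in commands
        if any(re.search(pattern, cmd) for pattern in SUSPICIOUS_PATTERNS)
    ]


trajectory = [
    "pytest -q",
    "git fetch origin main",
    "git log --all --oneline",
    "sed -i s/a/b/ x.py",
]
flagged = flag_suspicious_commands(trajectory)
```

A match only means the run deserves a manual look; a pattern list like this will miss creative workarounds, so it complements sandbox restrictions rather than replacing them.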

Happy to share resources! I've been pretty deep in the space :)

New comment by bisonbear in "How are you measuring Claude Code and Codex performance?"

bisonbear — Tue, 07 Jul 2026 19:54:58 +0000

as a tip - models will always find a way to cheat, you will probably need to impose some restrictions on what they do / are able to access in the sandbox environment

see https://cursor.com/blog/reward-hacking-coding-benchmarks

New comment by bisonbear in "How are you measuring Claude Code and Codex performance?"

bisonbear — Mon, 06 Jul 2026 23:22:29 +0000

I've actually been working on a solution for this problem! https://www.stet.sh/

At a high level, it:

- Mines tasks from your merged PRs/commits
- Replays them in Docker containers with different harness settings (change model / reasoning effort / AGENTS.md / etc)
- Grades the patches on various attributes (tests, equivalence with human patch, code quality)

The goal is to get a sense of how agents perform on your tasks, with your context, using the tools you do.
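To make the "different harness settings" part concrete, here is a sketch of how a run matrix could be enumerated. This is not stet's implementation; the `HarnessConfig` fields, function name, and all values are hypothetical placeholders:

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class HarnessConfig:
    """One combination of settings to replay a task under (hypothetical fields)."""
    model: str
    reasoning_effort: str
    agents_md: str  # label for which AGENTS.md variant is mounted


def build_run_matrix(task_ids, models, efforts, agents_md_variants):
    """Pair every task with every combination of harness settings."""
    configs = [
        HarnessConfig(model, effort, variant)
        for model, effort, variant in product(models, efforts, agents_md_variants)
    ]
    return [(task_id, config) for task_id in task_ids for config in configs]


# Placeholder task ids and setting names.
matrix = build_run_matrix(
    task_ids=["task-1", "task-2"],
    models=["model-a", "model-b"],
    efforts=["low", "high"],
    agents_md_variants=["current"],
)
```

The matrix grows multiplicatively with each setting, so in practice it probably makes sense to vary one dimension at a time.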

This is currently one-shot but I'd definitely like to explore session-based benchmarks as well. There are some interesting papers that just came out on this https://arxiv.org/abs/2606.29957 https://arxiv.org/abs/2606.30573

I evaluated GLM 5.2 against the frontier on tasks from real repos

bisonbear — Sat, 20 Jun 2026 14:01:01 +0000

Article URL: https://www.stet.sh/blog/glm-5-2-passes-tests-fails-review

Comments URL: https://news.ycombinator.com/item?id=48609306

Points: 2

# Comments: 2

New comment by bisonbear in "Ask HN: What's good for VR these days, free and paid"

bisonbear — Mon, 08 Jun 2026 01:21:26 +0000

beat saber is the only game I play on it and it's incredible

New comment by bisonbear in "Ask HN: Are we as society going to let LLM companies take all the values?"

bisonbear — Sun, 07 Jun 2026 23:27:08 +0000

The most salient point here is the societal acceptance of consuming slop - somehow we've gotten to a point where the majority of people are ok with mediocre art. I feel that this is a trend that AI has only amplified. The commodification of attention has gradually led us to a point where we're optimizing for engagement instead of for intrinsic value of the content itself.

Personally, I will continue seeking out high-quality music/art/movies/books that speak to me, and most of my friends do the same. There will always be a demand for human-created art, regardless of any plagiarism or replication by labs.

New comment by bisonbear in "My Agent Skill for Test-Driven Development"

bisonbear — Fri, 05 Jun 2026 22:54:08 +0000

Agree - all of this is based on vibes (I also use TDD based on vibes FWIW). The only way to settle "does TDD / caveman / [insert random skill here] help" is to replay real PRs from your repo and measure quality

I benchmarked Opus 4.8 vs. GPT 5.5 on 2 open source repos

bisonbear — Wed, 03 Jun 2026 17:06:22 +0000

Article URL: https://www.stet.sh/blog/opus-48-vs-gpt-55-vs-opus-47-vs-composer-25

Comments URL: https://news.ycombinator.com/item?id=48386637

Points: 3

# Comments: 0

New comment by bisonbear in "I used autoresearch to improve my AGENTS.md, measured against real tasks"

bisonbear — Thu, 28 May 2026 04:05:08 +0000

> Seems like the progressive disclosure approach is the best for context efficiency; I wound up with a somewhat tight generic AGENTS.md, and the .cursor/rules individual files with glob matching for file names. Cursor honored those well.

This is also generally where I've landed - keep the AGENTS.md super light, and link out to docs as needed. Same idea with skills as well. Basically, preserve the context window at all costs.

The part I'm curious about is, when we're making the sorts of behavior changes you're describing on shared repos, how do we actually measure and quantify impact? It's one thing to tell the team that the agent should perform better, and it's another to say that you made the agent 5% better across a variety of tasks for every dev in the repo.

New comment by bisonbear in "I used autoresearch to improve my AGENTS.md, measured against real tasks"

bisonbear — Thu, 28 May 2026 04:01:59 +0000

> we lack common tools to assess and compare

This has been bothering me for a while - the entire dev community is running on vibes when talking about AI. We're operating in an old paradigm, thinking that smart and logical additions to AGENTS.md result in good agent behavior, when in fact agent behavior is such a black box that measurement is necessary.

> Even when all the rigging is controlled. (Which implies we need multiple experiments to compare against.)

Even the rigging is hard to control - Anthropic has an interesting piece on this here https://www.anthropic.com/engineering/infrastructure-noise

New comment by bisonbear in "I used autoresearch to improve my AGENTS.md, measured against real tasks"

bisonbear — Thu, 28 May 2026 03:58:26 +0000

Yes, agree that low n makes overclaiming a real risk with this sort of optimization loop. Low n results can be useful directionally, but they can't support a claim of superiority without expanding the dataset. If I were running this for a shared repo with real consequences / value to improving AGENTS.md, instead of just as an experiment, I would expand n by a few factors for training / holdout, depending on expected variation on the tasks.
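One way to see how little a small n supports is a paired bootstrap over per-task score differences. A minimal sketch, assuming both AGENTS.md variants were graded on the same tasks in the same order; the function name and scores are illustrative placeholders, not measured data:

```python
import random


def paired_bootstrap_ci(baseline, candidate, n_resamples=10_000, alpha=0.05, seed=0):
    """Return (mean difference, lower bound, upper bound) for candidate - baseline.

    Resamples the per-task differences with replacement and takes percentile
    bounds of the resampled means.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Placeholder grades for six tasks under two AGENTS.md variants.
baseline_scores = [0.5, 0.25, 0.75, 0.5, 0.25, 0.5]
candidate_scores = [0.75, 0.25, 0.75, 0.75, 0.5, 0.5]
point, lower, upper = paired_bootstrap_ci(baseline_scores, candidate_scores)
```

If the interval is wide or straddles zero, the result is directional at best. Even a narrow interval says nothing about tasks unlike the sampled ones, which is where a holdout set helps.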

I'm also noticing similar patterns with needing to update AGENTS.md / skills per model release. E.g., going from Opus 4.6 -> 4.7, the model became much more instruction adherent, so instructions written for the prior model generation might cause unexpected behavior in the new generation. I'm also convinced that an optimal AGENTS.md for Codex is not the same file as an optimal CLAUDE.md for Claude - the model personalities and behaviors are so different that we probably need to tune the instructions differently as well.

I used autoresearch to improve my AGENTS.md, measured against real tasks

bisonbear — Wed, 27 May 2026 19:56:09 +0000

Article URL: https://www.stet.sh/blog/how-i-used-codex-to-improve-its-own-agents-md

Comments URL: https://news.ycombinator.com/item?id=48299687

Points: 8

# Comments: 7

A brief investigation into the GPT-5.5 regression claims

bisonbear — Tue, 19 May 2026 19:39:32 +0000

Article URL: https://www.stet.sh/blog/gpt-55-high-regression-check-graphql-go-tools

Comments URL: https://news.ycombinator.com/item?id=48198356

Points: 1

# Comments: 0

New comment by bisonbear in "Ask HN: Do you still spend time maintaining Claude.md / AGENTS.md files?"

bisonbear — Sat, 16 May 2026 17:26:41 +0000

Yeah, I've found that to be more effective. Going with the example, "Always clarify intent before acting" works better than "Never act without getting intent first", seemingly because telling the agent NOT to do something sometimes primes it to do that exact thing.

New comment by bisonbear in "Ask HN: Do you still spend time maintaining Claude.md / AGENTS.md files?"

bisonbear — Sat, 16 May 2026 16:20:16 +0000

My advice, from doing this myself and reading best practices, would be:

- Keep it concise, use progressive disclosure / nested AGENTS.md for information expansion
- Give agent the high level repo structure if necessary
- Have a "why" section to align the agent, high level, what your code is doing
- Keep behavior instructions positive where possible, eg Always clarify intent before acting

New comment by bisonbear in "Ask HN: Do you still spend time maintaining Claude.md / AGENTS.md files?"

bisonbear — Sat, 16 May 2026 15:12:40 +0000

AGENTS.md is extremely important - it's probably the highest leverage thing you can give your agent. It's injected into every turn, and the agents are trained to follow instructions. If anything, I think people are under-investing in AGENTS.md and going purely based on vibes.

For example, if I write a bad AGENTS.md for a repo with 100 engineers actively working in it, then every agent for every engineer gets worse, without anyone really noticing.

I think we should move towards data-based tuning of AGENTS.md, testing out changes, gathering data, and then making a decision on whether or not to ship it.

New comment by bisonbear in "Ask HN: How do you catch regressions when you change your AI agent's prompt?"

bisonbear — Sat, 16 May 2026 14:47:08 +0000

I've been building a tool to do this - build a dataset based on tasks from your repo, then A/B test the agent with whatever change you're making to determine the impact prior to actually shipping it. If you want to check it out - stet.sh