<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: bisonbear</title><link>https://news.ycombinator.com/user?id=bisonbear</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 10 Jun 2026 01:02:43 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=bisonbear" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by bisonbear in "Ask HN: What's good for VR these days, free and paid"]]></title><description><![CDATA[
<p>Beat Saber is the only game I play on it, and it's incredible</p>
]]></description><pubDate>Mon, 08 Jun 2026 01:21:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=48440352</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48440352</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48440352</guid></item><item><title><![CDATA[New comment by bisonbear in "Ask HN: Are we as society going to let LLM companies take all the values?"]]></title><description><![CDATA[
<p>The most salient point here is the societal acceptance of consuming slop - somehow we've gotten to a point where the majority of people are OK with mediocre art. I feel that this is a trend that AI has only amplified. The commodification of attention has gradually led us to a point where we're optimizing for engagement instead of for the intrinsic value of the content itself.<p>Personally, I will continue seeking out high-quality music/art/movies/books that speak to me, and most of my friends do the same. There will always be a demand for human-created art, regardless of any plagiarism or replication by labs.</p>
]]></description><pubDate>Sun, 07 Jun 2026 23:27:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=48439707</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48439707</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48439707</guid></item><item><title><![CDATA[New comment by bisonbear in "My Agent Skill for Test-Driven Development"]]></title><description><![CDATA[
<p>Agree - all of this is based on vibes (I also use TDD based on vibes FWIW). The only way to settle "does TDD / caveman / [insert random skill here] help" is to replay real PRs from your repo and measure quality.</p>
]]></description><pubDate>Fri, 05 Jun 2026 22:54:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=48419390</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48419390</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48419390</guid></item><item><title><![CDATA[I benchmarked Opus 4.8 vs. GPT 5.5 on 2 open source repos]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.stet.sh/blog/opus-48-vs-gpt-55-vs-opus-47-vs-composer-25">https://www.stet.sh/blog/opus-48-vs-gpt-55-vs-opus-47-vs-composer-25</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48386637">https://news.ycombinator.com/item?id=48386637</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 03 Jun 2026 17:06:22 +0000</pubDate><link>https://www.stet.sh/blog/opus-48-vs-gpt-55-vs-opus-47-vs-composer-25</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48386637</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48386637</guid></item><item><title><![CDATA[New comment by bisonbear in "I used autoresearch to improve my AGENTS.md, measured against real tasks"]]></title><description><![CDATA[
<p>> Seems like the progressive disclosure approach is the best for context efficiency; I wound up with a somewhat tight generic AGENTS.md, and the .cursor/rules individual files with glob matching for file names. Cursor honored those well.<p>This is also generally where I've landed - keep the AGENTS.md super light, and link out to docs as needed. Same idea with skills as well. Basically, preserve the context window at all costs.<p>The part I'm curious about is, when we're making the sorts of behavior changes you're describing on shared repos, how do we actually measure and quantify impact? It's one thing to tell the team that the agent <i>should</i> perform better, and it's another to say that you made the agent 5% better across a variety of tasks for every dev in the repo.</p>
]]></description><pubDate>Thu, 28 May 2026 04:05:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=48304354</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48304354</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48304354</guid></item><item><title><![CDATA[New comment by bisonbear in "I used autoresearch to improve my AGENTS.md, measured against real tasks"]]></title><description><![CDATA[
<p>> we lack common tools to assess and compare<p>This has been bothering me for a while - the entire dev community is running on vibes when talking about AI. We're operating in an old paradigm, thinking that smart and logical additions to AGENTS.md result in good agent behavior, when in fact agent behavior is such a black box that measurement is necessary.<p>> Even when all the rigging is controlled. (Which implies we need multiple experiments to compare against.)<p>Even the rigging is hard to control - Anthropic has an interesting piece on this here: <a href="https://www.anthropic.com/engineering/infrastructure-noise" rel="nofollow">https://www.anthropic.com/engineering/infrastructure-noise</a></p>
]]></description><pubDate>Thu, 28 May 2026 04:01:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=48304338</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48304338</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48304338</guid></item><item><title><![CDATA[New comment by bisonbear in "I used autoresearch to improve my AGENTS.md, measured against real tasks"]]></title><description><![CDATA[
<p>Yes, agree that low n makes overclaiming a real risk with this sort of optimization loop. Low-n results can be useful directionally, but you can't claim superiority without expanding the dataset. If I were running this for a shared repo where improving AGENTS.md had real consequences / value, instead of just as an experiment, I would expand n several times over for training / holdout, depending on the expected variation across tasks.<p>I'm also noticing similar patterns with needing to update AGENTS.md / skills per model release. E.g., going from Opus 4.6 -> 4.7, the model became much more instruction-adherent, so instructions written for the prior model generation might cause unexpected behavior in the new generation. I'm also convinced that an optimal AGENTS.md for Codex is not the same file as an optimal CLAUDE.md for Claude - the model personalities and behaviors are so different that we probably need to tune the instructions differently as well.</p>
]]></description><pubDate>Thu, 28 May 2026 03:58:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=48304320</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48304320</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48304320</guid></item><item><title><![CDATA[I used autoresearch to improve my AGENTS.md, measured against real tasks]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.stet.sh/blog/how-i-used-codex-to-improve-its-own-agents-md">https://www.stet.sh/blog/how-i-used-codex-to-improve-its-own-agents-md</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48299687">https://news.ycombinator.com/item?id=48299687</a></p>
<p>Points: 8</p>
<p># Comments: 7</p>
]]></description><pubDate>Wed, 27 May 2026 19:56:09 +0000</pubDate><link>https://www.stet.sh/blog/how-i-used-codex-to-improve-its-own-agents-md</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48299687</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48299687</guid></item><item><title><![CDATA[A brief investigation into the GPT-5.5 regression claims]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.stet.sh/blog/gpt-55-high-regression-check-graphql-go-tools">https://www.stet.sh/blog/gpt-55-high-regression-check-graphql-go-tools</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48198356">https://news.ycombinator.com/item?id=48198356</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 19 May 2026 19:39:32 +0000</pubDate><link>https://www.stet.sh/blog/gpt-55-high-regression-check-graphql-go-tools</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48198356</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48198356</guid></item><item><title><![CDATA[New comment by bisonbear in "Ask HN: Do you still spend time maintaining Claude.md / AGENTS.md files?"]]></title><description><![CDATA[
<p>Yeah, I've found that to be more effective. Going with the example, "Always clarify intent before acting" works better than "Never act without getting intent first", seemingly because telling the agent NOT to do something sometimes primes it to do that exact thing.</p>
]]></description><pubDate>Sat, 16 May 2026 17:26:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=48162108</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48162108</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48162108</guid></item><item><title><![CDATA[New comment by bisonbear in "Ask HN: Do you still spend time maintaining Claude.md / AGENTS.md files?"]]></title><description><![CDATA[
<p>My advice, from doing this myself and reading best practices, would be:<p>- Keep it concise; use progressive disclosure / nested AGENTS.md files for information expansion
- Give the agent the high-level repo structure if necessary
- Have a "why" section to align the agent on what your code is doing at a high level
- Keep behavior instructions positive where possible, e.g. "Always clarify intent before acting"</p>
]]></description><pubDate>Sat, 16 May 2026 16:20:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=48161536</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48161536</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48161536</guid></item><item><title><![CDATA[New comment by bisonbear in "Ask HN: Do you still spend time maintaining Claude.md / AGENTS.md files?"]]></title><description><![CDATA[
<p>AGENTS.md is extremely important - it's probably the highest-leverage thing you can give your agent. It's injected into every turn, and the agents are trained to follow instructions. If anything, I think people are under-investing in AGENTS.md and going purely based on vibes.<p>For example, if I write a bad AGENTS.md for a repo with 100 engineers actively working in it, then every agent for every engineer gets worse, without anyone really noticing.<p>I think we should move towards data-based tuning of AGENTS.md: testing out changes, gathering data, and then making a decision on whether or not to ship them.</p>
]]></description><pubDate>Sat, 16 May 2026 15:12:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=48160918</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48160918</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48160918</guid></item><item><title><![CDATA[New comment by bisonbear in "Ask HN: How do you catch regressions when you change your AI agent's prompt?"]]></title><description><![CDATA[
<p>I've been building a tool to do this - build a dataset based on tasks from your repo, then A/B test the agent with whatever change you're making to determine the impact prior to actually shipping it. If you want to check it out - stet.sh</p>
]]></description><pubDate>Sat, 16 May 2026 14:47:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=48160725</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48160725</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48160725</guid></item><item><title><![CDATA[New comment by bisonbear in "A Claude Code and Codex Skill for Deliberate Skill Development"]]></title><description><![CDATA[
<p>Not the OP, but I've been thinking about this problem a lot - as devs we're overly reliant on vibes for evaluating coding agents. This is already a problem, and especially so if you're working in an engineering organization where a bad edit to AGENTS.md can cause silent regressions for everyone in the codebase.<p>To solve this, I've built an agent-native tool to run evaluations based on merged PRs in your codebase. Basically you can ask Claude to evaluate whether the skill made things better/worse on real tasks, and then to iteratively improve it.<p>Stalking your profile (sorry..) I see you're pretty deep in the eval space, so I'm super curious what your approach has been to being rigorous about things like skill changes?</p>
]]></description><pubDate>Thu, 14 May 2026 20:57:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=48141150</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48141150</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48141150</guid></item><item><title><![CDATA[The Opus 4.7 reasoning curve - Medium is the best default?]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.stet.sh/blog/opus-47-graphql-reasoning-curve">https://www.stet.sh/blog/opus-47-graphql-reasoning-curve</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48123075">https://news.ycombinator.com/item?id=48123075</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 13 May 2026 15:17:55 +0000</pubDate><link>https://www.stet.sh/blog/opus-47-graphql-reasoning-curve</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48123075</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48123075</guid></item><item><title><![CDATA[New comment by bisonbear in "Is Opus 4.7 a Downgrade?"]]></title><description><![CDATA[
<p>Claude <i>does</i> appear to work for longer, and use more tokens, when at higher reasoning modes. It just doesn't seem like this increased token usage leads to better actual outcomes</p>
]]></description><pubDate>Mon, 11 May 2026 12:38:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=48094205</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48094205</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48094205</guid></item><item><title><![CDATA[New comment by bisonbear in "Academic Research Skills for Claude Code"]]></title><description><![CDATA[
<p>Agree, it's impossible to tell if someone else's workflow works with your codebase without actually trying it, which takes time/tokens. I've been thinking about how to make running quick, directional evals easier / more efficient to give more confidence in using / developing skills. Basically, how do we go from vibes to data?</p>
]]></description><pubDate>Mon, 11 May 2026 02:38:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=48090490</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48090490</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48090490</guid></item><item><title><![CDATA[New comment by bisonbear in "Is Opus 4.7 a Downgrade?"]]></title><description><![CDATA[
<p>I'm actually currently working on benchmarking the Opus 4.7 reasoning curve against real-world tasks, and have found that reasoning effort does not seem to monotonically improve results (at least on the slice I'm looking at). I've been puzzling over this, but perhaps the fact that Claude Code has adaptive thinking explains some of it - even at medium reasoning effort, it can use more thinking tokens when needed to solve a complex problem.<p>Snapshot of the results (sorry for the busted format, ask your LLM for dataviz. Can't seem to format a good table in the comments)<p>Opus 4.7 on GraphQL-go-tools:<p>Low: 23/29 pass, 10/29 equivalent, 5/29 review-pass, custom avg 2.598, $2.50/task, 384s/task<p>Medium: 28/29 pass, 14/29 equivalent, 10/29 review-pass, custom avg 2.759, $3.15/task, 451s/task<p>High: 26/29 pass, 12/29 equivalent, 7/29 review-pass, custom avg 2.670, $5.01/task, 716s/task<p>Xhigh: 25/29 pass, 11/29 equivalent, 4/29 review-pass, custom avg 2.669, $6.51/task, 804s/task<p>Max: 27/29 pass, 13/29 equivalent, 8/29 review-pass, custom avg 2.690, $8.84/task, 997s/task<p>(custom avg is the average score across a set of rubrics used for LLM-as-a-judge, graded out of 4)<p>Practically, the results indicate that medium has better outcomes than the higher reasoning efforts, or at least the same outcomes considering variance, at a much lower cost/time.</p>
]]></description><pubDate>Sat, 09 May 2026 14:06:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=48075111</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48075111</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48075111</guid></item><item><title><![CDATA[GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve">https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48065788">https://news.ycombinator.com/item?id=48065788</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 08 May 2026 16:58:30 +0000</pubDate><link>https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=48065788</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48065788</guid></item><item><title><![CDATA[GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.stet.sh/blog/gpt-55-vs-opus-47">https://www.stet.sh/blog/gpt-55-vs-opus-47</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47976390">https://news.ycombinator.com/item?id=47976390</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 01 May 2026 16:06:35 +0000</pubDate><link>https://www.stet.sh/blog/gpt-55-vs-opus-47</link><dc:creator>bisonbear</dc:creator><comments>https://news.ycombinator.com/item?id=47976390</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47976390</guid></item></channel></rss>