<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: lieret</title><link>https://news.ycombinator.com/user?id=lieret</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 27 Apr 2026 17:26:28 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=lieret" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[Show HN: All the LM solutions on SWE-bench are bloated compared to humans]]></title><description><![CDATA[
<p>Article URL: <a href="https://twitter.com/KLieret/status/2029219763423986030">https://twitter.com/KLieret/status/2029219763423986030</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47249164">https://news.ycombinator.com/item?id=47249164</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 04 Mar 2026 15:44:17 +0000</pubDate><link>https://twitter.com/KLieret/status/2029219763423986030</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=47249164</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47249164</guid></item><item><title><![CDATA[Show HN: New eval from SWE-bench team evaluates LMs based on goals not tickets]]></title><description><![CDATA[
<p>Current evals test LMs on tasks: "fix this bug," "write a test"<p>But we code to achieve goals: maximize revenue, cut costs, win users<p>Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals.<p>Because real software dev isn’t about following instructions. It’s about achieving outcomes.<p>Here's how it works:<p>Two LMs enter a tournament. Each maintains its own codebase.<p>Every round:<p>1. Edit Phase: LMs modify their codebases however they like
2. Competition phase: Codebases battle in an arena.
3. Repeat<p>The LM that wins the majority of rounds is declared the winner. (A minimal sketch of this loop follows the links below.)<p>Arenas can be anything: games, trading sims, cybersecurity environments. We currently have 6 arenas implemented and support for 8 different programming languages.<p>This has been one of our biggest projects to date in terms of scale. Over the past few months, we've completed 1.5k tournaments, totalling more than 50,400 agent runs. And you can look at all of these runs right now from your browser (links below!)<p>You can find the rankings on our website (spoiler: Sonnet 4.5 tops the list), but perhaps more interesting: humans are still way ahead! In one of our arenas, even the worst solution from the human leaderboard is miles ahead of the best LM!<p>And we're not surprised: LMs consistently fail to properly adapt to outcomes, hallucinate about reasons for failure, and produce ever messier codebases with every round.<p>More information:<p><a href="https://codeclash.ai/" rel="nofollow">https://codeclash.ai/</a>
<a href="https://arxiv.org/pdf/2511.00839" rel="nofollow">https://arxiv.org/pdf/2511.00839</a>
<a href="https://github.com/codeclash-ai/codeclash" rel="nofollow">https://github.com/codeclash-ai/codeclash</a></p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45824582">https://news.ycombinator.com/item?id=45824582</a></p>
<p>Points: 5</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 05 Nov 2025 16:13:16 +0000</pubDate><link>https://codeclash.ai/</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=45824582</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45824582</guid></item><item><title><![CDATA[New comment by lieret in "Top model scores may be skewed by Git history leaks in SWE-bench"]]></title><description><![CDATA[
<p>[On the SWE-bench team] We read and analyzed a lot of trajectories, but it seems that only recently have models started to exploit this, and only in a small fraction of instances. But yes, it clearly shouldn't have happened (and is now fixed in the new container versions).</p>
]]></description><pubDate>Fri, 12 Sep 2025 00:12:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=45217428</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=45217428</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45217428</guid></item><item><title><![CDATA[New comment by lieret in "Top model scores may be skewed by Git history leaks in SWE-bench"]]></title><description><![CDATA[
<p>[On the SWE-bench team] As someone pointed out, SWE-bench Verified is a subset of tasks that were reviewed to be solvable (i.e., they have enough context in the task description) and that are scored with unit tests that aren't so overly specific that they rule out valid solutions.<p>We've all read & analyzed a large number of agent trajectories. This loophole seems to be something that popped up with the more recent models, and we simply weren't aware of it.<p>As discussed in the GitHub issue, there's a fix in the new version of the SWE-bench containers (currently being rolled out) that makes sure the relevant commits aren't available.<p>Part of what makes SWE-bench a very interesting benchmark is the enormous action space available to the agents that compete on it. However, that also means unexpected things happen as models get better. We're currently working on making all agent runs easily browsable on a website (rather than having to download our AWS buckets) to get even more eyes on the trajectories. Thanks to everyone who uncovered this loophole.</p>
]]></description><pubDate>Thu, 11 Sep 2025 23:23:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=45217129</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=45217129</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45217129</guid></item><item><title><![CDATA[New comment by lieret in "Top model scores may be skewed by Git history leaks in SWE-bench"]]></title><description><![CDATA[
<p>[Also on the SWE-bench team] Part of the reason this didn't surface earlier is that it only seems to affect more recent models, maybe as a result of reward hacking during post-training. We're currently working on making trajectories easier for everyone to access through a web tool (rather than having to download things from AWS) to get even more eyes on the trajectories. The interface will also include search & LM inspection tools to specifically look for anything that might qualify as cheating.</p>
]]></description><pubDate>Thu, 11 Sep 2025 23:15:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=45217083</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=45217083</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45217083</guid></item><item><title><![CDATA[Show HN: Randomly switching between LMs at every step boosts SWE-bench score]]></title><description><![CDATA[
<p>What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4, and it scored higher on SWE-bench than with either model separately.<p>GPT-5 by itself gets 65.0% and Sonnet 4 gets 64.8%, but randomly switching at every step gets us 67.2%.<p>This result was pretty surprising to us. There are a few more experiments in the blog post.</p>
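<p>Conceptually, the per-step switching is as simple as the sketch below. This is not the actual mini-SWE-agent code; query_lm() is a hypothetical stand-in for whatever completion client is used, and the model names are just labels.</p>
<pre><code># Illustrative sketch of per-step "model roulette"; not the actual mini-SWE-agent code.
import random

MODELS = ["gpt-5", "claude-sonnet-4"]  # labels for the two models being mixed

def query_lm(model: str, messages: list) -> str:
    """Hypothetical stand-in: send the conversation to `model` and return its reply."""
    return f"[{model}] reply to: {messages[-1]['content']}"

def agent_step(messages: list) -> str:
    # Roll the dice again at *every* step, so each turn may be handled by a different model.
    model = random.choice(MODELS)
    return query_lm(model, messages)

print(agent_step([{"role": "user", "content": "fix the failing test"}]))
</code></pre>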
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44962640">https://news.ycombinator.com/item?id=44962640</a></p>
<p>Points: 5</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 20 Aug 2025 15:09:32 +0000</pubDate><link>https://www.swebench.com/SWE-bench/blog/2025/08/19/mini-roulette/</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44962640</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44962640</guid></item><item><title><![CDATA[New comment by lieret in "GPT-5 on SWE-bench: Cost and performance deep-dive"]]></title><description><![CDATA[
<p>I think gpt-5-mini should really help them. At least judging from these benchmark scores, there probably shouldn't be a huge performance degradation from letting gpt-5-mini drive most of the workflow. Of course, users might still want to just run with the latest and greatest (but even then, gpt-5 will be cheaper, I think).</p>
]]></description><pubDate>Fri, 08 Aug 2025 16:57:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=44839226</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44839226</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44839226</guid></item><item><title><![CDATA[New comment by lieret in "GPT-5 on SWE-bench: Cost and performance deep-dive"]]></title><description><![CDATA[
<p>We evaluated the new GPT models with a minimal agent on SWE-bench Verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 5 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini!<p>Cost is tricky to compare with agents, because agents succeed fast but fail slowly. If an agent doesn't succeed, it just keeps trying until it either succeeds or hits a runtime limit. And that's (almost) what happens.<p>But even so, it's very clear that<p>1. GPT-5 is cheaper than Sonnet 4
2. GPT-5-mini is _incredibly_ cheap for what it provides (you only sacrifice some 5 percentage points, but end up paying maybe 1/5th of the total cost)<p>All of the code to reproduce our numbers is open-source. There's a box at the bottom of the post with the exact command to run to reproduce our numbers.<p>Also very happy to answer questions here!</p>
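<p>One way to read "succeed fast, fail slowly": failed runs tend to burn tokens until a step or runtime limit, so the average cost per instance mixes cheap successes with expensive failures. A rough back-of-the-envelope sketch (all numbers below are made-up placeholders, not measured SWE-bench costs):</p>
<pre><code># Back-of-the-envelope only: placeholder numbers, not measured SWE-bench costs.
def expected_cost_per_instance(solve_rate: float,
                               avg_cost_success: float,
                               avg_cost_failure: float) -> float:
    """Failures usually run until a step/time limit, so they cost more than successes."""
    return solve_rate * avg_cost_success + (1 - solve_rate) * avg_cost_failure

# Hypothetical example: a 65% solver with cheap successes and expensive failures.
print(expected_cost_per_instance(0.65, avg_cost_success=0.30, avg_cost_failure=1.00))
</code></pre>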
]]></description><pubDate>Fri, 08 Aug 2025 16:29:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=44838880</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44838880</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44838880</guid></item><item><title><![CDATA[GPT-5 on SWE-bench: Cost and performance deep-dive]]></title><description><![CDATA[
<p>Article URL: <a href="https://mini-swe-agent.com/latest/blog/2024/01/15/gpt-5-on-swe-bench-cost--performance-deep-dive/">https://mini-swe-agent.com/latest/blog/2024/01/15/gpt-5-on-swe-bench-cost--performance-deep-dive/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44838879">https://news.ycombinator.com/item?id=44838879</a></p>
<p>Points: 4</p>
<p># Comments: 3</p>
]]></description><pubDate>Fri, 08 Aug 2025 16:29:14 +0000</pubDate><link>https://mini-swe-agent.com/latest/blog/2024/01/15/gpt-5-on-swe-bench-cost--performance-deep-dive/</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44838879</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44838879</guid></item><item><title><![CDATA[New comment by lieret in "Show HN: Mini-swe-agent achieves 65% on SWE-bench in 100 lines of python"]]></title><description><![CDATA[
<p>Sorry, I missed that!<p>That's a little bit out of the scope of this project (because we were aiming for the bare minimum of what is needed to get a performant agent, and unfortunately learning from mistakes also isn't measured by most benchmarks, as they require tasks to be solved independently).<p>However, you can always add "memory" to agents by asking them to write to and read from a file in your repo (Claude.md, cursorrules, etc.). You can also try to automate this process and have a mechanism by which the LM decides itself when to put something in those files, similar to how memories work in ChatGPT. I think Cursor also recently started doing that.<p>> checking for new versions of libraries, and write a list of tasks first before the execution<p>Just add it to the prompt! That's not always desired behavior for a command line helper, but I think it shouldn't be too hard to get it to do that by prompting alone.</p>
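<p>A rough sketch of the file-based "memory" idea above (illustrative only; the file name and wiring are hypothetical, not part of mini-swe-agent):</p>
<pre><code># Illustrative sketch of file-based agent "memory"; the names here are hypothetical.
from pathlib import Path

MEMORY_FILE = Path("AGENT_NOTES.md")  # plays the same role as Claude.md / cursorrules

def build_system_prompt(base_prompt: str) -> str:
    """Prepend persistent notes and tell the LM it may update the notes file itself."""
    notes = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else "(no notes yet)"
    return (
        base_prompt
        + "\n\nNotes from previous sessions:\n" + notes
        + f"\n\nWhen you learn something reusable, append it to {MEMORY_FILE} "
        + "with a shell command so future runs can read it."
    )

print(build_system_prompt("You are a command-line coding assistant."))
</code></pre>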
]]></description><pubDate>Thu, 31 Jul 2025 14:55:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=44746342</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44746342</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44746342</guid></item><item><title><![CDATA[Show HN: New SWE-bench leaderboard compares LMs without fancy agent scaffolds]]></title><description><![CDATA[
<p>Hello from the SWE-bench/SWE-agent team at Princeton/Stanford.<p>When we created the SWE-bench benchmark in 2023 from hundreds of real-life GitHub issues/pull requests, the highest score was just a couple of percent. The tasks were so challenging for LMs that most people didn't even want to work on them.<p>Half a year later, SWE-agent showed that the early 2024 LMs were actually good enough to resolve up to 20% of the GitHub issues in the benchmark. This kicked off a whole wave of coding agents.<p>Back then, developing agents was all about working around tons of silly behavior from the LMs. For example, if a command didn't work, they would try running the exact same command again. If a command didn't return output, they would assume it never ran. They also couldn't get whitespace right in their edits, would get stuck in repetitive attempts, and much more.<p>So agents got pretty complicated to work around all of that bad LM behavior.<p>But now it's 2025, and LM companies have invested a whole lot of money to make their LMs really good at being agents.<p>So we asked two questions:<p>1. What's the simplest agent we can write that still scores near SotA?
2. How do LMs compare when we evaluate them using this simple agent?<p>Turns out, the agent can be very simple indeed! mini-swe-agent (<a href="https://github.com/SWE-agent/mini-swe-agent">https://github.com/SWE-agent/mini-swe-agent</a>) has only 100 lines of code for the agent class (plus roughly another 100 lines for the environment etc.). It is little more than a loop that parses the LM output for a shell command, executes it in a subshell, and continues (a minimal sketch of this loop is included below).<p>We then took various LMs and put them to the test in a real apples-to-apples comparison, without a fancy agent scaffold to prop up bad LMs.<p>Our new leaderboard <a href="https://www.swebench.com/" rel="nofollow">https://www.swebench.com/</a> shows the results.<p>The highest score is currently 65% with Claude Sonnet 4 (not much less than the 70% that most fancier agents report).<p>o3, o4-mini, and Gemini 2.5 Pro are significantly behind, but not hopeless, achieving 50-60%.<p>We were really surprised by these strong numbers overall: they show that as LMs get stronger and better at performing difficult, highly iterative tasks, we can take our hands off the steering wheel, provide the minimal necessary environment, and let the LM figure out the rest.<p>Let us know if you have any questions, our team is here on HN today :)</p>
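<p>To give a flavor of what "little more than a loop" means, here is a minimal sketch of such a loop. It is not the actual mini-swe-agent source; query_lm() is a hypothetical stand-in for an LM client, and the bash-code-block convention is just an assumption for the sketch.</p>
<pre><code># Minimal sketch of the agent loop described above; not the actual mini-swe-agent source.
# query_lm() is a hypothetical stand-in for an LM client; replace it with a real call.
import re
import subprocess

def query_lm(messages):
    """Hypothetical: return the LM's next reply for the conversation."""
    raise NotImplementedError("plug in your LM client here")

def extract_command(reply: str):
    """Assume the LM answers with one ```bash ...``` block containing a shell command."""
    match = re.search(r"```bash\n(.*?)```", reply, re.DOTALL)
    return match.group(1).strip() if match else None

def run_agent(task: str, max_steps: int = 50) -> None:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = query_lm(messages)
        messages.append({"role": "assistant", "content": reply})
        command = extract_command(reply)
        if command is None:  # no command means the agent considers the task finished
            break
        # Run the command in a subshell and feed its output back as the next observation.
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        messages.append({"role": "user", "content": result.stdout + result.stderr})
</code></pre>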
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44746077">https://news.ycombinator.com/item?id=44746077</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 31 Jul 2025 14:30:43 +0000</pubDate><link>https://www.swebench.com/</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44746077</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44746077</guid></item><item><title><![CDATA[New comment by lieret in "Show HN: Mini-swe-agent achieves 65% on SWE-bench in 100 lines of python"]]></title><description><![CDATA[
<p>In 2024, we developed SWE-bench and SWE-agent at Princeton University and helped kickstart the coding agent revolution.<p>Back then, LMs were optimized to be great at chatting, but not much else. This meant that agent scaffolds had to get very creative (and complicated) to make LMs perform useful work.<p>But in 2025, LMs are actively optimized for agentic coding, and we ask:<p>*What's the simplest coding agent that could still score near SotA on the benchmarks?*<p>*Turns out, it just requires 100 lines of code!*<p>And this system still *resolves 65% of all GitHub issues in the SWE-bench Verified benchmark* with Sonnet 4 (for comparison, when Anthropic launched Sonnet 4, they reported 70% with their own scaffold, which was never made public).<p>Honestly, we're all pretty stunned ourselves: we've now spent more than a year developing SWE-agent and would not have thought that such a small system could perform nearly as well.<p>I'll link to the project below (all open-source, of course). The hello world example is incredibly short & simple (and literally what gave us the 65%). But it is also meant as a serious command line tool + research project, so we provide a Claude Code-style UI & some utilities on top of that.<p>We have some team members from Princeton/Stanford here today, let us know if you have any questions/feedback :)</p>
]]></description><pubDate>Fri, 25 Jul 2025 13:27:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=44682898</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44682898</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44682898</guid></item><item><title><![CDATA[Show HN: Mini-swe-agent achieves 65% on SWE-bench in 100 lines of python]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/SWE-agent/mini-swe-agent">https://github.com/SWE-agent/mini-swe-agent</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44682897">https://news.ycombinator.com/item?id=44682897</a></p>
<p>Points: 7</p>
<p># Comments: 4</p>
]]></description><pubDate>Fri, 25 Jul 2025 13:27:29 +0000</pubDate><link>https://github.com/SWE-agent/mini-swe-agent</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44682897</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44682897</guid></item></channel></rss>