<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: lieret</title><link>https://news.ycombinator.com/user?id=lieret</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 27 Apr 2026 17:26:28 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=lieret" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[Show HN: All the LM solutions on SWE-bench are bloated compared to humans]]></title><description><![CDATA[
<p>Article URL: <a href="https://twitter.com/KLieret/status/2029219763423986030">https://twitter.com/KLieret/status/2029219763423986030</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47249164">https://news.ycombinator.com/item?id=47249164</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 04 Mar 2026 15:44:17 +0000</pubDate><link>https://twitter.com/KLieret/status/2029219763423986030</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=47249164</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47249164</guid></item><item><title><![CDATA[Show HN: New eval from SWE-bench team evaluates LMs based on goals not tickets]]></title><description><![CDATA[
<p>Current evals test LMs on tasks: "fix this bug," "write a test"<p>But we code to achieve goals: maximize revenue, cut costs, win users<p>Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals.<p>Because real software dev isn’t about following instructions. It’s about achieving outcomes.<p>Here's how it works:<p>Two LMs enter a tournament. Each maintains its own codebase.<p>Every round:<p>1. Edit Phase: LMs modify their codebases however they like
2. Competition phase: Codebases battle in an arena.
3. Repeat<p>The LM that wins the majority of rounds is declared the winner. (A minimal sketch of this loop follows the links below.)<p>Arenas can be anything: games, trading sims, cybersecurity environments. We currently have 6 arenas implemented and support for 8 different programming languages.<p>This has been one of our biggest projects to date in terms of scale. Over the past few months, we've completed 1.5k tournaments, totalling more than 50,400 agent runs. And you can look at all of these runs right now from your browser (links below!)<p>You can find the rankings on our website (spoiler: Sonnet 4.5 tops the list), but perhaps more interesting: humans are still way ahead! In one of our arenas, even the worst solution from the human leaderboard is miles ahead of the best LM!<p>And we're not surprised: LMs consistently fail to properly adapt to outcomes, hallucinate about reasons for failure, and produce ever messier codebases with every round.<p>More information:<p><a href="https://codeclash.ai/" rel="nofollow">https://codeclash.ai/</a>
<a href="https://arxiv.org/pdf/2511.00839" rel="nofollow">https://arxiv.org/pdf/2511.00839</a>
<a href="https://github.com/codeclash-ai/codeclash" rel="nofollow">https://github.com/codeclash-ai/codeclash</a></p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45824582">https://news.ycombinator.com/item?id=45824582</a></p>
<p>Points: 5</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 05 Nov 2025 16:13:16 +0000</pubDate><link>https://codeclash.ai/</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=45824582</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45824582</guid></item><item><title><![CDATA[New comment by lieret in "Top model scores may be skewed by Git history leaks in SWE-bench"]]></title><description><![CDATA[
<p>[On the SWE-bench team] We read and analyzed a lot of trajectories, but it seems that only recently have models started to exploit this, and only in a small fraction of instances. But yes, it clearly shouldn't have happened (and is now fixed in the new container versions).</p>
]]></description><pubDate>Fri, 12 Sep 2025 00:12:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=45217428</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=45217428</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45217428</guid></item><item><title><![CDATA[New comment by lieret in "Top model scores may be skewed by Git history leaks in SWE-bench"]]></title><description><![CDATA[
<p>[On the SWE-bench team] As someone pointed out, SWE-bench Verified is a subset of tasks that were reviewed to be solvable (i.e., they have enough context in the task description) and that are scored with unit tests that aren't so overly specific that they rule out valid solutions.<p>We've all read & analyzed a large number of agent trajectories. This loophole seems to be something that popped up with the more recent models, and we simply weren't aware of it.<p>As discussed in the GitHub issue, there's a fix in the new version of the SWE-bench containers (currently being rolled out) that makes sure the relevant commits aren't available.<p>Part of what makes SWE-bench a very interesting benchmark is the enormous action space available to the agents that compete on it. However, that also means unexpected things happen as models get better. We're currently working on making all agent runs easily browsable on a website (rather than having to download our AWS buckets) to get even more eyes on the trajectories. Thanks to everyone who uncovered this loophole.</p>
]]></description><pubDate>Thu, 11 Sep 2025 23:23:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=45217129</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=45217129</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45217129</guid></item><item><title><![CDATA[New comment by lieret in "Top model scores may be skewed by Git history leaks in SWE-bench"]]></title><description><![CDATA[
<p>[Also on the SWE-bench team] Part of the reason this didn't surface earlier is that it only seems to affect more recent models, maybe as a result of reward hacking during post-training. We're currently working on making trajectories easier for everyone to access through a web tool (rather than having to download things from AWS) to get even more eyes on the trajectories. The interface will also include search & LM inspection tools to specifically look for anything that might qualify as cheating.</p>
]]></description><pubDate>Thu, 11 Sep 2025 23:15:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=45217083</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=45217083</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45217083</guid></item><item><title><![CDATA[Show HN: Randomly switching between LMs at every step boosts SWE-bench score]]></title><description><![CDATA[
<p>What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4, and it scored higher on SWE-bench than with either model separately.<p>GPT-5 by itself gets 65.0% and Sonnet 4 gets 64.8%, but randomly switching at every step gets us 67.2%.<p>This result was pretty surprising to us. There are a few more experiments in the blog post.</p>
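<p>Conceptually, the per-step switching is as simple as the sketch below. This is not the actual mini-SWE-agent code; query_lm() is a hypothetical stand-in for whatever completion client is used, and the model names are just labels.</p>
<pre><code># Illustrative sketch of per-step "model roulette"; not the actual mini-SWE-agent code.
import random

MODELS = ["gpt-5", "claude-sonnet-4"]  # labels for the two models being mixed

def query_lm(model: str, messages: list) -> str:
    """Hypothetical stand-in: send the conversation to `model` and return its reply."""
    return f"[{model}] reply to: {messages[-1]['content']}"

def agent_step(messages: list) -> str:
    # Roll the dice again at *every* step, so each turn may be handled by a different model.
    model = random.choice(MODELS)
    return query_lm(model, messages)

print(agent_step([{"role": "user", "content": "fix the failing test"}]))
</code></pre>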
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44962640">https://news.ycombinator.com/item?id=44962640</a></p>
<p>Points: 5</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 20 Aug 2025 15:09:32 +0000</pubDate><link>https://www.swebench.com/SWE-bench/blog/2025/08/19/mini-roulette/</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44962640</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44962640</guid></item><item><title><![CDATA[New comment by lieret in "GPT-5 on SWE-bench: Cost and performance deep-dive"]]></title><description><![CDATA[
<p>I think gpt-5-mini should really help them. At least judging from these benchmark scores, there probably shouldn't be a huge performance degradation from letting gpt-5-mini drive most of the workflow. Of course, users might still want to just run with the latest and greatest (but even then, gpt-5 will be cheaper, I think).</p>
]]></description><pubDate>Fri, 08 Aug 2025 16:57:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=44839226</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44839226</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44839226</guid></item><item><title><![CDATA[New comment by lieret in "GPT-5 on SWE-bench: Cost and performance deep-dive"]]></title><description><![CDATA[
<p>We evaluated the new GPT models with a minimal agent on SWE-bench Verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 5 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini!<p>Cost is tricky to compare with agents, because agents succeed fast but fail slowly. If an agent doesn't succeed, it just keeps trying until it either succeeds or hits a runtime limit. And that's (almost) what happens.<p>But even so, it's very clear that<p>1. GPT-5 is cheaper than Sonnet 4
2. GPT-5-mini is _incredibly_ cheap for what it provides (you only sacrifice some 5 percentage points, but end up paying maybe 1/5th of the total cost)<p>All of the code to reproduce our numbers is open-source. There's a box at the bottom of the post with the exact command to run to reproduce our numbers.<p>Also very happy to answer questions here!</p>
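<p>One way to read "succeed fast, fail slowly": failed runs tend to burn tokens until a step or runtime limit, so the average cost per instance mixes cheap successes with expensive failures. A rough back-of-the-envelope sketch (all numbers below are made-up placeholders, not measured SWE-bench costs):</p>
<pre><code># Back-of-the-envelope only: placeholder numbers, not measured SWE-bench costs.
def expected_cost_per_instance(solve_rate: float,
                               avg_cost_success: float,
                               avg_cost_failure: float) -> float:
    """Failures usually run until a step/time limit, so they cost more than successes."""
    return solve_rate * avg_cost_success + (1 - solve_rate) * avg_cost_failure

# Hypothetical example: a 65% solver with cheap successes and expensive failures.
print(expected_cost_per_instance(0.65, avg_cost_success=0.30, avg_cost_failure=1.00))
</code></pre>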
]]></description><pubDate>Fri, 08 Aug 2025 16:29:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=44838880</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44838880</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44838880</guid></item><item><title><![CDATA[GPT-5 on SWE-bench: Cost and performance deep-dive]]></title><description><![CDATA[
<p>Article URL: <a href="https://mini-swe-agent.com/latest/blog/2024/01/15/gpt-5-on-swe-bench-cost--performance-deep-dive/">https://mini-swe-agent.com/latest/blog/2024/01/15/gpt-5-on-swe-bench-cost--performance-deep-dive/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44838879">https://news.ycombinator.com/item?id=44838879</a></p>
<p>Points: 4</p>
<p># Comments: 3</p>
]]></description><pubDate>Fri, 08 Aug 2025 16:29:14 +0000</pubDate><link>https://mini-swe-agent.com/latest/blog/2024/01/15/gpt-5-on-swe-bench-cost--performance-deep-dive/</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44838879</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44838879</guid></item><item><title><![CDATA[New comment by lieret in "Show HN: Mini-swe-agent achieves 65% on SWE-bench in 100 lines of python"]]></title><description><![CDATA[
<p>Sorry, I missed that!<p>That's a little bit out of the scope of this project (because we were aiming for the bare minimum of what is needed to get a performant agent, and unfortunately learning from mistakes also isn't measured by most benchmarks, as they require tasks to be solved independently).<p>However, you can always add "memory" to agents by asking them to write to and read from a file in your repo (Claude.md, cursorrules, etc.). You can also try to automate this process and have a mechanism by which the LM decides itself when to put something in those files, similar to how memories work in ChatGPT. I think Cursor also recently started doing that.<p>> checking for new versions of libraries, and write a list of tasks first before the execution<p>Just add it to the prompt! That's not always desired behavior for a command line helper, but I think it shouldn't be too hard to get it to do that by prompting alone.</p>
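<p>A rough sketch of the file-based "memory" idea above (illustrative only; the file name and wiring are hypothetical, not part of mini-swe-agent):</p>
<pre><code># Illustrative sketch of file-based agent "memory"; the names here are hypothetical.
from pathlib import Path

MEMORY_FILE = Path("AGENT_NOTES.md")  # plays the same role as Claude.md / cursorrules

def build_system_prompt(base_prompt: str) -> str:
    """Prepend persistent notes and tell the LM it may update the notes file itself."""
    notes = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else "(no notes yet)"
    return (
        base_prompt
        + "\n\nNotes from previous sessions:\n" + notes
        + f"\n\nWhen you learn something reusable, append it to {MEMORY_FILE} "
        + "with a shell command so future runs can read it."
    )

print(build_system_prompt("You are a command-line coding assistant."))
</code></pre>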
]]></description><pubDate>Thu, 31 Jul 2025 14:55:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=44746342</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44746342</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44746342</guid></item><item><title><![CDATA[Show HN: New SWE-bench leaderboard compares LMs without fancy agent scaffolds]]></title><description><![CDATA[
<p>Hello from the SWE-bench/SWE-agent team at Princeton/Stanford.<p>When we created the SWE-bench benchmark in 2023 from hundreds of real-life GitHub issues/pull requests, the highest score was just a couple of percent. The tasks were so challenging for LMs that most people didn't even want to work on them.<p>Half a year later, SWE-agent showed that the early 2024 LMs were actually good enough to resolve up to 20% of the GitHub issues in the benchmark. This kicked off a whole wave of coding agents.<p>Back then, developing agents was all about working around tons of silly behavior from the LMs. For example, if a command didn't work, they would try running the exact same command again. If a command didn't return output, they would assume it never ran. They also couldn't get whitespace right in their edits, would get stuck in repetitive attempts, and much more.<p>So agents got pretty complicated to work around all of that bad LM behavior.<p>But now it's 2025, and LM companies have invested a whole lot of money to make their LMs really good at being agents.<p>So we asked two questions:<p>1. What's the simplest agent we can write that still scores near SotA?
2. How do LMs compare when we evaluate them using this simple agent?<p>Turns out, the agent can be very simple indeed! mini-swe-agent (<a href="https://github.com/SWE-agent/mini-swe-agent">https://github.com/SWE-agent/mini-swe-agent</a>) has only 100 lines of code for the agent class (plus roughly another 100 lines for the environment etc.). It is little more than a loop that parses the LM output for a shell command, executes it in a subshell, and continues (a minimal sketch of this loop is included below).<p>We then took various LMs and put them to the test in a real apples-to-apples comparison, without a fancy agent scaffold to prop up bad LMs.<p>Our new leaderboard <a href="https://www.swebench.com/" rel="nofollow">https://www.swebench.com/</a> shows the results.<p>The highest score is currently 65% with Claude Sonnet 4 (not much less than the 70% that most fancier agents report).<p>o3, o4-mini, and Gemini 2.5 Pro are significantly behind, but not hopeless, achieving 50-60%.<p>We were really surprised by these strong numbers overall: they show that as LMs get stronger and better at performing difficult, highly iterative tasks, we can take our hands off the steering wheel, provide the minimal necessary environment, and let the LM figure out the rest.<p>Let us know if you have any questions, our team is here on HN today :)</p>
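<p>To give a flavor of what "little more than a loop" means, here is a minimal sketch of such a loop. It is not the actual mini-swe-agent source; query_lm() is a hypothetical stand-in for an LM client, and the bash-code-block convention is just an assumption for the sketch.</p>
<pre><code># Minimal sketch of the agent loop described above; not the actual mini-swe-agent source.
# query_lm() is a hypothetical stand-in for an LM client; replace it with a real call.
import re
import subprocess

def query_lm(messages):
    """Hypothetical: return the LM's next reply for the conversation."""
    raise NotImplementedError("plug in your LM client here")

def extract_command(reply: str):
    """Assume the LM answers with one ```bash ...``` block containing a shell command."""
    match = re.search(r"```bash\n(.*?)```", reply, re.DOTALL)
    return match.group(1).strip() if match else None

def run_agent(task: str, max_steps: int = 50) -> None:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = query_lm(messages)
        messages.append({"role": "assistant", "content": reply})
        command = extract_command(reply)
        if command is None:  # no command means the agent considers the task finished
            break
        # Run the command in a subshell and feed its output back as the next observation.
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        messages.append({"role": "user", "content": result.stdout + result.stderr})
</code></pre>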
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44746077">https://news.ycombinator.com/item?id=44746077</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 31 Jul 2025 14:30:43 +0000</pubDate><link>https://www.swebench.com/</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44746077</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44746077</guid></item><item><title><![CDATA[New comment by lieret in "Show HN: Mini-swe-agent achieves 65% on SWE-bench in 100 lines of python"]]></title><description><![CDATA[
<p>In 2024, we developed SWE-bench and SWE-agent at Princeton University and helped kickstart the coding agent revolution.<p>Back then, LMs were optimized to be great at chatting, but not much else. This meant that agent scaffolds had to get very creative (and complicated) to make LMs perform useful work.<p>But in 2025, LMs are actively optimized for agentic coding, and we ask:<p>*What's the simplest coding agent that could still score near SotA on the benchmarks?*<p>*Turns out, it just requires 100 lines of code!*<p>And this system still *resolves 65% of all GitHub issues in the SWE-bench Verified benchmark* with Sonnet 4 (for comparison, when Anthropic launched Sonnet 4, they reported 70% with their own scaffold, which was never made public).<p>Honestly, we're all pretty stunned ourselves: we've now spent more than a year developing SWE-agent and would not have thought that such a small system could perform nearly as well.<p>I'll link to the project below (all open-source, of course). The hello world example is incredibly short & simple (and literally what gave us the 65%). But it is also meant as a serious command line tool + research project, so we provide a Claude Code-style UI & some utilities on top of that.<p>We have some team members from Princeton/Stanford here today, let us know if you have any questions/feedback :)</p>
]]></description><pubDate>Fri, 25 Jul 2025 13:27:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=44682898</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44682898</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44682898</guid></item><item><title><![CDATA[Show HN: Mini-swe-agent achieves 65% on SWE-bench in 100 lines of python]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/SWE-agent/mini-swe-agent">https://github.com/SWE-agent/mini-swe-agent</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44682897">https://news.ycombinator.com/item?id=44682897</a></p>
<p>Points: 7</p>
<p># Comments: 4</p>
]]></description><pubDate>Fri, 25 Jul 2025 13:27:29 +0000</pubDate><link>https://github.com/SWE-agent/mini-swe-agent</link><dc:creator>lieret</dc:creator><comments>https://news.ycombinator.com/item?id=44682897</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44682897</guid></item></channel></rss>