Hacker News: ofirpress

New comment by ofirpress in "SWE-bench Verified no longer measures frontier coding capabilities"

ofirpress — Sun, 26 Apr 2026 18:32:31 +0000

I'm a co-creator of SWE-bench:

1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.

2. SWE-bench Multilingual and SWE-bench Multimodal (which we'll open source in the next month) are still unsaturated.

3. All benchmarks and benchmark paradigms eventually become saturated. That's why the SWE-bench team has worked hard on building the next stage of benchmarks, and we have a few that are already out, for example https://codeclash.ai/ or https://algotune.io/ . And we'll have more to say soon :)

New comment by ofirpress in "Advancing AI Benchmarking with Game Arena"

ofirpress — Mon, 02 Feb 2026 18:23:51 +0000

This is a good way to benchmark models. We [the SWE-bench team] took the meta-version of this and implemented it as a new benchmark called CodeClash.

We have agents implement agents that play games against each other. So Claude isn't playing against GPT; instead, an agent written by Claude plays poker against an agent written by GPT. This really tough task leads to very interesting findings on AI for coding.

https://codeclash.ai/

New comment by ofirpress in "Claude Code daily benchmarks for degradation tracking"

ofirpress — Thu, 29 Jan 2026 15:22:39 +0000

Benchmarks can get costly to run. You can reach out to frontier model creators to try to get them to give you free credits, but usually they'll only agree to that once your benchmark is pretty popular.

New comment by ofirpress in "Claude Code daily benchmarks for degradation tracking"

ofirpress — Thu, 29 Jan 2026 15:16:21 +0000

[SWE-bench co-author here] It seems like they run this test on a subset of 50 tasks, and only once per day, so a lot of the movement in accuracy could be attributed to sampling noise. I would run on 300 tasks, and I'd run the test suite 5 or 10 times per day and average the scores. Lots of variance in the score can come from random factors, even something like Anthropic's servers being overloaded.
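
A rough sketch of the arithmetic behind this suggestion, not something from the original comment. It assumes each task is an independent pass/fail trial with the same pass probability, and that repeated runs are independent of each other; real runs on a fixed task set are likely more correlated than that. The pass rate of 0.6 and the function name are hypothetical, chosen only for illustration.

```python
import math


def accuracy_standard_error(p: float, n_tasks: int, n_runs: int = 1) -> float:
    """Approximate standard error of an averaged pass rate.

    Assumption: every task is an independent Bernoulli trial with pass
    probability p, and the n_runs repeated runs are independent.
    """
    # Binomial standard error of the pass rate for a single run.
    per_run = math.sqrt(p * (1.0 - p) / n_tasks)
    # Averaging independent runs shrinks the error by sqrt(n_runs).
    return per_run / math.sqrt(n_runs)


p = 0.6  # hypothetical pass rate, for illustration only
small = accuracy_standard_error(p, n_tasks=50, n_runs=1)
large = accuracy_standard_error(p, n_tasks=300, n_runs=5)

print(f"50 tasks, 1 run per day:   +/- {small:.3f}")   # +/- 0.069
print(f"300 tasks, 5 runs per day: +/- {large:.3f}")   # +/- 0.013
```

Under these assumptions, a daily score from 50 tasks can move by several percentage points from noise alone, while averaging 5 runs over 300 tasks cuts that noise by roughly a factor of 5.5.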

New comment by ofirpress in "How to code Claude Code in 200 lines of code"

ofirpress — Thu, 08 Jan 2026 21:13:18 +0000

We (the SWE-bench team) have a 100-line agent that is now pretty popular in both academic and industry labs: https://github.com/SWE-agent/mini-swe-agent

I think it's a great way to dive into the agent world.

New comment by ofirpress in "IQuest-Coder: A new open-source code model beats Claude Sonnet 4.5 and GPT 5.1 [pdf]"

ofirpress — Sat, 03 Jan 2026 05:59:37 +0000

As John says in that thread, we've fixed this issue in SWE-bench: https://xcancel.com/jyangballin/status/2006987724637757670

If you run SWE-bench evals, just make sure to use the most up-to-date code from our repo and the updated docker images

New comment by ofirpress in "Reflections on AI at the End of 2025"

ofirpress — Sat, 20 Dec 2025 20:39:59 +0000

> There are certain tasks, like improving a given program for speed, for instance, where in theory the model can continue to make progress with a very clear reward signal for a very long time.

Yup, this will absolutely be a big driver of gains in AI for coding in the near future. We actually built a benchmark based on this exact principle: https://algotune.io/

New comment by ofirpress in "Top model scores may be skewed by Git history leaks in SWE-bench"

ofirpress — Thu, 11 Sep 2025 19:45:56 +0000

[I'm on the SWE-bench team] Multiple people have looked into this, for example right in that thread: https://github.com/SWE-bench/SWE-bench/issues/465#issuecomme...

This issue affected a tiny fraction of existing agents in a tiny fraction of their runs, and we've now issued a fix.

This is a natural part of running a benchmark. I'm sure tiny things like this will keep getting discovered, and we'll keep fixing them. This doesn't change the overall picture or trends at all.

New comment by ofirpress in "Ask HN: How to Learn to Build Agentic AI Systems (Like Claude Code)"

ofirpress — Thu, 28 Aug 2025 20:07:58 +0000

We (the Princeton SWE-bench team) have a 100-line agent that does pretty well. You can read the code here: https://github.com/SWE-agent/mini-swe-agent

New comment by ofirpress in "How to build a coding agent"

ofirpress — Sun, 24 Aug 2025 03:55:08 +0000

We (the Princeton SWE-bench team) built an agent in ~100 lines of code that does pretty well on SWE-bench. You might enjoy it too: https://github.com/SWE-agent/mini-swe-agent

VideoGameBench from Princeton: Can vision-language models play 90s video games?

ofirpress — Thu, 29 May 2025 18:06:54 +0000

Article URL: https://www.vgbench.com/

Comments URL: https://news.ycombinator.com/item?id=44128603

Points: 6

# Comments: 1

New comment by ofirpress in "A Research Preview of Codex"

ofirpress — Sat, 17 May 2025 00:46:16 +0000

Not sure what you mean by benchmaxxing but we think there's still a lot of useful signals you can infer from SWE-bench-style benchmarking.

We also have SWE-bench Multimodal which adds a twist I haven't seen elsewhere: https://www.swebench.com/multimodal.html

New comment by ofirpress in "A Research Preview of Codex"

ofirpress — Fri, 16 May 2025 17:51:33 +0000

[I'm one of the co-creators of SWE-bench] The team managed to improve on the already very strong o3 results on SWE-bench, but it's interesting that we're just seeing an improvement of a few percentage points. I wonder if getting to 85% from 75% on Verified is going to take as long as it took to get from 20% to 75%.

VideoGameBench: Benchmarking video games for Vision Language Models

ofirpress — Thu, 17 Apr 2025 14:25:42 +0000

Article URL: https://www.vgbench.com/

Comments URL: https://news.ycombinator.com/item?id=43717340

Points: 4

# Comments: 0

New comment by ofirpress in "Richard Sutton and Andrew Barto Win 2024 Turing Award"

ofirpress — Wed, 05 Mar 2025 10:23:41 +0000

Good time to re-read The Bitter Lesson: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...

New comment by ofirpress in "SOTA on swebench-verified: relearning the bitter lesson"

ofirpress — Thu, 09 Jan 2025 15:20:58 +0000

I'm one of the co-authors of SWE-bench. We just created a JavaScript (+visual) SWE-bench: https://www.swebench.com/multimodal.html

We're going to release the eval suite for this soon so that people can start making submissions.

Why and How ChatGPT Works: Building 5 LMs at Increasing Complexity Levels [video]

ofirpress — Thu, 12 Oct 2023 20:50:35 +0000

Article URL: https://www.youtube.com/watch?v=s09NPN1BSdE

Comments URL: https://news.ycombinator.com/item?id=37863007

Points: 26

# Comments: 0

New comment by ofirpress in "Transformers from Scratch: Building 5 Language Models of Increasing Complexity [video]"

ofirpress — Wed, 20 Sep 2023 14:39:30 +0000

Thanks for posting this! I'm here if you have any questions.

New comment by ofirpress in "Attention with Linear Biases (ALiBi)"

ofirpress — Sun, 14 May 2023 07:15:29 +0000

The ALiBi paper shows that our method beats the sinusoidal PE you refer to across many benchmarks. https://arxiv.org/abs/2108.12409

New comment by ofirpress in "Attention with Linear Biases (ALiBi)"

ofirpress — Sun, 14 May 2023 07:14:24 +0000

(I wrote ALiBi)

Thanks for posting this! You can view a video where I explain what we did and why it's useful at: https://www.youtube.com/watch?v=Pp61ShI9VGc