<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: ofirpress</title><link>https://news.ycombinator.com/user?id=ofirpress</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 29 Apr 2026 08:03:34 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=ofirpress" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by ofirpress in "SWE-bench Verified no longer measures frontier coding capabilities"]]></title><description><![CDATA[
<p>I'm a co-creator of SWE-bench:<p>1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.<p>2. SWE-bench Multilingual and SWE-bench Multimodal (which we'll open source in the next month) are still unsaturated.<p>3. All benchmarks and benchmark paradigms eventually become saturated. That's why the SWE-bench team has worked hard on building the next stage of benchmarks, and we have a few that are already out, for example <a href="https://codeclash.ai/" rel="nofollow">https://codeclash.ai/</a> or <a href="https://algotune.io/" rel="nofollow">https://algotune.io/</a>. And we'll have more to say soon :)</p>
]]></description><pubDate>Sun, 26 Apr 2026 18:32:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=47912620</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=47912620</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47912620</guid></item><item><title><![CDATA[New comment by ofirpress in "Advancing AI Benchmarking with Game Arena"]]></title><description><![CDATA[
<p>This is a good way to benchmark models. We [the SWE-bench team] took the meta-version of this and implemented it as a new benchmark called CodeClash.<p>We have agents implement agents that play games against each other, so Claude isn't playing against GPT; instead, an <i>agent</i> written by Claude plays poker against an <i>agent</i> written by GPT. This really tough task leads to very interesting findings on AI for coding.<p><a href="https://codeclash.ai/" rel="nofollow">https://codeclash.ai/</a></p>
]]></description><pubDate>Mon, 02 Feb 2026 18:23:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=46859331</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=46859331</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46859331</guid></item><item><title><![CDATA[New comment by ofirpress in "Claude Code daily benchmarks for degradation tracking"]]></title><description><![CDATA[
<p>Benchmarks can get costly to run. You can reach out to frontier model creators to try to get free credits, but usually they'll only agree to that once your benchmark is pretty popular.</p>
]]></description><pubDate>Thu, 29 Jan 2026 15:22:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=46811406</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=46811406</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46811406</guid></item><item><title><![CDATA[New comment by ofirpress in "Claude Code daily benchmarks for degradation tracking"]]></title><description><![CDATA[
<p>[SWE-bench co-author here]
It seems like they run this test on a subset of 50 tasks, and that they only run the test once per day. So a lot of the movement in accuracy could just be noise from such a small sample.
I would run on 300 tasks, and I'd run the test suite 5 or 10 times per day and average the scores. A lot of the variance in the score can come from random factors, such as Anthropic's servers being overloaded.</p>
]]></description><pubDate>Thu, 29 Jan 2026 15:16:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=46811319</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=46811319</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46811319</guid></item><item><title><![CDATA[New comment by ofirpress in "How to code Claude Code in 200 lines of code"]]></title><description><![CDATA[
<p>We (the SWE-bench team) have a 100-line agent that is now pretty popular in both academic and industry labs: <a href="https://github.com/SWE-agent/mini-swe-agent" rel="nofollow">https://github.com/SWE-agent/mini-swe-agent</a><p>I think it's a great way to dive into the agent world.</p>
]]></description><pubDate>Thu, 08 Jan 2026 21:13:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=46546553</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=46546553</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46546553</guid></item><item><title><![CDATA[New comment by ofirpress in "IQuest-Coder: A new open-source code model beats Claude Sonnet 4.5 and GPT 5.1 [pdf]"]]></title><description><![CDATA[
<p>As John says in that thread, we've fixed this issue in SWE-bench: <a href="https://xcancel.com/jyangballin/status/2006987724637757670" rel="nofollow">https://xcancel.com/jyangballin/status/2006987724637757670</a><p>If you run SWE-bench evals, just make sure to use the most up-to-date code from our repo and the updated Docker images.</p>
]]></description><pubDate>Sat, 03 Jan 2026 05:59:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=46473210</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=46473210</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46473210</guid></item><item><title><![CDATA[New comment by ofirpress in "Reflections on AI at the End of 2025"]]></title><description><![CDATA[
<p>> There are certain tasks, like improving a given program for speed, for instance, where in theory the model can continue to make progress with a very clear reward signal for a very long time.<p>Yup, this will absolutely be a big driver of gains in AI for coding in the near future. We actually built a benchmark based on this exact principle: <a href="https://algotune.io/" rel="nofollow">https://algotune.io/</a></p>
]]></description><pubDate>Sat, 20 Dec 2025 20:39:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=46339456</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=46339456</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46339456</guid></item><item><title><![CDATA[New comment by ofirpress in "Top model scores may be skewed by Git history leaks in SWE-bench"]]></title><description><![CDATA[
<p>[I'm on the SWE-bench team] Multiple people have looked into this, for example right in that thread: <a href="https://github.com/SWE-bench/SWE-bench/issues/465#issuecomment-3258065126" rel="nofollow">https://github.com/SWE-bench/SWE-bench/issues/465#issuecomme...</a><p>This issue affected a tiny fraction of existing agents in a tiny fraction of their runs, and we've now issued a fix.<p>This is a natural part of running a benchmark. I'm sure tiny things like this will keep getting discovered, and we'll keep fixing them. This doesn't change the overall picture or trends at all.</p>
]]></description><pubDate>Thu, 11 Sep 2025 19:45:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=45215416</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=45215416</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45215416</guid></item><item><title><![CDATA[New comment by ofirpress in "Ask HN: How to Learn to Build Agentic AI Systems (Like Claude Code)"]]></title><description><![CDATA[
<p>We (the Princeton SWE-bench team) have a 100-line agent that does pretty well; you can read the code here: <a href="https://github.com/SWE-agent/mini-swe-agent" rel="nofollow">https://github.com/SWE-agent/mini-swe-agent</a></p>
]]></description><pubDate>Thu, 28 Aug 2025 20:07:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=45056491</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=45056491</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45056491</guid></item><item><title><![CDATA[New comment by ofirpress in "How to build a coding agent"]]></title><description><![CDATA[
<p>We (the Princeton SWE-bench team) built an agent in ~100 lines of code that does pretty well on SWE-bench; you might enjoy it too: <a href="https://github.com/SWE-agent/mini-swe-agent" rel="nofollow">https://github.com/SWE-agent/mini-swe-agent</a></p>
]]></description><pubDate>Sun, 24 Aug 2025 03:55:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=45001234</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=45001234</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45001234</guid></item><item><title><![CDATA[VideoGameBench from Princeton: Can vision-language models play 90s video games?]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.vgbench.com/">https://www.vgbench.com/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44128603">https://news.ycombinator.com/item?id=44128603</a></p>
<p>Points: 6</p>
<p># Comments: 1</p>
]]></description><pubDate>Thu, 29 May 2025 18:06:54 +0000</pubDate><link>https://www.vgbench.com/</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=44128603</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44128603</guid></item><item><title><![CDATA[New comment by ofirpress in "A Research Preview of Codex"]]></title><description><![CDATA[
<p>Not sure what you mean by benchmaxxing, but we think there's still a lot of useful signal you can get from SWE-bench-style benchmarking.<p>We also have SWE-bench Multimodal, which adds a twist I haven't seen elsewhere:
<a href="https://www.swebench.com/multimodal.html" rel="nofollow">https://www.swebench.com/multimodal.html</a></p>
]]></description><pubDate>Sat, 17 May 2025 00:46:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=44011138</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=44011138</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44011138</guid></item><item><title><![CDATA[New comment by ofirpress in "A Research Preview of Codex"]]></title><description><![CDATA[
<p>[I'm one of the co-creators of SWE-bench] The team managed to improve on the already very strong o3 results on SWE-bench, but it's interesting that we're just seeing an improvement of a few percentage points. I wonder if getting to 85% from 75% on Verified is going to take as long as it took to get from 20% to 75%.</p>
]]></description><pubDate>Fri, 16 May 2025 17:51:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=44008115</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=44008115</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44008115</guid></item><item><title><![CDATA[VideoGameBench: Benchmarking video games for Vision Language Models]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.vgbench.com/">https://www.vgbench.com/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43717340">https://news.ycombinator.com/item?id=43717340</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 17 Apr 2025 14:25:42 +0000</pubDate><link>https://www.vgbench.com/</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=43717340</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43717340</guid></item><item><title><![CDATA[New comment by ofirpress in "Richard Sutton and Andrew Barto Win 2024 Turing Award"]]></title><description><![CDATA[
<p>Good time to re-read The Bitter Lesson: <a href="https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf" rel="nofollow">https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...</a></p>
]]></description><pubDate>Wed, 05 Mar 2025 10:23:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=43264946</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=43264946</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43264946</guid></item><item><title><![CDATA[New comment by ofirpress in "SOTA on swebench-verified: relearning the bitter lesson"]]></title><description><![CDATA[
<p>I'm one of the co-authors of SWE-bench. We just created a JavaScript (+visual) SWE-bench: <a href="https://www.swebench.com/multimodal.html" rel="nofollow">https://www.swebench.com/multimodal.html</a><p>We're going to release the eval suite for this soon so that people can start making submissions.</p>
]]></description><pubDate>Thu, 09 Jan 2025 15:20:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=42646506</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=42646506</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42646506</guid></item><item><title><![CDATA[Why and How ChatGPT Works: Building 5 LMs at Increasing Complexity Levels [video]]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.youtube.com/watch?v=s09NPN1BSdE">https://www.youtube.com/watch?v=s09NPN1BSdE</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=37863007">https://news.ycombinator.com/item?id=37863007</a></p>
<p>Points: 26</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 12 Oct 2023 20:50:35 +0000</pubDate><link>https://www.youtube.com/watch?v=s09NPN1BSdE</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=37863007</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37863007</guid></item><item><title><![CDATA[New comment by ofirpress in "Transformers from Scratch: Building 5 Language Models of Increasing Complexity [video]"]]></title><description><![CDATA[
<p>Thanks for posting this! I'm here if you have any questions.</p>
]]></description><pubDate>Wed, 20 Sep 2023 14:39:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=37584760</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=37584760</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37584760</guid></item><item><title><![CDATA[New comment by ofirpress in "Attention with Linear Biases (ALiBi)"]]></title><description><![CDATA[
<p>The ALiBi paper shows that our method beats the sinusoidal PE you refer to across many benchmarks. <a href="https://arxiv.org/abs/2108.12409" rel="nofollow">https://arxiv.org/abs/2108.12409</a></p>
]]></description><pubDate>Sun, 14 May 2023 07:15:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=35936031</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=35936031</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35936031</guid></item><item><title><![CDATA[New comment by ofirpress in "Attention with Linear Biases (ALiBi)"]]></title><description><![CDATA[
<p>(I wrote ALiBi)<p>Thanks for posting this! You can view a video where I explain what we did and why it's useful at: <a href="https://www.youtube.com/watch?v=Pp61ShI9VGc">https://www.youtube.com/watch?v=Pp61ShI9VGc</a></p>
]]></description><pubDate>Sun, 14 May 2023 07:14:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=35936024</link><dc:creator>ofirpress</dc:creator><comments>https://news.ycombinator.com/item?id=35936024</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35936024</guid></item></channel></rss>