<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: tedsanders</title><link>https://news.ycombinator.com/user?id=tedsanders</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 10 Jun 2026 08:43:49 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=tedsanders" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by tedsanders in "FrontierCode"]]></title><description><![CDATA[
<p>Makes sense, thanks. I suppose error bars are tricky if trying to handle problem-to-problem variance, rubric-to-rubric variance, and run-to-run variance all at once.</p>
]]></description><pubDate>Mon, 08 Jun 2026 23:37:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=48453976</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48453976</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48453976</guid></item><item><title><![CDATA[New comment by tedsanders in "Confidential submission of draft S-1 to the SEC"]]></title><description><![CDATA[
<p>The nonprofit (OpenAI Foundation) owns ~26% of the for-profit, plus some extra warrants.<p>The for-profit (OpenAI Group PBC) is what's filing the S-1 Draft.<p>The OpenAI Foundation also exclusively appoints the board of the OpenAI Group PBC and can replace directors at any time.<p><a href="https://openai.com/our-structure/" rel="nofollow">https://openai.com/our-structure/</a><p>(I work at OpenAI, but I am not a lawyer and am not speaking on behalf of OpenAI - just sharing my personal understanding.)</p>
]]></description><pubDate>Mon, 08 Jun 2026 22:24:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=48453175</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48453175</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48453175</guid></item><item><title><![CDATA[New comment by tedsanders in "FrontierCode"]]></title><description><![CDATA[
<p>Very cool! So glad to see people building and sharing evals that are better than SWE-bench.<p>I'm curious - any particular reason you didn't put error bars on the graphs? Seems like it could be helpful when there are only 50 unique problems in the diamond set.</p>
]]></description><pubDate>Mon, 08 Jun 2026 22:03:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=48452920</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48452920</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48452920</guid></item><item><title><![CDATA[An OpenAI model has disproved a central conjecture in discrete geometry]]></title><description><![CDATA[
<p><a href="https://x.com/wtgowers/status/2057175727271800912" rel="nofollow">https://x.com/wtgowers/status/2057175727271800912</a>, <a href="https://xcancel.com/wtgowers/status/2057175727271800912" rel="nofollow">https://xcancel.com/wtgowers/status/2057175727271800912</a></p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48212493">https://news.ycombinator.com/item?id=48212493</a></p>
<p>Points: 1429</p>
<p># Comments: 1055</p>
]]></description><pubDate>Wed, 20 May 2026 19:05:30 +0000</pubDate><link>https://openai.com/index/model-disproves-discrete-geometry-conjecture/</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48212493</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48212493</guid></item><item><title><![CDATA[New comment by tedsanders in "Arena AI Model ELO History"]]></title><description><![CDATA[
<p>What do you mean by this? We don’t train on evals, and if we did I’d quit on the spot.<p>(The loose version of this that’s true is that there may exist eval data contamination in pretraining. This is a hard problem to fully solve.)</p>
]]></description><pubDate>Thu, 14 May 2026 16:21:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=48137586</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48137586</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48137586</guid></item><item><title><![CDATA[New comment by tedsanders in "Arena AI Model ELO History"]]></title><description><![CDATA[
<p>Thanks - let me clarify that we don’t switch to lightly quantized models by time of day or when under heavy load either.<p>(I used the adjective heavily because that’s what the original post said. I have no intention of making misleading but technically true statements.)</p>
]]></description><pubDate>Thu, 14 May 2026 16:15:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=48137505</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48137505</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48137505</guid></item><item><title><![CDATA[New comment by tedsanders in "Arena AI Model ELO History"]]></title><description><![CDATA[
<p>For what it's worth, I work at OpenAI and I can guarantee you that we don't switch to heavily quantized models or otherwise nerf them when we're under high load. It's true that the product experience can change over time - we're frequently tweaking ChatGPT & Codex with the intention of making them better - but we don't pull any nefarious time-of-day shenanigans or similar. You should get what you pay for.</p>
]]></description><pubDate>Thu, 14 May 2026 05:46:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=48131525</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48131525</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48131525</guid></item><item><title><![CDATA[New comment by tedsanders in "Arena AI Model ELO History"]]></title><description><![CDATA[
<p>FYI, Elo isn't an acronym - it's a person's name. No need to capitalize it as ELO.</p>
]]></description><pubDate>Thu, 14 May 2026 05:43:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=48131505</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48131505</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48131505</guid></item><item><title><![CDATA[Get 2 months of Codex for your enterprise, free]]></title><description><![CDATA[
<p>Article URL: <a href="https://openai.com/form/codex-enterprise-promo/">https://openai.com/form/codex-enterprise-promo/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48128400">https://news.ycombinator.com/item?id=48128400</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 13 May 2026 22:24:52 +0000</pubDate><link>https://openai.com/form/codex-enterprise-promo/</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48128400</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48128400</guid></item><item><title><![CDATA[Tau-knowledge: benchmarking agents on real-world knowledge]]></title><description><![CDATA[
<p>Article URL: <a href="https://sierra.ai/blog/tau-knowledge">https://sierra.ai/blog/tau-knowledge</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48125010">https://news.ycombinator.com/item?id=48125010</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 13 May 2026 17:40:28 +0000</pubDate><link>https://sierra.ai/blog/tau-knowledge</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48125010</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48125010</guid></item><item><title><![CDATA[Mythos for Offensive Security: XBOW's Evaluation]]></title><description><![CDATA[
<p>Article URL: <a href="https://xbow.com/blog/mythos-offensive-security-xbow-evaluation">https://xbow.com/blog/mythos-offensive-security-xbow-evaluation</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48115998">https://news.ycombinator.com/item?id=48115998</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 12 May 2026 23:38:04 +0000</pubDate><link>https://xbow.com/blog/mythos-offensive-security-xbow-evaluation</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48115998</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48115998</guid></item><item><title><![CDATA[New comment by tedsanders in "Interaction Models"]]></title><description><![CDATA[
<p>Very cool! The demos felt fairly contrived - e.g., count things while I talk. I wonder what more useful or commercial applications look like.</p>
]]></description><pubDate>Mon, 11 May 2026 22:34:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=48101618</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48101618</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48101618</guid></item><item><title><![CDATA[New comment by tedsanders in "Daybreak Frontier AI for cyber defenders"]]></title><description><![CDATA[
<p>To clarify the title, Daybreak is not a new AI model or a new product. It's a rebranding of OpenAI for Cyber, which is an umbrella over multiple things that OpenAI is doing with companies.</p>
]]></description><pubDate>Mon, 11 May 2026 18:29:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=48098788</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48098788</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48098788</guid></item><item><title><![CDATA[New comment by tedsanders in "How OpenAI delivers low-latency voice AI at scale"]]></title><description><![CDATA[
<p>Dec 2025, actually: <a href="https://developers.openai.com/api/docs/models/gpt-5.5" rel="nofollow">https://developers.openai.com/api/docs/models/gpt-5.5</a><p>(though knowledge cutoffs in practice can be a bit fuzzy)</p>
]]></description><pubDate>Tue, 05 May 2026 00:18:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=48016598</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=48016598</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48016598</guid></item><item><title><![CDATA[New comment by tedsanders in "SWE-bench Verified no longer measures frontier coding capabilities"]]></title><description><![CDATA[
<p>Whether a problem is "good" or "bad" is not always objective or simple.<p>For example, you can have problems that are underspecified, with hardcoded tests for a particular solution (out of multiple possible solutions). If your solution works fine but used a different function name than the one hardcoded in the tests, you can unfairly score 0.<p>When an eval has underspecified problems like these, you can still score 100% if you remember the original solution from your training data or if you just have taste similar to the original human authors. And both of these qualities - good memory and good taste - are great, but they'll be rewarded unfairly relative to a model that still did exactly what it was asked but in a different way than the hardcoded tests expected.</p>
]]></description><pubDate>Sun, 26 Apr 2026 21:01:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=47914397</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=47914397</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47914397</guid></item><item><title><![CDATA[New comment by tedsanders in "GPT-5.5"]]></title><description><![CDATA[
<p>We don't want hallucinations either, I promise you.<p>A few biased defenses:<p>- I'll note that this eval doesn't have web search enabled, but we train our models to use web search in ChatGPT, Codex, and our API. I'd be curious to see hallucination rates with web search on.<p>- This eval only measures binary attempted vs did not attempt, but doesn't really reward any sort of continuous hedging like "I think it's X, but to be honest I'm not sure."<p>- On the flip side, GPT-5.5 has the highest accuracy score.<p>- With any rate over 1% (whether 30% or 70%), you should be verifying anything important anyway.<p>- On our internal eval made from de-identified ChatGPT prompts that previously elicited hallucinations, we've actually been improving substantially from 5.2 to 5.4 to 5.5. So as always, progress depends on how you measure it.<p>- Models that ask more clarifying questions will do better on this eval, even if they are just as likely to hallucinate after the clarifying question.<p>Still, Anthropic has done a great job here and I hope we catch up to them on this eval in the future.</p>
]]></description><pubDate>Fri, 24 Apr 2026 01:24:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=47884420</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=47884420</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47884420</guid></item><item><title><![CDATA[New comment by tedsanders in "GPT-5.5"]]></title><description><![CDATA[
<p>Honest answer is that it isn't done running yet. It takes some human bandwidth and time to run, so results weren't ready by this morning. We don't know what the score will be, but it will probably go up on the leaderboard sometime soon. I personally don't put a lot of stock in the ARC-AGI evals, as they're not relevant to most work that people do, but it should still be interesting to see as a measure of reasoning ability.<p>(I work at OpenAI.)</p>
]]></description><pubDate>Fri, 24 Apr 2026 01:11:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=47884330</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=47884330</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47884330</guid></item><item><title><![CDATA[New comment by tedsanders in "GPT-5.5"]]></title><description><![CDATA[
<p>Agreed. Would be great if everyone starts reporting cost per task alongside eval scores, especially in a world where you can spend arbitrary test-time compute. This is one thing I like about the Artificial Analysis website - they include cost to run alongside their eval scores: <a href="https://artificialanalysis.ai/" rel="nofollow">https://artificialanalysis.ai/</a></p>
]]></description><pubDate>Thu, 23 Apr 2026 20:09:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=47881120</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=47881120</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47881120</guid></item><item><title><![CDATA[New comment by tedsanders in "GPT-5.5"]]></title><description><![CDATA[
<p>Yep, it's more expensive per token.<p>However, I do want to emphasize that this is per token, not per task.<p>If we look at Opus 4.7, it uses smaller tokens (1-1.35x more than Opus 4.6) and it was also trained to think longer. <a href="https://www.anthropic.com/news/claude-opus-4-7" rel="nofollow">https://www.anthropic.com/news/claude-opus-4-7</a><p>On the Artificial Analysis Intelligence Index eval for example, in order to hit a score of 57%, Opus 4.7 takes ~5x as many output tokens as GPT-5.5, which dwarfs the difference in per-token pricing.<p>The token differential varies a lot by task, so it's hard to give a reliable rule of thumb (I'm guessing it's usually going to be well below ~5x), but I hope this shows that price per task is not a linear function of price per token, as different models use different token vocabularies and different amounts of tokens.<p>We have raised per-token prices for our last couple models, but we've also made them a lot more efficient for the same capability level.<p>(I work at OpenAI.)</p>
]]></description><pubDate>Thu, 23 Apr 2026 19:28:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=47880473</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=47880473</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47880473</guid></item><item><title><![CDATA[New comment by tedsanders in "GPT-5.5"]]></title><description><![CDATA[
<p>Not this time, no.</p>
]]></description><pubDate>Thu, 23 Apr 2026 18:15:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=47879299</link><dc:creator>tedsanders</dc:creator><comments>https://news.ycombinator.com/item?id=47879299</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47879299</guid></item></channel></rss>