<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: irthomasthomas</title><link>https://news.ycombinator.com/user?id=irthomasthomas</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 15 Jun 2026 00:22:48 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=irthomasthomas" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by irthomasthomas in "Amazon CEO's talks with U.S. officials triggered crackdown on Anthropic models"]]></title><description><![CDATA[
<p>I will certainly revisit it as more information comes out, but is it your contention that Anthropic solved jailbreaking with Mythos?</p>
]]></description><pubDate>Sat, 13 Jun 2026 23:23:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=48522492</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48522492</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48522492</guid></item><item><title><![CDATA[New comment by irthomasthomas in "Amazon CEO's talks with U.S. officials triggered crackdown on Anthropic models"]]></title><description><![CDATA[
<p>They literally asked for it. Two days ago Amodei wrote an essay urging the government to regulate them. He explicitly cited Mythos as proof that frontier AI has acquired autonomous hacking capabilities that threaten critical infrastructure and national security.<p><pre><code>  "Mythos Preview scrambled the global cybersecurity landscape. But its broader significance is that it proves beyond doubt that AI models are now tools of global and national strategic consequence."


  "The government should have the power to block or deter deployment of the model if it is determined, in light of third-party assessment, to present unacceptable risks. This power must be scoped to the above four specific risks and there must be protective measures against political favoritism or arbitrary decisions" 
</code></pre>
<a href="https://darioamodei.com/post/policy-on-the-ai-exponential" rel="nofollow">https://darioamodei.com/post/policy-on-the-ai-exponential</a><p>A third-party demonstrated that it was possible to jailbreak the safety measures of Fable to access the raw Mythos abilities. Abilities which Anthropic say are too dangerous for the public.<p>Edit. From David Sacks:<p><pre><code>  — A highly credible trusted partner of both Anthropic and the USG who was testing Fable came forward with a jailbreak of those guardrails. The Admin asked Dario to fix the jailbreak or de-deploy the model. Dario refused.

   — In their blog post, Anthropic defended its decision by saying the jailbreak isn’t serious. That is not what the trusted partner and the USG believe; nor is that kind of minimizing language consistent with Anthropic’s brand as the AI safety company. It’s difficult to fathom how they could claim a jailbreak allowing operability of a cyber weapon could be defined as not “serious".</code></pre></p>
]]></description><pubDate>Sat, 13 Jun 2026 22:24:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=48522106</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48522106</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48522106</guid></item><item><title><![CDATA[New comment by irthomasthomas in "Statement on US government directive to suspend access to Fable 5 and Mythos 5"]]></title><description><![CDATA[
<p>It should be easy for a company like Anthropic to prove this beyond a doubt. Why don't they? Why don't they have a collection of prompts and side-by-side comparisons with other models showing how far ahead they are?</p>
]]></description><pubDate>Sat, 13 Jun 2026 09:42:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=48515372</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48515372</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48515372</guid></item><item><title><![CDATA[New comment by irthomasthomas in "Kimi K2.7-Code: open-source coding model with better token efficiency"]]></title><description><![CDATA[
<p>According to this, opencode and cursor cli perform better than claude code: <a href="https://x.com/kunchenguid/status/2065345999682568593" rel="nofollow">https://x.com/kunchenguid/status/2065345999682568593</a></p>
]]></description><pubDate>Fri, 12 Jun 2026 18:26:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=48507648</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48507648</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48507648</guid></item><item><title><![CDATA[New comment by irthomasthomas in "MiMo Code is now released and open-source"]]></title><description><![CDATA[
<p>I am experimenting with LFM2.5-8B-1A and getting 250tps on a 3060</p>
]]></description><pubDate>Thu, 11 Jun 2026 17:06:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=48493081</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48493081</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48493081</guid></item><item><title><![CDATA[New comment by irthomasthomas in "DiffusionGemma: 4x Faster Text Generation"]]></title><description><![CDATA[
<p>I have had a better experience with my own use. I use it every day and it rarely fails to improve tasks. Perhaps the prompts and rubrics make a difference. And finding bugs is one of the better use cases because it is essentially a search problem. As long as models are non-deterministic and there is some diversity in training data, an ensemble that iterates on the problem is more likely to cover the ground needed to solve it.<p>Some tasks benefit from this approach more than others. There was a paper from Google on a very similar version they made, which achieved what was then SOTA on planning and pathfinding benchmarks.<p>edit:<p>Mind Evolution paper
<a href="https://deepmind.google/research/publications/122391/" rel="nofollow">https://deepmind.google/research/publications/122391/</a><p>(That was a month after I published llm-consortium :)
<a href="https://xcancel.com/karpathy/status/1870692546969735361" rel="nofollow">https://xcancel.com/karpathy/status/1870692546969735361</a></p>
]]></description><pubDate>Wed, 10 Jun 2026 22:25:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=48483599</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48483599</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48483599</guid></item><item><title><![CDATA[New comment by irthomasthomas in "DiffusionGemma: 4x Faster Text Generation"]]></title><description><![CDATA[
<p>Mercury-2 is amazing. I am using it frequently as the arbiter in llm-consortium.
The context window is relatively small, so to make it work with larger consortiums I can construct a recursive, sort-of meta consortium like this:<p><pre><code>  llm consortium save cns-glm -m glm-5.2 -n 5 --arbiter mercury-2 --judging-method rank

  llm consortium save cns-kimi -m k2.6 -n 5 --arbiter mercury-2 --judging-method rank

  llm consortium save cns-meta-glm-kimi -m cns-glm -m cns-kimi --arbiter mercury-2 --judging-method synthesis
</code></pre>
Now when I prompt cns-meta-glm-kimi it will pick the best of five from kimi and glm before creating a synthesis from the two winners.</p>
]]></description><pubDate>Wed, 10 Jun 2026 21:54:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=48483242</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48483242</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48483242</guid></item><item><title><![CDATA[New comment by irthomasthomas in "AWS Bedrock to require sharing data with Anthropic for Mythos and future models"]]></title><description><![CDATA[
<p>Is it a larger model or just better trained? Anthropic does not actually claim it is a larger model anywhere that I can see.</p>
]]></description><pubDate>Wed, 10 Jun 2026 11:54:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=48474951</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48474951</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48474951</guid></item><item><title><![CDATA[New comment by irthomasthomas in "Claude Fable 5"]]></title><description><![CDATA[
<p>Then it would be slower.</p>
]]></description><pubDate>Tue, 09 Jun 2026 20:57:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=48467659</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48467659</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48467659</guid></item><item><title><![CDATA[New comment by irthomasthomas in "Claude Fable 5"]]></title><description><![CDATA[
<p>"we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design).<p>...<p>Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user."</p>
]]></description><pubDate>Tue, 09 Jun 2026 19:20:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=48466251</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48466251</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48466251</guid></item><item><title><![CDATA[New comment by irthomasthomas in "Claude Fable 5"]]></title><description><![CDATA[
<p>This is just the sales team doing their thing, applying the Law of Scarcity to drive demand.<p>It's exactly the same speed as opus >=4.5, sonnet 4.5, and twice the speed of opus <=4.1.<p>It must have about the same active parameters, or else it's a larger model running in turbo mode (smaller batches) and being heavily subsidized for some reason. But given most of the benchmarks are within 5% I doubt it is a much larger model. Most perplexing.</p>
]]></description><pubDate>Tue, 09 Jun 2026 18:27:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=48465371</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48465371</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48465371</guid></item><item><title><![CDATA[New comment by irthomasthomas in "Claude Fable 5"]]></title><description><![CDATA[
<p>Anthropic has again changed the set of benchmarks they use[0]. This time they have also moved all benchmark scores to the PDF. At a glance it looks like it gains about 5-10% over other models. The speed is about the same as opus >=4.5, sonnet 4.5, and double the speed of opus <=4.1.<p><pre><code>                                Mythos 5   Fable 5    MythosPrev Opus 4.8   GPT-5.5    Gemini 3.1 Pro
  SWE-bench Pro                 80.3       80         77.8       69.2       58.6       54.2
  SWE-bench Ver                 95.5       95         93.9       88.6       -          80.6
  Terminal-Bench                88.0       84.3       -          82.7       83.4       -
  BrowseComp (Single-Agent)     88.0       -          87.9       84.3       84.4       85.9
  BrowseComp (Multi-Agent)      93.3       -          -          88.5       -          -
  HLE (No tools)                59.0       -          56.8       49.8       41.4       44.4
  HLE (Tools)                   64.5       -          64.7       57.9       52.2       51.4
  CharXiv Reasoning (No tools)  88.9       -          86.2       80.5       -          -
  CharXiv Reasoning (Tools)     93.5       -          92.5       89.9       -          -
  BioMystery Bench (Human)      83.9       -          82.6       80.4       -          -
  BioMystery Bench (Hard)       46.1       -          29.6       40.0       -          -
  OSWorld-Verified              85.0       85.0       85.4       83.4       78.7       76.2*
  CritPt                        28.6       -          20.9       27.1       17.7       -
  ArxivMath                     78.5       68.7       71.8       71.5       64.0       -
</code></pre>
[0] <a href="https://news.ycombinator.com/item?id=48312633">https://news.ycombinator.com/item?id=48312633</a><p>Edit: Also in the system card... 
"we’ve
implemented new interventions that limit Claude’s effectiveness for requests targeting
frontier LLM development (for example, on building pretraining pipelines, distributed
training infrastructure, or ML accelerator design).<p>...<p>Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts,
these safeguards will not be visible to the user."</p>
]]></description><pubDate>Tue, 09 Jun 2026 17:42:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=48464600</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48464600</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48464600</guid></item><item><title><![CDATA[New comment by irthomasthomas in "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second"]]></title><description><![CDATA[
<p>I must have confused mythos with opus 4.7. One of their recent model cards confirmed that training flops was under the EO reporting requirement of 10^26 flops.</p>
]]></description><pubDate>Tue, 09 Jun 2026 14:13:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=48461434</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48461434</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48461434</guid></item><item><title><![CDATA[New comment by irthomasthomas in "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second"]]></title><description><![CDATA[
<p>Why ask me? Anyway, Mythos is not 10T. Anthropic confirmed the training run was under 10^26 flops. You can't train a 10T model to Chinchilla-optimal and stay under 10^26.<p>Anthropic also confirmed they will not release Mythos, only a "Mythos-class" model, whatever that means.</p>
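<p>A back-of-envelope sketch of that arithmetic. It assumes the common rules of thumb of about 6 training FLOPs per parameter per token and roughly 20 tokens per parameter for Chinchilla-optimal training, and it treats the model as dense. The constants and function names are illustrative assumptions, not figures from Anthropic.</p>

```python
import math

# Assumed rules of thumb (illustrative, not Anthropic's numbers):
TOKENS_PER_PARAM = 20        # Chinchilla-style tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6    # training compute C ~ 6 * N * D
REPORTING_THRESHOLD = 1e26   # the 10^26 FLOP figure mentioned above


def chinchilla_flops(params: float) -> float:
    """Approximate training FLOPs for a dense model trained on 20 tokens/param."""
    tokens = TOKENS_PER_PARAM * params
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def max_params_under(budget: float) -> float:
    """Largest dense parameter count whose Chinchilla-style run fits the budget."""
    return math.sqrt(budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))


if __name__ == "__main__":
    n = 10e12  # 10T parameters
    flops = chinchilla_flops(n)
    print(f"10T params: {flops:.2e} FLOPs")
    print(f"ratio to threshold: {flops / REPORTING_THRESHOLD:.0f}x")
    print(f"max params under threshold: {max_params_under(REPORTING_THRESHOLD):.2e}")
```

<p>Under these assumptions a dense 10T model needs about 1.2e28 FLOPs, roughly 120 times the threshold, and the largest model that fits under 10^26 is a little over 0.9T parameters.</p>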
]]></description><pubDate>Tue, 09 Jun 2026 08:49:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=48458439</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48458439</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48458439</guid></item><item><title><![CDATA[New comment by irthomasthomas in "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second"]]></title><description><![CDATA[
<p>I should have stressed the symbolic part. Everyone has pivoted to symbolic systems like claude code and codex. They would not invest so heavily in such systems if they thought llms would deliver agi soon.</p>
]]></description><pubDate>Tue, 09 Jun 2026 08:40:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=48458371</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48458371</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48458371</guid></item><item><title><![CDATA[New comment by irthomasthomas in "Ask HN: What are tools you have made for yourself since the advent of AI?"]]></title><description><![CDATA[
<p>llm-consortium: prompts multiple models in parallel, loops until the confidence_threshold is met, and iteratively refines a response.<p>This was inspired by a karpathy tweet [0], and the prototype was created using another tool of mine: the LLM Plugin Generator plugin (essentially a curated collection of plugins for simonw's llm cli used as a few-shot prompt).<p>The llm-model-gateway companion plugin lets you serve models from the LLM cli as an openai API. This allows you to use saved consortiums in your various clients as if they were a regular model, bringing massively parallel reasoning to any workflow.<p>It occurred to me at some point that a collection of parallel LLMs was not really a consortium. A consortium is a group of organizations: a group of groups. To rectify this I added support for actual consortiums, where each member of an llm-consortium can itself be a consortium of models, e.g.<p>llm consortium save cns-glm-n3 -m glm-5.1 -n 3 --arbiter mercury-2<p>llm consortium save cns-k2-n3 -m kimi-k2.6:3 --arbiter mercury-2<p>llm consortium save cns-meta-glm-k2 -m cns-k2-n3 -m cns-glm-n3 --arbiter cns-k2-n3<p>Yes, even the arbiter/judge can be composed of a consortium of models, bringing parallel reasoning to the task of judging parallel reasoning chains.<p>Consortiums can also now contain groups of specialists. These custom user-defined expert characters each address the prompt from a different perspective. And a Westworld-style attribute matrix can be randomized to inject some more entropy into the process.<p>[0]<a href="https://xcancel.com/karpathy/status/1870692546969735361" rel="nofollow">https://xcancel.com/karpathy/status/1870692546969735361</a><p>Some other llm plugins I vibe coded:<p>classifai 
 generates labels with approximate confidence derived from logprobs<p>llm-alias-options 
 saves inference parameters such as reasoning effort with a model alias. (good for setting the provider in openrouter or creating a consortium of high temperature models)<p>llm-prompt-json 
 adds a --json flag to return the llm logs object (good for getting conversation_id, or reasoning output in scripts)<p>llm-jina adds support for all jina AI specialised models and tools like web fetching, embedding and reranking.</p>
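<p>A minimal sketch of the llm-consortium loop described at the top of this comment (prompt members in parallel, let an arbiter judge, repeat until confidence_threshold). This is not the plugin's actual code: run_consortium, the Model and Arbiter types, and max_iterations are hypothetical names for illustration, and the members are plain Python callables rather than real LLM calls.</p>

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Hypothetical types: a model maps a prompt to text; an arbiter maps the
# original prompt plus all responses to (synthesis, confidence in 0..1).
Model = Callable[[str], str]
Arbiter = Callable[[str, List[str]], Tuple[str, float]]


def run_consortium(
    prompt: str,
    models: List[Model],
    arbiter: Arbiter,
    confidence_threshold: float = 0.8,
    max_iterations: int = 3,
) -> Tuple[str, float, int]:
    """Return (answer, confidence, iterations_used)."""
    if not models or max_iterations < 1:
        raise ValueError("need at least one model and one iteration")
    current_prompt = prompt
    answer, confidence, used = "", 0.0, 0
    for used in range(1, max_iterations + 1):
        # Prompt every member in parallel; map() keeps the input order.
        with ThreadPoolExecutor(max_workers=len(models)) as pool:
            responses = list(pool.map(lambda m: m(current_prompt), models))
        answer, confidence = arbiter(prompt, responses)
        if confidence >= confidence_threshold:
            break
        # Feed the synthesis back so the next round can refine it.
        current_prompt = f"{prompt}\n\nPrevious synthesis to improve:\n{answer}"
    return answer, confidence, used


if __name__ == "__main__":
    members = [lambda p: "answer A", lambda p: "answer B"]
    rounds = []

    def toy_arbiter(prompt, responses):
        rounds.append(responses)
        # Pretend confidence improves with each round.
        return " / ".join(responses), 0.5 + 0.2 * len(rounds)

    print(run_consortium("toy question", members, toy_arbiter))
```

<p>Because a consortium in this sketch has the same shape as a model (prompt in, text out), a nested consortium is just a member such as lambda p: run_consortium(p, inner_members, inner_arbiter)[0], and the arbiter can be wrapped the same way.</p>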
]]></description><pubDate>Mon, 08 Jun 2026 20:57:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=48451913</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48451913</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48451913</guid></item><item><title><![CDATA[New comment by irthomasthomas in "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second"]]></title><description><![CDATA[
<p>No one is bitter lesson pilled anymore. Everyone is pivoting to neurosymbolic systems. It looks like Gary Marcus was right.</p>
]]></description><pubDate>Mon, 08 Jun 2026 20:17:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=48451256</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48451256</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48451256</guid></item><item><title><![CDATA[New comment by irthomasthomas in "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second"]]></title><description><![CDATA[
<p>I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.</p>
]]></description><pubDate>Mon, 08 Jun 2026 15:58:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=48447057</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48447057</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48447057</guid></item><item><title><![CDATA[New comment by irthomasthomas in "DeepSeek V4 Pro beats GPT-5.5 Pro on precision"]]></title><description><![CDATA[
<p>Actually, simonw has started saying that after qwen 27B beat Opus 4.7<p><a href="https://news.ycombinator.com/item?id=48446348">https://news.ycombinator.com/item?id=48446348</a></p>
]]></description><pubDate>Mon, 08 Jun 2026 15:54:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=48446997</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48446997</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48446997</guid></item><item><title><![CDATA[New comment by irthomasthomas in "DeepSeek V4 Pro beats GPT-5.5 Pro on precision"]]></title><description><![CDATA[
<p>Interesting that Simon declared the pelican dead when qwen 27B overtook opus 4.7. That seems a strange criterion for deciding the utility of a benchmark without more proof. I think it stems from the assumption that opus must be much larger. But I suspect that active parameters are more important than total parameters, and it is possible that the new opus is a very sparse moe with close to 27B active params.<p><pre><code>  "there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models ...
 
  Today, even that loose connection to utility has been broken..." 
</code></pre>
<a href="https://simonwillison.net/2026/Apr/16/qwen-beats-opus/" rel="nofollow">https://simonwillison.net/2026/Apr/16/qwen-beats-opus/</a></p>
]]></description><pubDate>Mon, 08 Jun 2026 15:06:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=48446348</link><dc:creator>irthomasthomas</dc:creator><comments>https://news.ycombinator.com/item?id=48446348</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48446348</guid></item></channel></rss>