Hacker News: irthomasthomas

New comment by irthomasthomas in "Amazon CEO's talks with U.S. officials triggered crackdown on Anthropic models"

irthomasthomas — Sat, 13 Jun 2026 23:23:14 +0000

I will certainly revisit it as more information comes out, but is it your contention that Anthropic solved jailbreaking with Mythos?

New comment by irthomasthomas in "Amazon CEO's talks with U.S. officials triggered crackdown on Anthropic models"

irthomasthomas — Sat, 13 Jun 2026 22:24:26 +0000

They literally asked for it. Two days ago Amodei wrote an essay urging the government to regulate them. He explicitly cited Mythos as proof that frontier AI has acquired autonomous hacking capabilities that threaten critical infrastructure and national security.

  "Mythos Preview scrambled the global cybersecurity landscape. But its broader significance is that it proves beyond doubt that AI models are now tools of global and national strategic consequence." 


  "The government should have the power to block or deter deployment of the model if it is determined, in light of third-party assessment, to present unacceptable risks. This power must be scoped to the above four specific risks and there must be protective measures against political favoritism or arbitrary decisions"

https://darioamodei.com/post/policy-on-the-ai-exponential

A third party demonstrated that it was possible to jailbreak the safety measures of Fable to access the raw Mythos abilities. Abilities which Anthropic says are too dangerous for the public.

Edit. From David Sacks:

  — A highly credible trusted partner of both Anthropic and the USG who was testing Fable came forward with a jailbreak of those guardrails. The Admin asked Dario to fix the jailbreak or de-deploy the model. Dario refused.

   — In their blog post, Anthropic defended its decision by saying the jailbreak isn’t serious. That is not what the trusted partner and the USG believe; nor is that kind of minimizing language consistent with Anthropic’s brand as the AI safety company. It’s difficult to fathom how they could claim a jailbreak allowing operability of a cyber weapon could be defined as not “serious".

New comment by irthomasthomas in "Statement on US government directive to suspend access to Fable 5 and Mythos 5"

irthomasthomas — Sat, 13 Jun 2026 09:42:05 +0000

It should be easy for a company like Anthropic to prove this beyond a doubt. Why don't they? Why don't they have a collection of prompts and side-by-side comparisons with other models showing how far ahead they are?

New comment by irthomasthomas in "Kimi K2.7-Code: open-source coding model with better token efficiency"

irthomasthomas — Fri, 12 Jun 2026 18:26:33 +0000

According to this, opencode and cursor cli perform better than claude code: https://x.com/kunchenguid/status/2065345999682568593

New comment by irthomasthomas in "MiMo Code is now released and open-source"

irthomasthomas — Thu, 11 Jun 2026 17:06:41 +0000

I am experimenting with LFM2.5-8B-1A and getting 250tps on a 3060

New comment by irthomasthomas in "DiffusionGemma: 4x Faster Text Generation"

irthomasthomas — Wed, 10 Jun 2026 22:25:27 +0000

I have had a better experience with my own use. I use it every day and it rarely fails to improve tasks. Perhaps the prompts and rubrics make a difference. And finding bugs is one of the better use cases because it is essentially a search problem. As long as models are non-deterministic and there is some diversity in training data, an ensemble that iterates on the problem is more likely to cover the ground needed to solve it.
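A back-of-envelope sketch of that coverage argument. It assumes each run finds the bug independently with the same probability p, which is optimistic: real models are correlated, so the gain will be smaller. The value p = 0.2 is purely illustrative, not a measurement.

```python
def coverage(p: float, k: int) -> float:
    """Chance that at least one of k independent runs succeeds.

    Assumes every run has the same success probability p and that runs
    are independent; both are simplifying assumptions.
    """
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Illustrative per-run success rate, not a measured value.
    for k in (1, 5, 10):
        print(k, round(coverage(0.2, k), 3))
```

With p = 0.2 this works out to roughly 0.67 for five runs and 0.89 for ten, which is why diversity between the members matters more than the quality of any single one.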

Some tasks benefit from this approach more than others. There was a paper from google on a version they made which was very similar and achieved SOTA then on planning and pathfinding benchmarks.

edit:

Mind Evolution paper https://deepmind.google/research/publications/122391/

(That was a month after I published llm-consortium :) https://xcancel.com/karpathy/status/1870692546969735361

New comment by irthomasthomas in "DiffusionGemma: 4x Faster Text Generation"

irthomasthomas — Wed, 10 Jun 2026 21:54:04 +0000

Mercury-2 is amazing. I am using it frequently as the arbiter in llm-consortium. The context window is relatively small, so to make it work with larger consortiums I can construct a recursive, sort-of meta consortium like this:

  llm consortium save cns-glm -m glm-5.2 -n 5 --arbiter mercury-2 --judging-method rank

  llm consortium save cns-kimi -m k2.6 -n 5 --arbiter mercury-2 --judging-method rank

  llm consortium save cns-meta-glm-kimi -m cns-glm -m cns-kimi --arbiter mercury-2 --judging-method synthesis

Now when I prompt cns-meta-glm-kimi it will pick the best of five from kimi and glm before creating a synthesis from the two winners.
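The nesting above can be sketched in a few lines of Python. This is not the llm-consortium source: `consortium`, `rank`, `synthesis` and `make_stub` are hypothetical names, the "models" are stubs, and the rank arbiter simply prefers the longest answer as a stand-in for a real judge model.

```python
from typing import Callable, List

Model = Callable[[str], str]
Arbiter = Callable[[str, List[str]], str]


def consortium(members: List[Model], n: int, arbiter: Arbiter) -> Model:
    """Bundle members so the result is itself callable like a model."""
    def run(prompt: str) -> str:
        # Sample every member n times, then let the arbiter reduce to one answer.
        candidates = [m(prompt) for m in members for _ in range(n)]
        return arbiter(prompt, candidates)
    return run


def rank(prompt: str, candidates: List[str]) -> str:
    # Stand-in for a judge model: "best" here just means longest.
    return max(candidates, key=len)


def synthesis(prompt: str, candidates: List[str]) -> str:
    # Stand-in for a judge model that merges the finalists.
    return " | ".join(candidates)


def make_stub(name: str) -> Model:
    """Fake model whose drafts get longer on every call."""
    calls = {"count": 0}

    def model(prompt: str) -> str:
        calls["count"] += 1
        return f"{name}-draft" + "!" * calls["count"]
    return model


if __name__ == "__main__":
    cns_glm = consortium([make_stub("glm")], n=5, arbiter=rank)
    cns_kimi = consortium([make_stub("kimi")], n=5, arbiter=rank)
    # The meta consortium treats each inner consortium as a single member.
    cns_meta = consortium([cns_glm, cns_kimi], n=1, arbiter=synthesis)
    print(cns_meta("find the bug"))
```

The point of the design is that the arbiter only ever sees one level at a time: five drafts per inner consortium, then two winners at the top, so a small context window is enough.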

New comment by irthomasthomas in "AWS Bedrock to require sharing data with Anthropic for Mythos and future models"

irthomasthomas — Wed, 10 Jun 2026 11:54:10 +0000

Is it a larger model or just better trained? Anthropic does not actually claim it is a larger model anywhere that I can see.

New comment by irthomasthomas in "Claude Fable 5"

irthomasthomas — Tue, 09 Jun 2026 20:57:54 +0000

Then it would be slower.

New comment by irthomasthomas in "Claude Fable 5"

irthomasthomas — Tue, 09 Jun 2026 19:20:32 +0000

"we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design).

...

Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user."

New comment by irthomasthomas in "Claude Fable 5"

irthomasthomas — Tue, 09 Jun 2026 18:27:52 +0000

This is just the sales team doing their thing, applying the Law of Scarcity to drive demand.

It's the same exact speed as opus >=4.5, sonnet 4.5, and twice the speed of opus <=4.1

It must have about the same active parameters, or else it's a larger model running in turbo mode (smaller batches) and being heavily subsidized for some reason. But given most of the benchmarks are within 5% I doubt it is a much larger model. Most perplexing.

New comment by irthomasthomas in "Claude Fable 5"

irthomasthomas — Tue, 09 Jun 2026 17:42:45 +0000

Anthropic has again changed SWE-bench Pro 80.3 SWE-bench Ver 95.5 Terminal-Bench BrowseComp (Single-Agent) 88.0 BrowseComp (Multi-Agent) 93.3 HLE (No tools) 59.0 - HLE (Tools) CharXiv Reasoning (No tools) 88.9 CharXiv Reasoning (Tools) 93.5 BioMystery Bench (Human) 83.9 BioMystery Bench (Hard) 46.1 OSWorld-Verified CritPt ArxivMath [0] =4.5, sonnet 4.5, and double the speed of opus <=4.1

Mythos 5 Fable 5 MythosPrev Opus 4.8 GPT-5.5 Gemini 3.1 Pro 80 77.8 69.2 58.6 54.2 95 93.9 88.6 - 80.6 88.0 84.3 - 82.7 83.4 - - 87.9 84.3 84.4 85.9 - - 88.5 - - 56.8 49.8 41.4 44.4 64.5 - 64.7 57.9 52.2 51.4 - 86.2 80.5 - - - 92.5 89.9 - - - 82.6 80.4 - - - 29.6 40.0 - - 85.0 85.0 85.4 83.4 78.7 76.2* 28.6 - 20.9 27.1 17.7 - 78.5 68.7 71.8 71.5 64.0 -

https://news.ycombinator.com/item?id=48312633

Edit: Also in the system card...

"we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design).

...

Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user."



New comment by irthomasthomas in "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second"
irthomasthomas — Tue, 09 Jun 2026 14:13:17 +0000

I must have confused mythos with opus 4.7. One of their recent model cards confirmed that training flops was under the EO reporting requirement of 10^26 flops.



New comment by irthomasthomas in "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second"
irthomasthomas — Tue, 09 Jun 2026 08:49:37 +0000

Why ask me? Anyway, Mythos is not 10T. Anthropic confirmed the training run was under 10^26 flops. You can't train 10T to Chinchilla and stay under 10^26.
Anthropic also confirmed they will not release Mythos, only a "Mythos-class" model, whatever that means.
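The arithmetic behind the 10T claim, as a rough sketch. It assumes a dense model, the common training-cost estimate of about 6 FLOPs per parameter per token, and the Chinchilla rule of thumb of about 20 tokens per parameter. A sparse MoE would be costed on active rather than total parameters, and a lab can always train short of Chinchilla-optimal, so treat this as an order-of-magnitude check only.

```python
import math

TOKENS_PER_PARAM = 20       # Chinchilla rule of thumb (assumption)
FLOPS_PER_PARAM_TOKEN = 6   # C ~ 6 * N * D training estimate (assumption)


def chinchilla_flops(params: float) -> float:
    """Training FLOPs for a dense model trained to roughly Chinchilla-optimal."""
    tokens = TOKENS_PER_PARAM * params
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def max_params(budget_flops: float) -> float:
    """Largest dense Chinchilla-optimal model that fits in the FLOP budget."""
    return math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))


if __name__ == "__main__":
    print(f"10T dense: {chinchilla_flops(10e12):.2e} FLOPs")
    print(f"max params under 1e26: {max_params(1e26):.2e}")
```

Under these assumptions a 10T dense model needs about 1.2e28 FLOPs, roughly 120 times the 10^26 threshold, and the largest dense Chinchilla-optimal model under the threshold is around 0.9T parameters.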



New comment by irthomasthomas in "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second"
irthomasthomas — Tue, 09 Jun 2026 08:40:35 +0000

I should have stressed the symbolic part. Everyone has pivoted to symbolic systems like claude code and codex. They would not invest so heavily in such systems if they thought llms would deliver agi soon.



New comment by irthomasthomas in "Ask HN: What are tools you have made for yourself since the advent of AI?"
irthomasthomas — Mon, 08 Jun 2026 20:57:42 +0000

llm-consortium: prompts multiple models in parallel, loops until confidence_threshold is reached, and iteratively refines a response.

This was inspired by a karpathy tweet [0], and the prototype was created using another tool of mine: the LLM Plugin Generator plugin (essentially a curated collection of plugins for simonw's llm cli, used as a few-shot prompt).

The llm-model-gateway companion plugin lets you serve models from the LLM cli as an OpenAI API. This allows you to use saved consortiums in your various clients as if they were a regular model, bringing massively parallel reasoning to any workflow.

It occurred to me at some point that a collection of parallel LLMs was not really a consortium. A consortium is a group of organizations: a group of groups. To rectify this I added support for actual consortiums, where each member of an llm-consortium can itself be a consortium of models, e.g.

  llm consortium save cns-glm-n3 -m glm-5.1 -n 3 --arbiter mercury-2

  llm consortium save cns-k2-n3 -m kimi-k2.6:3 --arbiter mercury-2

  llm consortium save cns-meta-glm-k2 -m cns-k2-n3 -m cns-glm-n3 --arbiter cns-k2-n3

Yes, even the arbiter/judge can be comprised of a consortium of models, bringing parallel reasoning to the task of judging parallel reasoning chains.

Consortiums can also now contain groups of specialists. These custom user-defined expert characters each address the prompt from a different perspective. And a Westworld-style attribute matrix can be randomized to inject some more entropy into the process.

[0] https://xcancel.com/karpathy/status/1870692546969735361
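For readers who want the shape of the loop described at the top of this comment, here is a minimal sketch. It is not the plugin's actual code: `run_consortium`, the arbiter signature and the refinement prompt are all hypothetical, and the models are plain callables so the example runs without any API keys.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Model = Callable[[str], str]
# Hypothetical arbiter: returns (synthesized answer, confidence in [0, 1]).
Arbiter = Callable[[str, List[str]], Tuple[str, float]]


def run_consortium(prompt: str, models: List[Model], arbiter: Arbiter,
                   confidence_threshold: float = 0.8,
                   max_iterations: int = 3) -> Tuple[str, float, int]:
    answer, confidence, rounds = "", 0.0, 0
    current = prompt
    with ThreadPoolExecutor() as pool:
        while rounds < max_iterations:
            rounds += 1
            # Fan the same prompt out to every member at once.
            responses = list(pool.map(lambda m: m(current), models))
            answer, confidence = arbiter(current, responses)
            if confidence >= confidence_threshold:
                break
            # Feed the arbiter's draft back in and try again.
            current = f"{prompt}\n\nPrevious synthesis:\n{answer}\n\nImprove on it."
    return answer, confidence, rounds


if __name__ == "__main__":
    state = {"calls": 0}

    def stub_arbiter(prompt: str, responses: List[str]) -> Tuple[str, float]:
        # Pretend confidence rises with each round of refinement.
        state["calls"] += 1
        return " / ".join(responses), 0.4 * state["calls"]

    members = [lambda p: "answer A", lambda p: "answer B"]
    print(run_consortium("What is the bug?", members, stub_arbiter))
```

Because the return value is just an answer string plus metadata, a consortium can be wrapped as a single callable and passed in as a member or as the arbiter of another one, which is all the nesting amounts to.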
Some other llm plugins I vibe coded:

classifai: generates labels with approximate confidence derived from logprobs.

llm-alias-options: saves inference parameters such as reasoning effort with a model alias (good for setting the provider in openrouter or creating a consortium of high temperature models).

llm-prompt-json: adds a --json flag to return the llm logs object (good for getting conversation_id or reasoning output in scripts).

llm-jina: adds support for all Jina AI specialised models and tools like web fetching, embedding and reranking.
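On the "approximate confidence derived from logprobs" idea, a generic sketch of one way to do it. This is not classifai's implementation; `label_confidences` is a hypothetical helper, and it assumes you already have one log-probability per candidate label. The result is approximate because it renormalizes over only the labels supplied, ignoring whatever probability mass the model put elsewhere.

```python
import math
from typing import Dict


def label_confidences(label_logprobs: Dict[str, float]) -> Dict[str, float]:
    """Turn per-label log-probabilities into probabilities that sum to 1."""
    # Subtract the max before exponentiating to avoid underflow.
    top = max(label_logprobs.values())
    weights = {label: math.exp(lp - top) for label, lp in label_logprobs.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


if __name__ == "__main__":
    # Illustrative logprobs, not real model output.
    print(label_confidences({"spam": math.log(0.6), "ham": math.log(0.2)}))
```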



New comment by irthomasthomas in "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second"
irthomasthomas — Mon, 08 Jun 2026 20:17:14 +0000

No one is bitter lesson pilled anymore. Everyone is pivoting to neurosymbolic systems. It looks like Gary Marcus was right.



New comment by irthomasthomas in "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second"
irthomasthomas — Mon, 08 Jun 2026 15:58:29 +0000

I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.



New comment by irthomasthomas in "DeepSeek V4 Pro beats GPT-5.5 Pro on precision"
irthomasthomas — Mon, 08 Jun 2026 15:54:38 +0000

Actually, simonw has started saying that after qwen 27B beat Opus 4.7
https://news.ycombinator.com/item?id=48446348



New comment by irthomasthomas in "DeepSeek V4 Pro beats GPT-5.5 Pro on precision"
irthomasthomas — Mon, 08 Jun 2026 15:06:39 +0000

Interesting that Simon declared the pelican dead when qwen 27B overtook opus 4.7. That seems a strange criterion for deciding the utility of a benchmark, without more proof. I think it stems from the assumption that opus must be much larger. But I suspect that active parameters are more important than total parameters, and it is possible that new opus is a very sparse moe with close to 27B active params.
  "there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models ...
 
  Today, even that loose connection to utility has been broken..." 

https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
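To make the active-versus-total distinction concrete, a small sketch with an entirely hypothetical MoE layout (the shared size, expert size and expert counts are invented for illustration and say nothing about any real model). It uses the usual approximation of about 2 FLOPs per active parameter per generated token, ignoring attention over the context.

```python
def moe_params(shared: float, expert_size: float,
               num_experts: int, experts_per_token: int) -> tuple:
    """Return (total, active) parameter counts for a simple MoE layout."""
    total = shared + expert_size * num_experts
    active = shared + expert_size * experts_per_token
    return total, active


def forward_flops_per_token(active_params: float) -> float:
    # Rough rule of thumb: about 2 FLOPs per active parameter per token.
    return 2.0 * active_params


if __name__ == "__main__":
    # Hypothetical config: 7B shared weights, 128 experts of 5B, 4 routed per token.
    total, active = moe_params(7e9, 5e9, 128, 4)
    print(f"total={total:.3g} active={active:.3g}")
    print(forward_flops_per_token(active) == forward_flops_per_token(27e9))
```

In this made-up layout the model holds 647B parameters but spends the same per-token compute as a dense 27B model, which is the sense in which a very sparse model and a small dense one could land close together on speed.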