<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: Chamix</title><link>https://news.ycombinator.com/user?id=Chamix</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Tue, 28 Apr 2026 22:12:33 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=Chamix" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by Chamix in "No, it doesn't cost Anthropic $5k per Claude Code user"]]></title><description><![CDATA[
<p>I appreciate the detailed comment! I took the day off and am bored, so have a brain dump of a reply. Basically, I think we are talking past each other on two major points:<p>1. All the discussion about model size CRITICALLY bisects into TOTAL model size vs ACTIVE parameter size (of a "head" in a "Mixture of Experts"). Everything you've said trend-wise is mostly accurate for ACTIVE parameter count, which is what determines inference cost and speed.<p>But I am primarily talking about TOTAL parameter count (which just has to fit inside cluster HBM). The total parameter count only affects training cost and has nothing to do with inference cost or speed. So there is no downside to making total parameter count as big as your inference cluster can fit.<p>2. You touch on distillation, and this heavily relates to the post-gpt-4 base model (call it 5th gen, if gpt-4 was 4th gen), which indeed was used for all models through gpt-5.1.<p>The actual base 5th gen model was as large as OAI could fit on training clusters, and only then distilled down to whatever total size a release model targeted, and the little secret with sparse MoE is that the entire model's weights don't have to fit on a single HBM pool when training (again, plenty of public papers detailing the techniques). This leads to the 2nd little secret, that GPT-4.5 is ALSO using that same base model; as I said in another comment, 4.5 was all an experiment in testing a huge ACTIVE parameter model (which again is all that determines cost and speed), not so much total (which is capped by inference cluster hardware anyways!). How do you think OAI would be able to serve 4.5 at scale if the model itself was 10x bigger in total than everything else? But it's perfectly feasible (just pricier per token) to serve a model with active parameters 10x bigger!<p>So this same huge 5th gen base model was distilled down and RLed over and over again in different permutations and sizes to feed the whole OAI model lineup, from o4-mini to advanced voice to gpt-4.5, all the way until finally 5.2 starts using a new "6th gen" base model (with various failed base model trainings between 5th and 6th) (shallotpeat!).<p>Picking up misc pieces: yes, 4o was tiny when served at Q4, which is what Maia 100 did (with some Q6). We are still talking about a ~1T total model. Quantization, both static and dynamic, was the whole drive behind the gpt-4-turbo variants, which led straight into 4o targeting an extremely economical deployment of the 5th gen base. That economy was sorely needed (arrakis!), since this was all at the critical juncture when 8xH100s had not quite been deployed at scale yet but AI use was rocketing off to mainstream, so we had silly situations like Azure being forced to serve on 256GB clusters. (We could go into a whole separate spiel about quantization and its history, but suffice it to say everything in deployment is just Q4 these days, and training is mostly Q8.)<p>But this DOES NOT mean o1 was tiny; it conveniently was deployed right when 8xH100s WERE available at scale. We split into the instant tree, where 4.1 was bigger than 4o and 5-instant was bigger than 4.1, etc., and the thinking tree, where o1 = o3 < 5-thinking < 5.2-thinking. Again, the ACTIVE counts were comparatively very small, especially as that let you cheaply experiment and then train with the substantial inference compute required for RL training/unrolling! But there was no reason not to fit increasingly large distilled versions of the 5th-gen/6th-gen base models as the inference fleet buildouts (particularly in 2H 2025) came online!
The same 5th and now 6th gen base models were refined and twisted (foundry!) into totally different end models and sizes.<p>I just think this really all comes down to total vs active: not understanding that a huge base model can be distilled into arbitrarily sized release models, and then bizarrely treating Meta's completely incompetent Llama 4 training run (I was there, Gandalf!) as giving any sort of insight into what sparsity ratios cutting-edge labs are using. You cannot learn anything about total parameter size from active parameter count and its derivatives (token speed, cost, etc.)! But on this topic we could again diverge into an entire debate; I'll just say Google is likely doing something like 0.1%-OOM activation ratios in some production configs (Jim Keller is basically shouting extreme sparsity from the rooftops!).<p>Brief rebuttal summary:<p>1. Incorrect as of late 2025. There was plenty of public reporting about Anthropic's dissatisfaction with "Project Rainier". Dario talked candidly about Nvidia compute on the Dwarkesh interview!<p>2. Active vs total.<p>3. 4o is small; 4-bit 4o on Azure is even smaller. 4o is distilled from the 5th gen base, not from gpt-4.<p>4. 256GB at Q4 fits 1T parameters! Active vs total.<p>5. The 5th gen pretrain/base model is huge! 4.5 uses the same base as 4o and 5.1! It can be shrunk to arbitrary size before RL/post-training creates the finished model! Active vs total.<p>6. Active vs total.<p>7. Active vs total; also Ironwood/TPUv7 and Blackwell give much cheaper Q4 inference.<p>8. Don't trust the Zuck.<p>Anyways, it's all a mess and I don't think it's possible to avoid talking past each other or misunderstandings in semi-casual conversation. Even just today, Dylan Patel (who is extremely well informed!) was on the Dwarkesh podcast talking about 5.4-instant having a smaller active parameter count than GPT-4 (220B active), which is completely true, but it instantly gets misinterpreted on twitter et al. as 5.4 being a smaller model than gpt-4, ignoring that 5.4-instant and 5.4-thinking are totally different models, etc. etc. Just too much nuance to easily convey.</p>
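<p>(If it helps make the "total size is capped by what the inference pool can physically hold" point concrete, here's a minimal back-of-the-envelope sketch; the pool sizes and the KV-cache reserve fraction are rough illustrative assumptions, not exact deployment figures:)
<pre><code># Rough capacity math: how many TOTAL parameters fit in a given inference
# memory pool at a given weight precision. Pool sizes and the KV-cache
# reserve are illustrative assumptions, not vendor-confirmed figures.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "q4": 0.5}  # Q4/NVFP4: ~2 params per byte

def max_total_params(pool_gb, precision, kv_reserve=0.2):
    """Upper bound on total (not active) params, reserving a slice of the
    pool for KV cache, activations, and framework overhead."""
    usable_bytes = pool_gb * 1e9 * (1.0 - kv_reserve)
    return usable_bytes / BYTES_PER_PARAM[precision]

pools = [("8x H100 (~640 GB)", 640), ("8x H200 (~1,100 GB)", 1100),
         ("8x B200 (~1,500 GB)", 1500), ("NVL72 Blackwell (~10,000 GB)", 10000),
         ("NVL72 Rubin (~20,000 GB)", 20000)]
for name, gb in pools:
    print(f"{name}: ~{max_total_params(gb, 'q4') / 1e12:.1f}T total params at Q4")
</code></pre>
<p>Run as-is, that formula puts the NVL72-class pools well into the tens of trillions of total params at Q4, which is the whole point about why the buildout timing matters.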
]]></description><pubDate>Fri, 13 Mar 2026 21:59:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=47370540</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=47370540</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47370540</guid></item><item><title><![CDATA[New comment by Chamix in "No, it doesn't cost Anthropic $5k per Claude Code user"]]></title><description><![CDATA[
<p>Sorry if that was unclear, I did mean 100Bs as in the next order of magnitude. Even GPT-4 had ~220B active params, though the trend has been towards increased sparsification (a lower activation:total ratio). GPT-4.5 is the only publicly facing model that approached 1T active parameters (an experiment to see whether there was any value in the extreme inference cost of quadratically scaling compute with naïve-style attention). Nowadays you optimize your head size to your attention kernel arch and obtain performance principally through inference-time scaling (generate more tokens) and parallel consensus (GPT Pro, Gemini Deep Think, etc.), both of which favor faster, cheaper active heads.<p>4o and other H100-era models did indeed drop their activated heads far below gpt-4's, into the 10s of billions, just like the current Hopper-era Chinese open-source models, but sizes went right back up post-Blackwell with the 10x L2 bump (for KV cache), in step with n-log-n attention mechanisms being refined. Similar story for Claude.<p>The fun speculation is wondering about the true size of Gemini 3's internals, given the petabyte+ world size of their home-field Ironwood V7 systems and Jim Keller's public penchant for envisioning extreme MoE-like diversification across hundreds of dedicated sub-models constructed by individual teams within DeepMind.</p>
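<p>(For anyone following along, here's a toy sketch of why active and total parameter counts decouple under top-k MoE routing; the sizes are made-up toy numbers, not any real model's configuration:)
<pre><code>import numpy as np

# Toy top-k MoE layer: per token only k of E experts are read, so the
# "active" expert params per token are k/E of the total expert params,
# no matter how large E (and hence total size) grows. All sizes are toy.
d_model, d_ff, n_experts, top_k = 64, 256, 16, 2
rng = np.random.default_rng(0)
router = rng.standard_normal((d_model, n_experts))
w_in = rng.standard_normal((n_experts, d_model, d_ff))
w_out = rng.standard_normal((n_experts, d_ff, d_model))

def moe_forward(x):                       # x: one token's hidden state, shape (d_model,)
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]  # indices of the top-k experts
    gates = np.exp(logits[chosen]); gates /= gates.sum()
    y = np.zeros(d_model)
    for g, e in zip(gates, chosen):       # only these k experts' weights are touched
        y += g * (np.maximum(x @ w_in[e], 0.0) @ w_out[e])
    return y

total_expert_params = n_experts * 2 * d_model * d_ff
active_expert_params = top_k * 2 * d_model * d_ff
print(moe_forward(rng.standard_normal(d_model)).shape)
print(f"active/total expert params: {active_expert_params / total_expert_params:.1%}")
</code></pre>
<p>Grow n_experts 10x and the per-token FLOPs and weight reads don't move; only the memory pool holding the checkpoint has to grow.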
]]></description><pubDate>Tue, 10 Mar 2026 18:41:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=47327195</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=47327195</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47327195</guid></item><item><title><![CDATA[New comment by Chamix in "No, it doesn't cost Anthropic $5k per Claude Code user"]]></title><description><![CDATA[
<p>What do you think labs are doing with the minimum 10 TB of memory in the NVLink 72 systems that were publicly reported to all start coming online in November/December of last year? And why would this 1 TB -> 10 TB jump matter so much, and why would Anthropic previously have been wholly dependent on running Opus 4.x on TPUs, if the models were only 2-3T params at 4-bit and could fit in the 8x B200 systems (1.5 TB = 3T params) widely deployed during the Opus 4 era?<p>You have presented a vibe-based rebuttal with no evidence or logic to outline why you think labs are still stuck in the single trillions of parameters (GPT-4 was ~1 trillion params!). Though you have successfully Cunningham'd me into saying that while anything I publicly state is derived from public info, working in the industry itself is a helpful guide for pointing at the right public info to reference.</p>
]]></description><pubDate>Tue, 10 Mar 2026 16:59:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=47325911</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=47325911</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47325911</guid></item><item><title><![CDATA[New comment by Chamix in "No, it doesn't cost Anthropic $5k per Claude Code user"]]></title><description><![CDATA[
<p>I assure you, the number of people paying to use Qwen3-Max or other similar proprietary endpoints is far less than 1.6 billion.</p>
]]></description><pubDate>Tue, 10 Mar 2026 07:17:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=47320016</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=47320016</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47320016</guid></item><item><title><![CDATA[New comment by Chamix in "No, it doesn't cost Anthropic $5k per Claude Code user"]]></title><description><![CDATA[
<p>I generally agree; back-of-the-napkin math shows an H20 cluster of 8 GPUs * 96 GB = 768 GB = 768B parameters at FP8 (no NVFP4 on Hopper), which lines up pretty nicely with the sizes of recent open-source Chinese models.<p>However, I'd say it's relatively well assumed in realpolitik land that Chinese labs managed to acquire plenty of H100/H200 clusters, and even meaningful numbers of B200 systems, semi-illicitly before the regulations and anti-smuggling measures really started to crack down.<p>This does raise the question of how nicely the closed-source variants, with undisclosed parameter counts, fit within the 1.1 TB of H200 or 1.5 TB of B200 systems.</p>
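<p>(To make the bytes-per-param arithmetic tangible, here's a toy, hypothetical block-wise 4-bit quantization sketch; real FP8/NVFP4 formats differ in the details, this just shows where the 1 byte/param vs ~0.5 byte/param figures come from:)
<pre><code>import numpy as np

# Toy block-wise symmetric 4-bit weight quantization. FP8 on Hopper is
# 1 byte/param; a 4-bit scheme lands near 0.5 byte/param plus a small
# per-block scale overhead. This is an illustration, not NVFP4 itself.
def quantize_int4(w, block=32):
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map each block to [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float16)

w = np.random.default_rng(0).standard_normal((4096, 4096)).astype(np.float32)
q, scale = quantize_int4(w)
bytes_4bit = q.size * 0.5 + scale.size * 2     # packed nibbles + fp16 scales
print(f"fp16: {w.size * 2 / 1e6:.1f} MB, fp8: {w.size / 1e6:.1f} MB, "
      f"4-bit blocks: {bytes_4bit / 1e6:.1f} MB "
      f"({bytes_4bit / w.size:.2f} bytes/param)")
</code></pre>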
]]></description><pubDate>Tue, 10 Mar 2026 06:44:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=47319811</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=47319811</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47319811</guid></item><item><title><![CDATA[New comment by Chamix in "No, it doesn't cost Anthropic $5k per Claude Code user"]]></title><description><![CDATA[
<p>Try 10s of trillions. These days everyone is running 4-bit at inference (the flagship feature of Blackwell+), with the big flagship models running on recently installed Nvidia 72-GPU Rubin clusters (and an equivalent-ish world size for those rented Ironwood TPUs Anthropic also uses). Let's see: Vera Rubin racks come standard with 20 TB of unified memory (Blackwell NVL72 with 10 TB), and NVFP4 fits 2 parameters per byte...<p>Of course, intense sparsification via MoE (and other techniques ;) ) lets total model size largely decouple from inference speed and cost (within the limits of world size via NVLink/TPU torus caps).<p>So the real mystery, as always, is the actual parameter count of the activated head(s). You can do various speed benchmarks and TPS tracking across likely hardware fleets, and while an exact number is hard to compute, let me tell you, it is not 17B or anywhere in that particular OOM :)<p>Comparing Opus 4.6 or GPT 5.4-thinking or Gemini 3.1 Pro to any sort of Chinese model (on cost) is just totally disingenuous when China does NOT have Vera Rubin NVL72 GPUs or Ironwood V7 TPUs in any meaningful capacity, and is forced to target 8-GPU Blackwell systems (and worse!) for deployment.</p>
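<p>(The kind of back-solve I mean, as a very rough sketch; the bandwidth, speed, and efficiency numbers below are placeholders, not measurements:)
<pre><code># Crude roofline estimate of ACTIVE params from observed single-stream decode
# speed: in memory-bandwidth-bound decoding, each token has to stream the
# active weights out of HBM once, so roughly
#     tokens/sec ~= aggregate_HBM_bandwidth / (active_params * bytes_per_param)
# The bandwidth and efficiency figures are assumptions, and this ignores
# KV-cache reads, batching, and interconnect overhead entirely.
def active_params_estimate(tokens_per_sec, hbm_bandwidth_tbs, bytes_per_param=0.5,
                           efficiency=0.6):
    usable = hbm_bandwidth_tbs * 1e12 * efficiency
    return usable / (tokens_per_sec * bytes_per_param)

# e.g. a hypothetical model decoding at 120 tok/s on a node with ~25 TB/s
# of aggregate HBM bandwidth:
est = active_params_estimate(120, 25)
print(f"~{est / 1e9:.0f}B active params (very rough)")
</code></pre>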
]]></description><pubDate>Tue, 10 Mar 2026 06:21:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=47319659</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=47319659</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47319659</guid></item><item><title><![CDATA[New comment by Chamix in "Claude’s C Compiler vs. GCC"]]></title><description><![CDATA[
<p>You know, it sure does add some additional perspective to the original Anthropic marketing materia... ahem, I mean article, to learn that the CCC-compiled SQLite runtime could potentially run up to 158,000 times slower than a GCC-compiled one...<p>Nevertheless, the victories continue to be closer to home.</p>
]]></description><pubDate>Mon, 09 Feb 2026 05:50:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=46941971</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=46941971</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46941971</guid></item><item><title><![CDATA[New comment by Chamix in "GPT-4.5"]]></title><description><![CDATA[
<p>Indeed, and you could in essence achieve the difference yourself with a different system prompt on 4o. What exactly is 4.5 contributing here in terms of a more nuanced intelligence?<p>The new RLHF direction (heavily amplified through scaling synthetic training tokens) seems to clobber whatever minor gains the improved base internet-text prediction might've added.</p>
]]></description><pubDate>Thu, 27 Feb 2025 22:24:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=43199215</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=43199215</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43199215</guid></item><item><title><![CDATA[New comment by Chamix in "GPT-4.5"]]></title><description><![CDATA[
<p>It's interesting to compare the cost of that original gpt-4-32k (0314) vs gpt-4.5:<p>$60/M input tokens vs $75/M input tokens<p>$120/M output tokens vs $150/M output tokens</p>
]]></description><pubDate>Thu, 27 Feb 2025 22:09:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=43199076</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=43199076</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43199076</guid></item><item><title><![CDATA[New comment by Chamix in "Sohu – first specialized chip (ASIC) for transformer models"]]></title><description><![CDATA[
<p>Forgive me if I'm missing your existing realization (I did a quick check of your HN, reddit, twitter, LW), but I think the big deal with Sohu (wrt Etched) is that they have pivoted from "all model parameters hard-etched onto the chip" to "only transformer ops (matmul etc.) etched onto the chip".<p>Sohu does not have the LLaMA 70B weights directly lithographed onto the silicon, as you seem (?) to be implying by attaching that 6-month-old post.<p>Seems like a sensible pivot; I'd imagine they're rather up to date on the pulse of dynamically updated nets potentially being a major feature in upcoming frontier models, as you've recently been commenting on. However, I'm not deep enough in it to be sure how much this removes their differentiation vs other AI accelerator startups.</p>
]]></description><pubDate>Sat, 29 Jun 2024 04:09:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=40827835</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=40827835</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40827835</guid></item><item><title><![CDATA[New comment by Chamix in "Show HN: I made an open-source Loom alternative"]]></title><description><![CDATA[
<p>I was thinking about the llm writing tool from Janus.</p>
]]></description><pubDate>Mon, 13 May 2024 15:53:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=40344706</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=40344706</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40344706</guid></item><item><title><![CDATA[New comment by Chamix in "DBRX: A new open LLM"]]></title><description><![CDATA[
<p>4chan already has a torrent out, of course.</p>
]]></description><pubDate>Thu, 28 Mar 2024 04:13:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=39847716</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=39847716</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39847716</guid></item><item><title><![CDATA[New comment by Chamix in "Sam Altman, Greg Brockman and others to join Microsoft"]]></title><description><![CDATA[
<p>The little secret is that the training run (meaning, creating the raw autocompleting multimodal token weights) for 5 ran in parallel with 4.</p>
]]></description><pubDate>Mon, 20 Nov 2023 08:43:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=38344885</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=38344885</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38344885</guid></item><item><title><![CDATA[New comment by Chamix in "Update on the OpenAI drama: Altman and the board had till 5pm to reach a truce"]]></title><description><![CDATA[
<p>Luckily Eliezer has written hundreds of approachable essays on the  development of his epistemic processes over at lesswrong.com so you too can learn rationality and derive the killeveryonism conclusion yourself.<p>(/s since this is the internet)</p>
]]></description><pubDate>Sun, 19 Nov 2023 03:40:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=38328538</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=38328538</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38328538</guid></item><item><title><![CDATA[New comment by Chamix in "Details emerge of surprise board coup that ousted CEO Sam Altman at OpenAI"]]></title><description><![CDATA[
<p>Fair enough, shame "Large Tokenized Models" etc never entered the nomenclature.</p>
]]></description><pubDate>Sun, 19 Nov 2023 00:10:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=38326553</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=38326553</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38326553</guid></item><item><title><![CDATA[New comment by Chamix in "Details emerge of surprise board coup that ousted CEO Sam Altman at OpenAI"]]></title><description><![CDATA[
<p>You are conflating Ilya's belief in the <i>transformer architecture</i> (with tweaks/compute optimizations) being sufficient for AGI with a belief that LLMs are sufficient to express human-like intelligence. Multi-modality (and the swath of new training data it unlocks) is clearly a key component of creating AGI if we watch Sutskever's interviews from the past year.</p>
]]></description><pubDate>Sat, 18 Nov 2023 19:59:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=38323717</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=38323717</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38323717</guid></item><item><title><![CDATA[New comment by Chamix in "GPT4 is 8 x 220B params = 1.7T params"]]></title><description><![CDATA[
<p>The issue, as pointed out above, is primarily bandwidth (at inference), not addressable memory. Put simply, the best bandwidth stack we currently have is on-package HBM -> NVLink -> Mellanox InfiniBand, and for inference speed you really can't leave NVLink bandwidth (read: an 8x DGX pod) for >100B parameters. And stacking HBM dies is much harder (read: more expensive) than stacking GDDR dies, which is harder than DDR, etc.<p>Cost aside, HBM dies themselves aren't getting significantly denser anytime soon, and there simply isn't enough package space with current manufacturing methods to pack a significantly increased number of dies onto the GPU.<p>So I suspect the major hardware jumps will continue to come from NVLink/NVSwitch. NVLink 4 + NVSwitch 3 actually already allows for up to 256 GPUs <a href="https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper" rel="nofollow noreferrer">https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-ho...</a>; increased link counts will let ever-increasing numbers of GPUs pool with sufficient bandwidth for inference on larger models.<p>As already mentioned, see this HN post about the GH200 <a href="https://news.ycombinator.com/item?id=36133226">https://news.ycombinator.com/item?id=36133226</a>, which has some further discussion about the cutting edge of bandwidth for Nvidia DGX and Google TPU pods.</p>
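<p>(Back-of-the-envelope on why the hierarchy matters: a rough sketch with approximate bandwidth figures and an example model size. In practice weights are sharded across the pod's HBM and only activations/all-reduces cross NVLink, but the ordering of the tiers is the point:)
<pre><code># Rough ceiling on single-stream decode speed if the active weights had to be
# streamed over each tier of the bandwidth hierarchy once per generated token.
# Bandwidth figures are approximate 2023-era numbers; model size is an example.
TIERS_GBPS = {                 # approximate bandwidth per GPU / per link
    "on-package HBM (H100 HBM3)": 3350,
    "NVLink 4 (per GPU)": 900,
    "InfiniBand NDR (per link)": 50,
}
active_params = 100e9          # example: a 100B-active model
bytes_per_param = 2            # fp16/bf16 weights

for tier, gbps in TIERS_GBPS.items():
    toks = gbps * 1e9 / (active_params * bytes_per_param)
    print(f"{tier}: ceiling ~{toks:.1f} tokens/sec per stream")
</code></pre>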
]]></description><pubDate>Wed, 21 Jun 2023 04:23:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=36414369</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=36414369</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36414369</guid></item><item><title><![CDATA[New comment by Chamix in "Apple Vision Pro: Apple’s first spatial computer"]]></title><description><![CDATA[
<p>Agriculture -> Manufacturing -> Service -> Content-based economy; turns out the youth have the head start, as always.</p>
]]></description><pubDate>Mon, 05 Jun 2023 22:05:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=36204635</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=36204635</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36204635</guid></item><item><title><![CDATA[New comment by Chamix in "Amazon CodeWhisperer, Free for Individual Use, Is Now Generally Available"]]></title><description><![CDATA[
<p>Used it for weeks internally; finally gave up after more than once feeling that the day's use was a net productivity loss, even when working with internal Amazon packages (which it should have had a training advantage on vs Copilot). Terrible UX, and Copilot just has miles more intelligent suggestions.<p>Tried it for a bit this morning with the "newest" release and didn't immediately observe any improvement, though this is far from objective of course.</p>
]]></description><pubDate>Thu, 13 Apr 2023 19:11:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=35560604</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=35560604</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35560604</guid></item><item><title><![CDATA[New comment by Chamix in "Large Language Models Are Human-Level Prompt Engineers"]]></title><description><![CDATA[
<p>Spoiler: He did not say that.</p>
]]></description><pubDate>Tue, 11 Apr 2023 03:05:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=35520258</link><dc:creator>Chamix</dc:creator><comments>https://news.ycombinator.com/item?id=35520258</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35520258</guid></item></channel></rss>