Hacker News: spmurrayzzz

New comment by spmurrayzzz in "Hy3"

spmurrayzzz — Thu, 09 Jul 2026 18:25:57 +0000

When orgs/bencmarks claim 1% deviation, in most cases that means measuring perplexity loss on datasets like wikitext or c4. Even if the loss is calculated via KLD or similar, its not a good proxy for whats actually degradaing at the task level across an entire rollout.

And for MoEs, very small amounts of loss can mean you're flipped to entirely different experts (this is also a problem more broadly with numerical stability issues too).

New comment by spmurrayzzz in "Hy3"

spmurrayzzz — Thu, 09 Jul 2026 16:35:53 +0000

I think its good advice to test both on your own evals for sure, but the MoE parameters are already natively FP4 in ds4. Dropping to 2bpw isn't as big of a loss as it seems (and as corroborated by antirez's work).

Its also only 13B active, so your decode speed would be nearly 2x that of Qwen3.6-27B. So there are other latent benefits as well.

New comment by spmurrayzzz in "Please stop the AI confidence theater"

spmurrayzzz — Fri, 03 Jul 2026 15:04:44 +0000

I had a similar reaction to the "I work at an AI company" and finding out it was Dropbox. And I agree with you, they are not in any way an AI company that would be relevant for someone making claims about frontier intelligence.

I'm empathetic to their position though. It is entirely unsurprising to me that someone working in a growth role at Dropbox is unimpressed by the current state of AI relative to its broader claims in the market. They're not working on AI itself nor are they using applied AI where you see the biggest gains (e.g. SWE, ML, data science, etc.).

We still have a significant capability overhang at the frontier for a big chunk of knowledge work task domains, so I think its understandable (given the above selection bias) why someone would think the confidence is overblown. They have a point in their own domain.

New comment by spmurrayzzz in "What happens when you run a CUDA kernel?"

spmurrayzzz — Mon, 29 Jun 2026 21:53:48 +0000

I'm not entirely up to date with the latest batch, but I've reviewed some of the rollouts in the past and my sense is that the models are surprisingly good at getting correct custom kernels in the happy path, but still weak at sustained/shape-robust workloads. Having to deal with writing the full path from scratch compounded by weird memory layouts, odd sizes, routing, unpacking quantized weights, etc. is definitely challenging.

Also, at least a portion of this you could argue is arbitrary and entirely scoped to the eval itself. The fp8 GEMM score could be low simply because one of the shapes is fairly skinny (i.e. not enough math work to keep the compute engine busy for a meaningful amount of time).

New comment by spmurrayzzz in "What happens when you run a CUDA kernel?"

spmurrayzzz — Mon, 29 Jun 2026 16:15:17 +0000

Near-term acquihires are certainly a likely bet I think. But given model progress on related benchmarks like kernelbench [1], I do think a set of more commoditized solutions is also inevitable.

The caveat though is that each new gen of hardware often comes with brand new constraints/features that a given generation of models haven't seen before (e.g. tcgen05 in blackwell was OOD at one point). As the models start to generalize better, this might not be a showstopper, but still an issue at least currently.

[1] https://kernelbench.com/

New comment by spmurrayzzz in "AI OSS tool repo goes archived over night after raising $7.3M Seed"

spmurrayzzz — Sat, 13 Jun 2026 14:44:17 +0000

I used it, but only briefly to evaluate it. It had some overlap with a tool I built myself, was curious if any of the extra features would be useful.

Ultimately I found the data model and UI to be both cumbersome and unintuitive. Langfuse ended up being the observability tool I went with instead over the one I built (and still use today).

New comment by spmurrayzzz in "Open Reproduction of DeepSeek-R1"

spmurrayzzz — Thu, 11 Jun 2026 14:32:09 +0000

One of my favorite code comments of all time is still in the src:

"# TODO: implement a proper validator to compare against ground truth. For now we just check for exact string match on each line of stdout." [1]

This was one of my chief complaints about the entire R1 news cycle, it felt like no one actually read the technical report. They were being heralded for their openness, but they left out the most meaningful details that you'd need to reproduce their work.

[1] https://github.com/huggingface/open-r1/blob/1416fa0cf21595d2...

New comment by spmurrayzzz in "AI is slowing down"

spmurrayzzz — Tue, 09 Jun 2026 17:11:42 +0000

I don't think its smoke and mirrors, though I do have plenty of gripes with how the labs market this product landscape generally speaking.

The newest biggest model can still matter even if you do not run every prompt through it. You'll always have some task where even small amounts of loss are unacceptable and thus you need to make sure frontier intelligence is used for it.

On the router point, yes, routing has some overhead. But the router does not need to run the biggest model to decide which model to use. We've been using tiny classifiers for recommendation engines for ages now, usually on CPU. If routing saves you from sending a large fraction of traffic to the expensive reasoning model, the routing overhead can easily be worth it.

> Also, if there is significant gains from caching, then like.. what are even doing here? Inputting something and then reading cached pieces of text based on their similarity to the input? Kinda like a search engine?

The caching I'm talking about is explicitly the attention/kv cache, so its not input similarity retrieval (that would be more like what you'd use in a RAG/IR system). Prompt caching is generally about reusing already-computed attention scores for repeated prompt prefixes. The idea being you don't recompute the same static system prompt, tool definitions, schemas, long shared context, or repeated boilerplate every time. In more sophisticated systems, you usually store multiple checkpoints so that a small prompt change doesn't result in all-or-nothing hit/miss scenario.

New comment by spmurrayzzz in "AI is slowing down"

spmurrayzzz — Tue, 09 Jun 2026 17:02:50 +0000

> You can't on the one hand say "customers are beginning to understand they can spend less" and on the other hand suggest that this is good for forecasts of revenue.

Sure you can. Just because there is a non-zero amount of margin pressure from the lower tier inference providers does not imply that revenue forecasts ought to be poor. Jevon's Paradox gets oversold in this current cycle, but I do think it's a relevant lens to view this through given how much demand has outpaced capacity.

The argument is that customers learning to spend less per task can be good for the viability of the market (really the total demand) even if it is bad for naive revenue-per-token assumptions. If a workflow goes from economically stupid to economically viable because you route 80% of it to cheaper models and reserve frontier models for the hard cases, that can expand total usage and improve cost per useful outcome.

New comment by spmurrayzzz in "AI is slowing down"

spmurrayzzz — Mon, 08 Jun 2026 18:36:19 +0000

There is a piece of this I agree with. That you do not need to be a deep technical expert to notice that a company is burning cash by overcommitting to capex, or relying on heroic revenue projections that may or may not come to pass.

But that is not the full argument he is making. If the claim is that the labs will not be able to pay their creditors because inference is structurally incapable of becoming profitable, then he absolutely needs to be right about the technical economics of inference.

One part of that is the balance-sheet argument (which already shows insanely good margins). But it also depends on how inference-time compute actually works: routing, batching, kv cache reuse, model segmentation, different latency tiers, etc. Much of those details he's just been straight up wrong about in his writing, so as a result I have to call into question the rest of his reasoning as well (in part to avoid Gell-Mann amnesia).

New comment by spmurrayzzz in "AI is slowing down"

spmurrayzzz — Mon, 08 Jun 2026 18:00:25 +0000

There's examples both in his writing and also in his appearances on podcasts, interviews, etc.

I'll cherry pick a couple:

“When these new models ‘reason,’ they break a user’s input and break into component parts, then run inference on each one of those parts.” [1]

This is not at all how test-time compute works. At best, this is a very loose metaphor that he may have used out of convenience. This might sound a bit pedantic to point out, but this is a very basic thing that he's getting wrong (presumably at least, again it could be that he just used a poor metaphor).

A less pedantic example would be his claims related to gpt-5/chatgpt auto-routing. He argued that having a router means OpenAI can no longer cache static prompts, because the user prompt has to come before the hidden instructions [2]. This is just not at all how this works at inference-time. There is no evidence that the standard approach of system>developer>user instruction hierarchy has changed, the public API and caching docs maintain this.

But even more broadly, it suggests he is reasoning about kv/prefix caching at the wrong level of abstraction. It's true that conventional prefix caching does require a stable prefix, so yes, if you literally put variable user content before the static prompt, you would destroy the cacheability of that static prompt.

But that is exactly why inference systems are designed to preserve reusable prefixes where possible (via checkpointing or similar), and why serving systems care so much about prefix caching. This is also a big part of how disaggregated prefill/decode infra works where cache-aware routing is critical. His argument treats a bad prompt layout as if it were a necessary consequence of routing, rather than an avoidable implementation choice.

A router can read the user request, decide which model path to use, and then construct a normal downstream model call with stable static instructions first and user content later. Treating that as impossible implies a fundamental architectural misunderstanding.

[1] https://www.wheresyoured.at/how-to-argue-with-an-ai-booster/

[2] https://www.wheresyoured.at/how-does-gpt-5-work/

New comment by spmurrayzzz in "AI is slowing down"

spmurrayzzz — Mon, 08 Jun 2026 17:05:37 +0000

He has also consistently demonstrated, at least to me, that he doesn't really understand how inference works from a technical perspective, which weakens much of his core thesis for why there should be a collapse.

I do value having some naysayers in the mix generally, because we do need balanced critique in what is otherwise a very frothy hype cycle. I just don't think he's making sound arguments, and that's even assuming you even agree with his premises in the first place.

My biggest gripe with his napkin math is that he treats inference gross margins as something novel that you can't compare to normal SaaS margins. He's right in part: the constant carousel of R&D costs from model training, related infrastructure buildout, and other adjacent costs required to stay competitive do change the analysis a bit.

But he takes this way too far when he says this is structurally different from normal SaaS margins. The business model definitely doesn't look like Dropbox, but it absolutely looks a lot like AWS, especially early AWS, CDNs, telecom, etc. I can speak to the telecom bit personally, since it's been over half of my professional career as an engineer and, in this specific case, also as a founder. You can have a brutally capital-intensive infra business where profitability depends on utilization, oversubscription, peak-capacity planning, segmentation, and recovering capex over time.

The math he presents gets even more questionable as we see explicit segmentation happening for cost-saving reasons. Many forward-thinking orgs are waking up to the fact that they don't need to use the best, most expensive model for every task. They can route easier tasks to cheaper models, use caching, batch non-urgent workloads, and reserve frontier models for the subset of work that actually needs frontier intelligence. That directly undermines his claim that providers always need to chase frontier intelligence in order to maintain current demand, utilization, and pricing curves.

New comment by spmurrayzzz in "Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model"

spmurrayzzz — Thu, 23 Apr 2026 02:15:55 +0000

This depends a bit on your cost sensitivity and what model families you want support for, but Baseten and Fireworks have been my goto.

Currently Baseten has ~610ms TTFT and ~82 tk/s for Kimi K2.6, which is roughly 2x the throughput of GPT-5.4 (per their openrouter stats). GLM 5 is slightly slower on both metrics, but still strong.

New comment by spmurrayzzz in "GLM-5: Targeting complex systems engineering and long-horizon agentic tasks"

spmurrayzzz — Thu, 12 Feb 2026 17:01:42 +0000

First as an aside, remember that this entire thread is about using local compute. What you're alluding to is some fantasy infinite budget where you have limitless commodity compute. That's not at all the context of this thread.

But disregarding that, this isn't a problem you can solve by turning a knob akin to scaling a stateless k8s cluster.

The whole vertical of distributed RL has been struggling with this for a while. You can in theory just keep adding sandboxes in parallel, but in RLVR you are constrained by 1) the amount of rollout work you can do per gradient update, and 2) the verification and pruning pipeline that gates the reward signal.

You cant just arbitrarily have a large batch size for every rollout phase. Large batches often reduce effective diversity or get dominated by stragglers. And the outer loop is inherently sequential, because each gradient update depends on data generated by a particular policy snapshot. You can parallelize rollouts and the training step internally, but you can’t fully remove the policy-version dependency without drifting off-policy and taking on extra stability headaches.

New comment by spmurrayzzz in "GLM-5: Targeting complex systems engineering and long-horizon agentic tasks"

spmurrayzzz — Thu, 12 Feb 2026 14:20:41 +0000

> That’s kind of a moot point.

I don't believe it's moot, but I understand your point. The fact that models are memory bandwidth bound does not at all mean that other overhead is insignificant. Your practical delivered throughput is the minimum of compute ceiling, bandwidth ceiling, and all the unrelated speed limits you hit in the stack. Kernel launch latency, Python dispatch, framework bookkeeping, allocator churn, graph breaks, and sync points can all reduce effective speed. There are so many points in the training and inference loop where the model isn't even executing.

> And what are you doing that I/O is a bottleneck?

We do a fair amount of RLVR at my org. That's almost entirely waiting for servers/envs to do things, not the model doing prefill or decode (or even up/down weighting trajectories). The model is the cheap part in wall clock terms. The hard limits are in the verifier and environment pipeline. Spinning up sandboxes, running tests, reading and writing artifacts, and shuttling results through queues, these all create long idle gaps where the GPU is just waiting to do something.

New comment by spmurrayzzz in "GLM-5: Targeting complex systems engineering and long-horizon agentic tasks"

spmurrayzzz — Wed, 11 Feb 2026 19:52:18 +0000

No I'm saying there are quite a few more bottlenecks than that (I/O being a big one). Even in the more efficient training frameworks, there's per-op dispatch overhead in python itself. All the boxing/unboxing of python objects to C++ handles, dispatcher lookup + setup, all the autograd bookkeeping, etc.

All of the bottlenecks in sum is why you'd never get to 100% MFUs (but I was conceding you probably don't need to in order to get value)

New comment by spmurrayzzz in "GLM-5: Targeting complex systems engineering and long-horizon agentic tasks"

spmurrayzzz — Wed, 11 Feb 2026 15:54:18 +0000

For inference, even with continuous batching, getting 100% MFUs is basically impossible to do in practice. Even the frontier labs struggle with this in highly efficient infiniband clusters. Its slightly better with training workloads just due to all the batching and parallel compute, but still mostly unattainable with consumer rigs (you spend a lot of time waiting for I/O).

I also don't think the 100% util is necessary either, to be fair. I get a lot of value out of my two rigs (2x rtx pro 6000, and 4x 3090) even though it may not be 24/7 100% MFU. I'm always training, generating datasets, running agents, etc. I would never consider this a positive ROI measured against capex though, that's not really the point.

New comment by spmurrayzzz in "Kimi Released Kimi K2.5, Open-Source Visual SOTA-Agentic Model"

spmurrayzzz — Thu, 29 Jan 2026 14:04:15 +0000

Bits per weight, its an average precision across all the weights. When you quantize these models, they don't just used a fixed precision size across all model layers/weights. There's a mix and it varies per quant method. This is why you can get bit precision that arent "real" in a strict computing sense.

e.g. A 4-bit quant can have half the attention and feed forward tensors in Q6, and the rest in Q4. Due to how block-scaling works, those k-quant dtypes (specifically for llama.cpp/gguf) have larger bpw than they suggest in their name. Q4 is around ~4.5 bpw, and Q6 is ~6.5.

New comment by spmurrayzzz in "Kimi Released Kimi K2.5, Open-Source Visual SOTA-Agentic Model"

spmurrayzzz — Tue, 27 Jan 2026 21:42:05 +0000

I've tested this myself often (as an aside: I'm in said community, I run 2x RTX Pro 6000 locally, 4x 3090 before that), and I think what you said re: "willing to wait" is probably the difference maker for me.

I can run Minimax 2.1 in 5bpw at 200k context fully offloaded to GPU. The 30-40 tk/s feels like a lifetime for long horizon tasks, especially with subagent delegation etc, but it's still fast enough to be a daily driver.

But that's more or less my cutoff. Whenever I've tested other setups that dip into the single and sub-single digit throughput rates, it becomes maddening and entirely unusable (for me).

New comment by spmurrayzzz in "Kimi Released Kimi K2.5, Open-Source Visual SOTA-Agentic Model"

spmurrayzzz — Tue, 27 Jan 2026 14:30:58 +0000

When I've measured this myself, I've never seen a medium-to-long task horizon that would have expert locality such that you wouldn't be hitting the SSD constantly to swap layers (not to say it doesn't exist, just that in the literature and in my own empirics, it doesn't seem to be observed in a way you could rely on it for cache performance).

Over any task that has enough prefill input diversity and a decode phase thats more than a few tokens, its at least intuitive that experts activate nearly uniformly in the aggregate, since they're activated per token. This is why when you do something more than bs=1, you see forward passes light up the whole network.