<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: spmurrayzzz</title><link>https://news.ycombinator.com/user?id=spmurrayzzz</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 21 May 2026 04:21:10 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=spmurrayzzz" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by spmurrayzzz in "Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model"]]></title><description><![CDATA[
<p>This depends a bit on your cost sensitivity and what model families you want support for, but Baseten and Fireworks have been my go-to.<p>Currently Baseten has ~610ms TTFT and ~82 tk/s for Kimi K2.6, which is roughly 2x the throughput of GPT-5.4 (per their OpenRouter stats). GLM 5 is slightly slower on both metrics, but still strong.</p>
]]></description><pubDate>Thu, 23 Apr 2026 02:15:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=47871618</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=47871618</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47871618</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "GLM-5: Targeting complex systems engineering and long-horizon agentic tasks"]]></title><description><![CDATA[
<p>First as an aside, remember that this entire thread is about using local compute. What you're alluding to is some fantasy infinite budget where you have limitless commodity compute. That's not at all the context of this thread.<p>But disregarding that, this isn't a problem you can solve by turning a knob akin to scaling a stateless k8s cluster.<p>The whole vertical of distributed RL has been struggling with this for a while. You can in theory just keep adding sandboxes in parallel, but in RLVR you are constrained by 1) the amount of rollout work you can do per gradient update, and 2) the verification and pruning pipeline that gates the reward signal.<p>You can't just arbitrarily use a large batch size for every rollout phase. Large batches often reduce effective diversity or get dominated by stragglers. And the outer loop is inherently sequential, because each gradient update depends on data generated by a particular policy snapshot. You can parallelize rollouts and the training step internally, but you can’t fully remove the policy-version dependency without drifting off-policy and taking on extra stability headaches.</p>
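<p>A minimal sketch of why adding sandboxes does not shorten a synchronous on-policy step. The function name and every duration below are hypothetical, chosen only to illustrate the straggler and policy-version constraints described above.</p>

```python
def sync_step_seconds(rollout_seconds, verify_seconds, update_seconds):
    """Wall-clock time of one synchronous on-policy step.

    Rollouts run in parallel, so the step waits on the slowest one (the
    straggler); verification and the gradient update then run in sequence.
    """
    return max(rollout_seconds) + verify_seconds + update_seconds


# Hypothetical per-sandbox rollout durations in seconds, with one straggler.
rollouts = [30.0, 32.0, 29.0, 31.0, 240.0]
step = sync_step_seconds(rollouts, verify_seconds=15.0, update_seconds=20.0)

# Doubling the sandbox count (same duration distribution) adds rollouts per
# step but does not shorten the step: the straggler still gates it.
step_more_sandboxes = sync_step_seconds(rollouts * 2, 15.0, 20.0)

# Each step needs the policy produced by the previous one, so steps add up.
num_updates = 100
total_hours = num_updates * step / 3600
update_share = 20.0 / step

print(step, step_more_sandboxes, round(total_hours, 2), round(update_share, 3))
```

<p>In this toy setup the gradient update is a small share of each step. Asynchronous schemes can hide some of the wait, at the cost of the off-policy drift mentioned above.</p>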
]]></description><pubDate>Thu, 12 Feb 2026 17:01:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=46991356</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=46991356</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46991356</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "GLM-5: Targeting complex systems engineering and long-horizon agentic tasks"]]></title><description><![CDATA[
<p>> That’s kind of a moot point.<p>I don't believe it's moot, but I understand your point. The fact that models are memory bandwidth bound does not at all mean that other overhead is insignificant. Your practical delivered throughput is the minimum of compute ceiling, bandwidth ceiling, and all the unrelated speed limits you hit in the stack. Kernel launch latency, Python dispatch, framework bookkeeping, allocator churn, graph breaks, and sync points can all reduce effective speed. There are so many points in the training and inference loop where the model isn't even executing.<p>> And what are you doing that I/O is a bottleneck?<p>We do a fair amount of RLVR at my org. That's almost entirely waiting for servers/envs to do things, not the model doing prefill or decode (or even up/down weighting trajectories). The model is the cheap part in wall clock terms. The hard limits are in the verifier and environment pipeline. Spinning up sandboxes, running tests, reading and writing artifacts, and shuttling results through queues all create long idle gaps where the GPU is just waiting to do something.</p>
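<p>The "minimum of all the ceilings" framing can be sketched as below. The stage names and numbers are made up for illustration and are not measurements from any real system.</p>

```python
def delivered_tokens_per_s(ceilings):
    """Return the name and value of the binding (smallest) ceiling."""
    name = min(ceilings, key=ceilings.get)
    return name, ceilings[name]


# Hypothetical tokens/s each stage could sustain if it were the only limit.
ceilings = {
    "compute": 5000.0,
    "memory_bandwidth": 1200.0,
    "kernel_launch_and_dispatch": 900.0,
    "env_and_verifier_pipeline": 150.0,
}

bottleneck, rate = delivered_tokens_per_s(ceilings)

# Share of the bandwidth-bound rate that is actually delivered end to end.
gpu_busy_fraction = rate / ceilings["memory_bandwidth"]

print(bottleneck, rate, round(gpu_busy_fraction, 3))
```

<p>With these placeholder numbers the environment pipeline is the binding limit, and the GPU delivers only a fraction of what it could if bandwidth were the sole constraint.</p>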
]]></description><pubDate>Thu, 12 Feb 2026 14:20:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=46989170</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=46989170</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46989170</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "GLM-5: Targeting complex systems engineering and long-horizon agentic tasks"]]></title><description><![CDATA[
<p>No, I'm saying there are quite a few more bottlenecks than that (I/O being a big one). Even in the more efficient training frameworks, there's per-op dispatch overhead in Python itself. All the boxing/unboxing of Python objects to C++ handles, dispatcher lookup + setup, all the autograd bookkeeping, etc.<p>All of these bottlenecks in sum are why you'd never get to 100% MFU (but I was conceding you probably don't need to in order to get value).</p>
]]></description><pubDate>Wed, 11 Feb 2026 19:52:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=46979951</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=46979951</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46979951</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "GLM-5: Targeting complex systems engineering and long-horizon agentic tasks"]]></title><description><![CDATA[
<p>For inference, even with continuous batching, getting 100% MFU is basically impossible to do in practice. Even the frontier labs struggle with this in highly efficient InfiniBand clusters. It's slightly better with training workloads just due to all the batching and parallel compute, but still mostly unattainable with consumer rigs (you spend a lot of time waiting for I/O).<p>I also don't think 100% util is necessary, to be fair. I get a lot of value out of my two rigs (2x RTX Pro 6000, and 4x 3090) even though it may not be 24/7 100% MFU. I'm always training, generating datasets, running agents, etc. I would never consider this a positive ROI measured against capex though, that's not really the point.</p>
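<p>For reference, MFU is the ratio of achieved model FLOPs per second to the hardware's peak. A rough sketch, assuming the common approximation of about 6 FLOPs per parameter per token for dense training (about 2 for a forward-only pass) and ignoring attention FLOPs. All inputs below are hypothetical.</p>

```python
def mfu(params, tokens_per_s, peak_flops_per_s, flops_per_param_token=6.0):
    """Model FLOPs utilization under the ~6*N FLOPs/token dense approximation.

    Use flops_per_param_token=2.0 for forward-only (inference) estimates.
    """
    achieved_flops_per_s = flops_per_param_token * params * tokens_per_s
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical: 7e9-parameter model training at 2000 tokens/s on four
# accelerators, each with an assumed peak of 100e12 FLOPs/s.
training_mfu = mfu(7e9, 2000, 4 * 100e12)

print(round(training_mfu, 3))
```

<p>Any time spent waiting on I/O lowers <code>tokens_per_s</code> and therefore the ratio, which is why 100% is out of reach in practice.</p>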
]]></description><pubDate>Wed, 11 Feb 2026 15:54:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=46976496</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=46976496</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46976496</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "Kimi Released Kimi K2.5, Open-Source Visual SOTA-Agentic Model"]]></title><description><![CDATA[
<p>Bits per weight; it's an average precision across all the weights. When you quantize these models, they don't just use a fixed precision size across all model layers/weights. There's a mix and it varies per quant method. This is why you can get bit precisions that aren't "real" in a strict computing sense.<p>e.g. A 4-bit quant can have half the attention and feed forward tensors in Q6, and the rest in Q4. Due to how block-scaling works, those k-quant dtypes (specifically for llama.cpp/gguf) have larger bpw than they suggest in their name. Q4 is ~4.5 bpw, and Q6 is ~6.5.</p>
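<p>The averaging works out like this. The model size and the 50/50 split are hypothetical; the ~4.5 and ~6.5 bpw figures are the approximate values quoted above.</p>

```python
def average_bpw(groups):
    """Weighted-average bits per weight.

    groups: iterable of (num_weights, bits_per_weight) pairs.
    """
    groups = list(groups)
    total_bits = sum(n * bpw for n, bpw in groups)
    total_weights = sum(n for n, _ in groups)
    return total_bits / total_weights


# Hypothetical 10B-parameter model: half the weights at ~6.5 bpw, half at ~4.5.
mix = [(5_000_000_000, 6.5), (5_000_000_000, 4.5)]
bpw = average_bpw(mix)

# Approximate on-disk size in GB (bits -> bytes -> 1e9 bytes).
size_gb = bpw * 10_000_000_000 / 8 / 1e9

print(bpw, size_gb)
```

<p>Here the blend lands at 5.5 bpw, a precision no single tensor actually uses.</p>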
]]></description><pubDate>Thu, 29 Jan 2026 14:04:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=46810348</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=46810348</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46810348</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "Kimi Released Kimi K2.5, Open-Source Visual SOTA-Agentic Model"]]></title><description><![CDATA[
<p>I've tested this myself often (as an aside: I'm in said community, I run 2x RTX Pro 6000 locally, 4x 3090 before that), and I think what you said re: "willing to wait" is probably the difference maker for me.<p>I can run Minimax 2.1 in 5bpw at 200k context fully offloaded to GPU. The 30-40 tk/s feels like a lifetime for long horizon tasks, especially with subagent delegation etc, but it's still fast enough to be a daily driver.<p>But that's more or less my cutoff. Whenever I've tested other setups that dip into the single and sub-single digit throughput rates, it becomes maddening and entirely unusable (for me).</p>
]]></description><pubDate>Tue, 27 Jan 2026 21:42:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=46787341</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=46787341</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46787341</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "Kimi Released Kimi K2.5, Open-Source Visual SOTA-Agentic Model"]]></title><description><![CDATA[
<p>When I've measured this myself, I've never seen a medium-to-long task horizon that would have expert locality such that you wouldn't be hitting the SSD constantly to swap layers (not to say it doesn't exist, just that in the literature and in my own empirics, it doesn't seem to be observed in a way you could rely on it for cache performance).<p>Over any task that has enough prefill input diversity and a decode phase that's more than a few tokens, it's at least intuitive that experts activate nearly uniformly in the aggregate, since they're activated per token. This is why when you do something more than bs=1, you see forward passes light up the whole network.</p>
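<p>A back-of-the-envelope sketch of that intuition. It assumes an idealized router that picks experts uniformly at random per token, which real routers do not do exactly, and the 256-expert / 8-active layer shape is hypothetical.</p>

```python
def expected_expert_coverage(num_experts, top_k, num_tokens):
    """Expected fraction of experts in one MoE layer hit at least once.

    Assumes every token independently routes to `top_k` distinct experts
    chosen uniformly at random, an idealization of a real router.
    """
    p_miss_one_token = 1.0 - top_k / num_experts
    return 1.0 - p_miss_one_token ** num_tokens


# Hypothetical layer shape: 256 experts, 8 active per token.
for tokens in (1, 10, 100, 1000):
    print(tokens, round(expected_expert_coverage(256, 8, tokens), 4))
```

<p>Under that assumption, coverage of a layer climbs past 95% within about a hundred tokens, so a cache holding only a subset of experts keeps missing.</p>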
]]></description><pubDate>Tue, 27 Jan 2026 14:30:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=46780428</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=46780428</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46780428</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "Gas Town Decoded"]]></title><description><![CDATA[
<p>Preferences I think I get, but prejudices?<p>The OED defines prejudice as a "preconceived opinion that is not based on reason or actual experience."<p>My day to day work involves: full stack web dev, distributed systems, embedded systems, and machine learning. In addition to using AI tooling for dev tasks, we also use agents in production for various workflows and we also train/finetune models (some LLMs, but also other types of neural networks for anomaly detection, fault localization, time series forecasting, etc). I am basing my original commentary in this thread on all of that cumulative experience.<p>It has been my observation over the last almost 30 years of being a professional SWE that full stack web dev has been much easier and simpler than the other domains I work in. And even further, I find that models are much better at that domain on average than the other domains, measured by pass@k scores on private evals representing each domain. Anecdotal experience also tends to match the evals.<p>This tracks with all the other information we have pertaining to benchmark saturation; the "we need harder evals" crowd has been ringing this bell for the last 8-12 months. Models are getting very good at the less complex tasks.<p>I don't believe it will remain that way forever, but at present it's far more common to see someone one-shot a full stack web app from a single prompt than something like a kernel driver for a NIC. One class of devs is seeing a massive performance jump, another class is not.<p>I don't see how that can be perceived as prejudice, it just may be an opinion you don't agree with or an observation that doesn't match your own experience (both of which are totally valid and understandable).</p>
]]></description><pubDate>Tue, 20 Jan 2026 15:10:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=46692584</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=46692584</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46692584</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "Gas Town Decoded"]]></title><description><![CDATA[
<p>I think you're shadow-boxing with a point I never made. I never said experienced devs are not or can not be excited about current AI capabilities.<p>Lots of experienced devs who work in more difficult domains are excited about AI. In fact, I am one of them (see one of my responses in this thread about gpt-oss being able to work on proprietary RF firmware in my company [1]).<p>But that in no way suggests that there isn't a gap in what impresses or surprises engineers across any set of domains. Antirez is probably one of the better, more reasoned examples of this.<p>[1] <a href="https://news.ycombinator.com/item?id=46682604">https://news.ycombinator.com/item?id=46682604</a></p>
]]></description><pubDate>Mon, 19 Jan 2026 20:55:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=46684381</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=46684381</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46684381</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "Gas Town Decoded"]]></title><description><![CDATA[
<p>My comment above I hope wasn't read to mean "LLMs are only good at web dev." Only that there are different capability magnitudes.<p>I often do experiments where I will clone one of our private repos, revert a commit, trash the .git path, and then see if any of the models/agents can re-apply the commit after N iterations. I record the pass@k score and compare between model generations over time.<p>In one of those recent experiments, I saw gpt-oss-120b add API support to swap tx and rx IQ for digital spectral inversion at higher frequencies on our wireless devices. This is for a proprietary IC running a Quantenna radio, the SDK of which is very likely not in-distribution. It was moderately impressive to me in part because just writing the IQ swap registers had a negative effect on performance, but the model found that swapping the order of the IQ imbalance coefficients fixed the performance degradation.<p>I wouldn't say this was the same level of "impressive" as what the hype demands, but I remain an enthusiastic user of AI tooling due to somewhat regular moments like that. Especially when it involves open weight models of a low-to-moderate param count. My original point though is that those moments are far more common in web dev than they are elsewhere currently.<p>EDIT: Forgot to add that the model also did some work that the original commit did not. It removed code paths that were clobbering the rx IQ swap register and instead changed it to explicitly initialize during baseband init so it would come up correct on boot.</p>
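<p>The comment does not say how pass@k is computed. One common choice is the unbiased estimator popularized by the HumanEval/Codex evaluation, sketched here with hypothetical counts.</p>

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled attempts, c of which passed."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    # Probability that a random size-k subset of the n attempts has no pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical run: 10 attempts at re-applying the reverted commit, 3 passed.
for k in (1, 5, 10):
    print(k, round(pass_at_k(10, 3, k), 4))
```

<p>Tracking the same n and k across model generations keeps the scores comparable over time.</p>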
]]></description><pubDate>Mon, 19 Jan 2026 18:22:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=46682604</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=46682604</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46682604</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "Gas Town Decoded"]]></title><description><![CDATA[
<p>This has also been an interesting social experiment in that we get to see what work people think is actually impressive vs trivial.<p>Folks who have spent years effectively snapping together other people’s APIs like LEGOs (and being well-compensated for it) are understandably blown away by the current state of AI. Compare that to someone writing embedded firmware for device microcontrollers, who would understandably be underwhelmed by the same.<p>The gap in reactions says more about the nature of the work than it does about the tools themselves.</p>
]]></description><pubDate>Mon, 19 Jan 2026 16:43:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=46681099</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=46681099</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46681099</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "LLM from scratch, part 28 – training a base model from scratch on an RTX 3090"]]></title><description><![CDATA[
<p>Those cards can be great for lots of use cases, plenty of small models are very capable at the param counts which can fit in 32GB of VRAM. GPT-OSS-20B for example is a serviceable model for agentic coding use cases and it runs natively in MXFP4. So it fits comfortably on a 5090 at full 128k context. It also has enough headroom to do PEFT-style SFT or RL.<p>But given the high entry cost and depending on the cost of electricity in your area, it would take a number of years to amortize both the initial purchase of the card and the energy cost of the compute (comparing to the compute-equivalent hourly cloud rental costs).<p>For context, a single 5090 rented via Runpod is currently $0.69/hr USD on-demand. Cost range on Amazon right now for a new card is running between $3200-3700 USD. Just using the raw capex alone, that's ~5k hours of GPU compute assuming you pay only on-demand. That's 2-3 years worth of compute if you assume compute saturation for normal working hour durations. This is before you account for the cost of power, which in my city could run you upwards of $140/mo varying by season.<p>With that said, I have a bunch of ML servers that I built for myself. The largest one is using 2x RTX Pro 6000s and I have been very happy with it. If I were only doing inference I think this would be a somewhat questionable expense, setting aside the valid motivations that some folks have related to data privacy and security. But I do a lot of finetuning and maintain private/local eval harnesses that personally for me have made it worth the investment.</p>
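<p>The break-even arithmetic, spelled out. The prices are the ones quoted above; the 40 hours/week figure is an assumption standing in for "normal working hour durations", and power cost is left out.</p>

```python
RENTAL_USD_PER_HOUR = 0.69          # on-demand rental price quoted above
CARD_PRICE_USD = (3200.0, 3700.0)   # purchase price range quoted above
WORK_HOURS_PER_YEAR = 40 * 52       # assumption: 40 h/week of saturated use


def break_even_hours(card_price, hourly_rate):
    """Rental hours whose cost equals the purchase price (capex only)."""
    return card_price / hourly_rate


hours = [break_even_hours(p, RENTAL_USD_PER_HOUR) for p in CARD_PRICE_USD]
years = [h / WORK_HOURS_PER_YEAR for h in hours]

print([round(h) for h in hours], [round(y, 1) for y in years])
```

<p>That comes out to roughly 4.6k to 5.4k rental hours, or a bit over two years at 40 hours/week, before electricity pushes the break-even point further out.</p>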
]]></description><pubDate>Tue, 09 Dec 2025 20:19:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=46210085</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=46210085</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46210085</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "Olmo 3: Charting a path through the model flow to lead open-source AI"]]></title><description><![CDATA[
<p>There is an older paper on something related to this [1], where the model outputs reflection tokens that either trigger retrieval or critique steps. The idea is that the model recognizes that it needs to fetch some grounding subsequent to generating some factual content. Then it reviews what it previously generated with the retrieved grounding.<p>The problem with this approach is that it does not generalize well at all out of distribution. I'm not aware of any follow-up to this, but I do think it's an interesting area of research nonetheless.<p>[1] <a href="https://arxiv.org/abs/2310.11511" rel="nofollow">https://arxiv.org/abs/2310.11511</a></p>
]]></description><pubDate>Fri, 21 Nov 2025 16:53:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=46006191</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=46006191</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46006191</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "Survey: a third of senior developers say over half their code is AI-generated"]]></title><description><![CDATA[
<p>I generally agree with this just from a perspective of personal sentiment, it does feel wrong.<p>But statistically speaking, at a 95% confidence level you'd be within a +/- 3.5% margin of error given the 791 sample size, irrespective of whether the population is 30k or 30M.</p>
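<p>A quick check of that figure, assuming a simple random sample and the worst-case proportion p = 0.5 (both assumptions, since the survey's sampling method isn't discussed here). The finite population correction shows how little the population size matters.</p>

```python
import math


def margin_of_error(n, population=None, p=0.5, z=1.96):
    """Margin of error for a proportion at ~95% confidence (z = 1.96).

    If `population` is given, apply the finite population correction.
    """
    se = math.sqrt(p * (1.0 - p) / n)
    if population is not None:
        se *= math.sqrt((population - n) / (population - 1))
    return z * se


moe_infinite = margin_of_error(791)
moe_30k = margin_of_error(791, population=30_000)
moe_30m = margin_of_error(791, population=30_000_000)

print(round(moe_infinite, 4), round(moe_30k, 4), round(moe_30m, 4))
```

<p>All three land at roughly 3.4% to 3.5%; the caveat is whether the respondents are representative, not how many there are.</p>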
]]></description><pubDate>Mon, 01 Sep 2025 00:45:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=45088403</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=45088403</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45088403</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "Crush: Glamourous AI coding agent for your favourite terminal"]]></title><description><![CDATA[
<p>> I spent at least an hour trying to get OpenCode to use a local model and then found a graveyard of PRs begging for Ollama support<p>Almost from day one of the project, I've been able to use local models. llama.cpp worked out of the box with zero issues, same with vLLM and SGLang. The only tweak I had to make initially was manually changing the system prompt in my fork, but now you can do that via their custom modes features.<p>The Ollama support issues are specific to that implementation.</p>
]]></description><pubDate>Wed, 30 Jul 2025 18:06:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=44737553</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=44737553</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44737553</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "LLM Embeddings Explained: A Visual and Intuitive Guide"]]></title><description><![CDATA[
<p>Ah didn't realize you were referring to NoPE explicitly. And yea, the intuitions gained from that paper are pretty much what I alluded to above, you don't get away with never modeling the positional data, the question is how you model it effectively and from where do you derive that signal.<p>NoPE never really took off more broadly in modern architecture implementations. We haven't seen anyone successfully reproduce the proposed solution to the long context problem presented in the paper (tuning the scaling factor in the attention softmax).<p>There is a recent paper back in December[1] that talked about the idea of positional information arising from the similarity of nearby embeddings. It's again in that common research bucket of "never reproduced", but interesting. It does sound similar in spirit though to the NoPE idea you mentioned of the causal mask providing some amount of position signal. i.e. we don't necessarily need to adjust the embeddings explicitly for the same information to be learned (TBD on whether that proves out long term).<p>This all goes back to my original comment though of communicating this idea to AI/ML neophytes being challenging. I don't think skipping the concept of positional information actually makes these systems easier to comprehend since it's critically important to how we model language, but it's also really complicated to explain in terms of implementation.<p>[1] <a href="https://arxiv.org/abs/2501.00073" rel="nofollow">https://arxiv.org/abs/2501.00073</a></p>
]]></description><pubDate>Wed, 30 Jul 2025 14:54:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=44735025</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=44735025</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44735025</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "LLM Embeddings Explained: A Visual and Intuitive Guide"]]></title><description><![CDATA[
<p>Do you have a citation for the paper on that? IME, that's not really something you see used in practice, at least not after 2022 or so. Without some form of positional adjustment, transformer-based LLMs have no way to differentiate between "The dog bit the man." and "The man bit the dog." given the token ids are nearly identical. You just end up back in the bag-of-words problem space. The self-attention mechanism is permutation-invariant, so as long as it remains true that the attention scores are computed as an unordered set, you need some way to model the sequence.<p>Long context is almost always some form of RoPE in practice (often YaRN these days). We can't confirm this with the closed-source frontier models, but given that all the long context models in the open weight domain are absolutely encoding positional data, coupled with the fact that the majority of recent and past literature corroborates its use, we can be reasonably sure they're using some form of it there as well.<p>EDIT: there is a recent paper that addresses the sequence modeling problem in another way, but it's somewhat orthogonal to the above as they're changing the tokenization method entirely <a href="https://arxiv.org/abs/2507.07955" rel="nofollow">https://arxiv.org/abs/2507.07955</a></p>
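<p>A toy demonstration of the dog/man problem. It assumes single-head attention with identity Q/K/V projections, no causal mask, and no positional signal; the 2-d embeddings are invented for illustration.</p>

```python
import math

# Hypothetical 2-d toy embeddings, one per token type.
EMB = {
    "the": [0.1, 0.3],
    "dog": [1.0, 0.2],
    "bit": [0.4, 0.9],
    "man": [0.3, 1.1],
}


def attend(tokens):
    """Self-attention with identity projections and no positional signal."""
    xs = [EMB[t] for t in tokens]
    outs = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) for k in xs]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        outs.append(
            [sum(w * v[d] for w, v in zip(weights, xs)) / norm for d in range(2)]
        )
    return outs


a = "the dog bit the man".split()
b = "the man bit the dog".split()
out_a = dict(zip(a, attend(a)))
out_b = dict(zip(b, attend(b)))

# Every token gets the same representation in both sentences.
same = all(
    math.isclose(x, y, abs_tol=1e-9)
    for tok in EMB
    for x, y in zip(out_a[tok], out_b[tok])
)
print(same)
```

<p>Since each token's output is identical across the two orderings, nothing downstream can tell who bit whom. A causal mask or a positional encoding breaks that symmetry.</p>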
]]></description><pubDate>Tue, 29 Jul 2025 14:24:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=44723770</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=44723770</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44723770</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "LLM Embeddings Explained: A Visual and Intuitive Guide"]]></title><description><![CDATA[
<p>Very much agree re: inscrutability. It gets even more complicated when you add the LLM-specific concept of rotary positional embeddings to the mix. In my experience, it's been exceptionally hard to communicate that concept to even technical folks that may understand (at a high level) the concept of semantic similarity via something like cosine distance.<p>I've come up with so many failed analogies at this point, I've lost count (the concept of fast and slow clocks to represent the positional index / angular rotation has been the closest I've come so far).</p>
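<p>The clock analogy in code form, as a minimal sketch of the standard RoPE rotation rather than any particular model's implementation. Each pair of dimensions is a clock hand; early pairs tick fast and later pairs tick slowly. The vectors and the base of 10000 are illustrative choices.</p>

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by pos * theta, theta = base ** (-i/d).

    i is the even index of the pair, so the first pair is the fastest clock.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def score(q, k, q_pos, k_pos):
    """Attention logit between a rotated query and a rotated key."""
    return sum(a * b for a, b in zip(rope(q, q_pos), rope(k, k_pos)))


q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.4]

# Same relative offset (2 positions apart) at different absolute positions.
near_start = score(q, k, 3, 1)
further_on = score(q, k, 10, 8)
# A different offset (1 position apart) gives a different logit.
other_offset = score(q, k, 3, 2)

print(math.isclose(near_start, further_on, abs_tol=1e-9), near_start, other_offset)
```

<p>The logit depends only on how far apart the two positions are, which is the property the clock analogy is trying to convey: comparing two clock readings tells you the elapsed time, not the time of day.</p>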
]]></description><pubDate>Mon, 28 Jul 2025 17:16:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=44712975</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=44712975</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44712975</guid></item><item><title><![CDATA[New comment by spmurrayzzz in "Show HN: RULER – Easily apply RL to any agent"]]></title><description><![CDATA[
<p>Might end up being some confusion with the RULER benchmark from NVIDIA given the (somewhat shared) domain: <a href="https://github.com/NVIDIA/RULER">https://github.com/NVIDIA/RULER</a><p>EDIT: by shared I only mean the adjacency to LLMs/AI/ML, RL is a pretty big differentiator though and project looks great</p>
]]></description><pubDate>Fri, 11 Jul 2025 21:49:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=44537103</link><dc:creator>spmurrayzzz</dc:creator><comments>https://news.ycombinator.com/item?id=44537103</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44537103</guid></item></channel></rss>