<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: zackangelo</title><link>https://news.ycombinator.com/user?id=zackangelo</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 18 Apr 2026 07:29:50 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=zackangelo" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by zackangelo in "Qwen3.6-35B-A3B: Agentic coding power, now open to all"]]></title><description><![CDATA[
<p>They are, but the IDE needs to be integrated with them.<p>Qwen specifically calls out FIM (“fill in the middle”) support on the model card, and you can see it getting confused and emitting the control tokens in the example here.</p>
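<p>For context, a FIM prompt wraps the code before and after the cursor in special control tokens. A minimal sketch (the token names follow the convention earlier Qwen coder models have used; the exact strings are on the model card):</p>

```python
# Sketch of assembling a fill-in-the-middle (FIM) prompt.
# The control-token names here are assumptions based on earlier Qwen
# coder models; always check the model card for the exact strings.

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code before/after the cursor in FIM control tokens.

    The model generates the missing middle. An IDE that is not
    FIM-aware will see these tokens leak into the completion,
    which is the confusion visible in the linked example.
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))")
```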
]]></description><pubDate>Thu, 16 Apr 2026 15:00:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=47794134</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=47794134</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47794134</guid></item><item><title><![CDATA[New comment by zackangelo in "Qwen3.6-35B-A3B: Agentic coding power, now open to all"]]></title><description><![CDATA[
<p>17B per token. So when you’re generating a single stream of text (“decoding”), 17B parameters are active.<p>If you’re decoding multiple streams, it will be 17B per stream (some tokens will use the same experts, so there is some overlap).<p>When the model is ingesting the prompt (“prefilling”), it’s looking at many tokens at once, so the number of active parameters will be larger.</p>
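<p>A back-of-the-envelope sketch of the overlap effect (every number below is an illustrative assumption, not the real model’s config):</p>

```python
# Back-of-the-envelope sketch of MoE active parameters. All numbers
# are illustrative assumptions, not any real model's configuration.

def active_params(dense_params, expert_params, n_experts, top_k, n_streams):
    """Active parameters while decoding n_streams concurrent streams.

    Each stream activates the shared (dense) weights plus its top_k
    experts; streams routing to the same expert share it, so the total
    is capped by the full expert pool. (Real models route per layer;
    this collapses that into one pool to keep the sketch short.)
    """
    experts_hit = min(top_k * n_streams, n_experts)
    return dense_params + experts_hit * expert_params

one = active_params(dense_params=2e9, expert_params=0.5e9,
                    n_experts=64, top_k=4, n_streams=1)    # single stream
many = active_params(dense_params=2e9, expert_params=0.5e9,
                     n_experts=64, top_k=4, n_streams=64)  # saturates the pool
```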
]]></description><pubDate>Thu, 16 Apr 2026 14:57:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=47794079</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=47794079</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47794079</guid></item><item><title><![CDATA[New comment by zackangelo in "GPU memory snapshots: sub-second startup (2025)"]]></title><description><![CDATA[
<p>This uses Nvidia’s CUDA snapshot API under the hood, but you have to pair it with a host-side snapshot as well. Modal uses gVisor for this, which is notorious for its high overhead.<p>Does anyone know of a more efficient alternative if you’re running a trusted container?</p>
]]></description><pubDate>Sat, 10 Jan 2026 23:56:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=46571234</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=46571234</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46571234</guid></item><item><title><![CDATA[New comment by zackangelo in "macOS 26.2 enables fast AI clusters with RDMA over Thunderbolt"]]></title><description><![CDATA[
<p>You’re right, I misunderstood.<p>I’m not sure it would be of much utility, because this would presumably be for tensor-parallel workloads. In that case you want the ranks in your cluster to be uniform, or else everything will be forced to run at the speed of the slowest rank.<p>You could run pipeline parallel, but I’m not sure it’d be that much better than what we already have.</p>
]]></description><pubDate>Fri, 12 Dec 2025 23:26:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=46250310</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=46250310</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46250310</guid></item><item><title><![CDATA[New comment by zackangelo in "macOS 26.2 enables fast AI clusters with RDMA over Thunderbolt"]]></title><description><![CDATA[
<p>Sparks are built for this and actually have ConnectX-7 NICs built in! You just need to get the transceivers for them. This means you can natively cluster them at 200Gbps.</p>
]]></description><pubDate>Fri, 12 Dec 2025 23:05:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=46250135</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=46250135</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46250135</guid></item><item><title><![CDATA[New comment by zackangelo in "macOS 26.2 enables fast AI clusters with RDMA over Thunderbolt"]]></title><description><![CDATA[
<p>No, you use tensor parallelism in both cases.<p>The way it typically works in an attention block: smaller portions of the Q, K, and V linear layers are assigned to each node and processed independently. Attention, RoPE, norms, etc. are run on the node-specific output of that. Then, when the output linear layer is applied, an "all-reduce" is computed, which combines the output of all the nodes.<p>EDIT: just realized it wasn’t clear -- this means that each node ends up holding the portion of the KV cache specific to its KV tensor shards. This can change based on the specific style of attention (e.g., in GQA, where there are fewer KV heads than ranks, you end up having to do some replication).</p>
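<p>A toy numpy sketch of the split (sizes are illustrative, and attention itself is elided): the QKV projections are split by output columns, the output projection by input rows, and the all-reduce sums the partials:</p>

```python
import numpy as np

# Toy tensor-parallelism sketch across 2 "nodes" for one projection pair.
# Each node holds a column slice of the Q projection and the matching
# row slice of the output projection; an all-reduce sums the partials.
rng = np.random.default_rng(0)
d = 8                        # model dim (toy size)
x = rng.normal(size=(3, d))  # 3 tokens

Wq = rng.normal(size=(d, d))  # stand-in for the Q/K/V projections
Wo = rng.normal(size=(d, d))  # output projection

# Single-device reference (attention/RoPE/norm omitted for brevity).
ref = (x @ Wq) @ Wo

# Shard: node i holds columns of Wq and the matching rows of Wo.
Wq0, Wq1 = Wq[:, :d // 2], Wq[:, d // 2:]
Wo0, Wo1 = Wo[:d // 2, :], Wo[d // 2:, :]

partial0 = (x @ Wq0) @ Wo0  # runs independently on node 0
partial1 = (x @ Wq1) @ Wo1  # runs independently on node 1
out = partial0 + partial1   # the "all-reduce" step

assert np.allclose(ref, out)  # sharded result matches the reference
```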
]]></description><pubDate>Fri, 12 Dec 2025 23:00:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=46250099</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=46250099</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46250099</guid></item><item><title><![CDATA[New comment by zackangelo in "Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning model"]]></title><description><![CDATA[
<p>What 1T parameter base model have you seen from any of those labs?</p>
]]></description><pubDate>Fri, 07 Nov 2025 03:41:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=45843360</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=45843360</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45843360</guid></item><item><title><![CDATA[New comment by zackangelo in "NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference"]]></title><description><![CDATA[
<p>Wouldn't you be able to test NCCL if you had two of these?</p>
]]></description><pubDate>Wed, 15 Oct 2025 03:39:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=45587856</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=45587856</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45587856</guid></item><item><title><![CDATA[New comment by zackangelo in "Launch HN: LlamaFarm (YC W22) – Open-source framework for distributed AI"]]></title><description><![CDATA[
<p>Just a bit of feedback:<p>> Instead of one brittle giant, we orchestrate a Mixture of Experts…<p>“Mixture of experts” is a specific term of art that describes an architectural detail of a type of transformer model. It definitely does not mean using smaller specialized models for individual tasks: experts in an MoE model are routed to on a per-token basis, not on a per-task or per-generation basis.<p>I know it’s tempting to co-opt the term because it would fit nicely with what you’re trying to do, but it just adds confusion.</p>
]]></description><pubDate>Wed, 08 Oct 2025 16:48:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=45518142</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=45518142</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45518142</guid></item><item><title><![CDATA[New comment by zackangelo in "Apps SDK"]]></title><description><![CDATA[
<p>Because it depends on how much better “best” is. If it’s only incrementally better than open-source models that have other advantages, why would you bother?<p>OpenAI’s moat will only come from the products they build on top. Theoretically their products will be better because they’ll be more vertically integrated with the underlying models. It’s not unlike Apple’s playbook with regard to hardware and software integration.</p>
]]></description><pubDate>Mon, 06 Oct 2025 20:36:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=45496025</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=45496025</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45496025</guid></item><item><title><![CDATA[New comment by zackangelo in "From multi-head to latent attention: The evolution of attention mechanisms"]]></title><description><![CDATA[
<p>Not quite a frontier model but definitely built by a frontier lab: Grok 2 was recently open sourced and I believe it uses a fairly standard MHA architecture with MoE.</p>
]]></description><pubDate>Sat, 30 Aug 2025 17:23:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=45076391</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=45076391</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45076391</guid></item><item><title><![CDATA[New comment by zackangelo in "Mosh Mobile Shell"]]></title><description><![CDATA[
<p>I feel a bit silly for not noticing this before. Over the last year or so I’ve often wondered when ssh added protocol-level support for session resume. I’d open my laptop on a new network and everything would be ready to go. But of course, it has nothing to do with ssh; it’s just that I started using Tailscale.</p>
]]></description><pubDate>Thu, 28 Aug 2025 17:08:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=45054536</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=45054536</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45054536</guid></item><item><title><![CDATA[New comment by zackangelo in "Writing Speed-of-Light Flash Attention for 5090 in CUDA C++"]]></title><description><![CDATA[
<p>Curious what issues you were having. The kernel should compile natively if you pass nvcc the correct arch flags, although it probably won't take advantage of any new hardware features.</p>
]]></description><pubDate>Sat, 23 Aug 2025 17:41:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=44997661</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=44997661</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44997661</guid></item><item><title><![CDATA[New comment by zackangelo in "Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs"]]></title><description><![CDATA[
<p>GPT-OSS will run even faster on Blackwell chips because of their native hardware support for fp4.<p>If anyone is working on training or inference in Rust, I'm currently working on adding fp8 and fp4 support to cudarc[0] and candle[1]. This is being done so I can support these models in our inference engine for Mixlayer[2].<p>[0] <a href="https://github.com/coreylowman/cudarc/pull/449" rel="nofollow">https://github.com/coreylowman/cudarc/pull/449</a>
[1] <a href="https://github.com/huggingface/candle/pull/2989" rel="nofollow">https://github.com/huggingface/candle/pull/2989</a>
[2] <a href="https://mixlayer.com" rel="nofollow">https://mixlayer.com</a></p>
]]></description><pubDate>Thu, 07 Aug 2025 14:10:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=44824676</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=44824676</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44824676</guid></item><item><title><![CDATA[New comment by zackangelo in "SQLx – Rust SQL Toolkit"]]></title><description><![CDATA[
<p>Is something like SeaQuery[0] what you're talking about?<p>[0] <a href="https://github.com/SeaQL/sea-query/">https://github.com/SeaQL/sea-query/</a></p>
]]></description><pubDate>Tue, 29 Jul 2025 06:01:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=44719575</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=44719575</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44719575</guid></item><item><title><![CDATA[New comment by zackangelo in "Qwen3-Coder: Agentic coding in the world"]]></title><description><![CDATA[
<p>Draft model doesn’t degrade quality!</p>
]]></description><pubDate>Wed, 23 Jul 2025 05:56:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=44656086</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=44656086</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44656086</guid></item><item><title><![CDATA[New comment by zackangelo in "Show HN: We made our own inference engine for Apple Silicon"]]></title><description><![CDATA[
<p>We also wrote our inference engine in Rust for Mixlayer; happy to answer any questions from those trying to do the same.<p>Looks like this one uses ndarray and MPSGraph (which I did not know about!); we opted to use candle instead.</p>
]]></description><pubDate>Tue, 15 Jul 2025 21:07:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=44575833</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=44575833</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44575833</guid></item><item><title><![CDATA[New comment by zackangelo in "Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model"]]></title><description><![CDATA[
<p>Typically a combination of expert-level parallelism and tensor-level parallelism is used.<p>The big MLP tensors would be split across GPUs in a cluster. Then, for the MoE parts, you would spread the experts across the GPUs and route to them based on which experts are active (there will likely be more than one if the batch size is > 1).</p>
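<p>A toy sketch of the expert-parallel routing step (placement scheme and all numbers are illustrative, not from any real deployment):</p>

```python
# Toy expert-parallelism sketch: experts are spread across GPUs and each
# token's (token, expert) pairs are grouped by the GPU owning the expert.
# The round-robin placement and all sizes here are illustrative only.

n_experts, n_gpus = 8, 4
expert_to_gpu = {e: e % n_gpus for e in range(n_experts)}  # round-robin placement

def route(token_topk_experts):
    """Group a batch's (token, expert) pairs by the GPU that owns the expert."""
    per_gpu = {g: [] for g in range(n_gpus)}
    for tok, experts in enumerate(token_topk_experts):
        for e in experts:
            per_gpu[expert_to_gpu[e]].append((tok, e))
    return per_gpu

# Batch of 3 tokens, each sent to its top-2 experts by the gating network.
plan = route([(0, 5), (1, 2), (0, 3)])
```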
]]></description><pubDate>Fri, 11 Jul 2025 17:45:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=44535051</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=44535051</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44535051</guid></item><item><title><![CDATA[New comment by zackangelo in "Launch HN: Morph (YC S23) – Apply AI code edits at 4,500 tokens/sec"]]></title><description><![CDATA[
<p>For anyone more curious about how this works, Fireworks wrote a blog post about it last year (I think):<p><a href="https://fireworks.ai/blog/cursor" rel="nofollow">https://fireworks.ai/blog/cursor</a></p>
]]></description><pubDate>Mon, 07 Jul 2025 16:32:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=44492033</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=44492033</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44492033</guid></item><item><title><![CDATA[New comment by zackangelo in "Life of an inference request (vLLM V1): How LLMs are served efficiently at scale"]]></title><description><![CDATA[
<p>In your forward-pass section you give a lot of emphasis to FlashAttention, but it might be worth mentioning PagedAttention as well (that was the paper written by the vLLM authors, and I believe it was the genesis of the project). PA-style block tables are now supported in most fused attention kernels, but vLLM originally came up with them, and they’re the main reason vLLM has such high throughput!</p>
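<p>The core idea fits in a few lines. A minimal sketch (block size and allocation policy are illustrative, not vLLM’s actual implementation): the KV cache is carved into fixed-size blocks, and each sequence keeps a table mapping logical block indices to physical blocks, so it can grow without reserving a contiguous max-length buffer up front:</p>

```python
# Minimal sketch of a PagedAttention-style block table. The block size
# and allocation policy are illustrative, not vLLM's real implementation.

BLOCK_SIZE = 16

class BlockTable:
    def __init__(self, free_blocks):
        self.free = list(free_blocks)  # pool of physical block ids
        self.table = []                # logical index -> physical block

    def slot_for(self, token_pos):
        """Physical (block, offset) for a token position, allocating on demand."""
        logical = token_pos // BLOCK_SIZE
        while logical >= len(self.table):
            self.table.append(self.free.pop())  # grab any free block
        return self.table[logical], token_pos % BLOCK_SIZE

seq = BlockTable(free_blocks=range(100))
block, off = seq.slot_for(0)     # first token allocates a block
block2, off2 = seq.slot_for(17)  # position 17 lands in a second block
```

Because blocks are allocated lazily and can live anywhere in GPU memory, many more sequences fit in the same cache, which is where the throughput win comes from.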
]]></description><pubDate>Sun, 29 Jun 2025 16:39:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=44414390</link><dc:creator>zackangelo</dc:creator><comments>https://news.ycombinator.com/item?id=44414390</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44414390</guid></item></channel></rss>