<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: gaeld</title><link>https://news.ycombinator.com/user?id=gaeld</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 15 Jun 2026 13:30:45 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=gaeld" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>True, and for third-party models we'll just re-use their public open weights.<p>There is a time-consuming part, though, that is performed manually by our (human) team: implementing the logic of the model in C++ and assembly code in a super-optimized way, co-designed for each specific hardware card.<p>This can take months.<p>We hope to accelerate the process with AI agents, but we're not there yet.</p>
]]></description><pubDate>Fri, 29 May 2026 21:24:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=48329489</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48329489</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48329489</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>It also matters for thinking models and for agentic workflows, especially in software engineering, where a lot of tokens need to be output in iterative loops before the user sees any result.<p>This is our main use case.</p>
]]></description><pubDate>Fri, 29 May 2026 19:55:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=48328363</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48328363</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48328363</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>In theory yes, although not in a linearly proportional way, because in practice our memory streaming is not yet perfect. There are still some fixed costs that we did not fully optimize (for now).</p>
]]></description><pubDate>Fri, 29 May 2026 18:55:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=48327687</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48327687</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48327687</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>I'm sure there are, and I really hope we can work on consumer-grade GPUs at some point.<p>It should be possible to apply the same methodology (digging deep into the hardware details to understand all its little characteristics, and rethinking the inference stack around that).</p>
]]></description><pubDate>Fri, 29 May 2026 18:13:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=48327060</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48327060</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48327060</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>Note that this coding model is trained on programming use cases and is not tuned for multi-turn chat.<p>You can ask it to implement an algorithm; we provide suggested prompts you can test.<p>Also, this tech preview is really about the speed of the inference engine (not the model itself), so I'm glad you got 3.4k tok/s!</p>
]]></description><pubDate>Fri, 29 May 2026 18:10:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=48327030</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48327030</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48327030</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>Why not, it's one way to look at it!
Although I have yet to see other work with speculative decoding go higher than ~1,000 tokens/s, because the other bottlenecks start to matter at that point, and they need to be solved to go further.<p>Our view is that MTP / speculative decoding could help get an X multiplier (X = 2 to 6) on the tokens-per-second speed we currently achieve.<p>We are a bit greedy: we want to stack optimizations on top of each other to get the maximum speed possible.<p>It involves additional compute to verify the predicted tokens during the forward pass (it's like a small batch), which should be totally doable for dense models, and will be trickier for MoEs because it could mean activating more experts and thus more active parameters.</p>
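<p>A back-of-the-envelope sketch of where such a multiplier could come from, using the standard simplification that each drafted token is accepted independently with probability alpha. The function name, the alpha values and the draft lengths are illustrative assumptions, not measurements from the Kog engine, and the extra verification compute is ignored, so the result is an upper bound on the speedup:</p>

```python
def expected_tokens_per_step(alpha: float, draft_len: int) -> float:
    """Expected number of tokens emitted per verification pass.

    Assumes each of the draft_len drafted tokens is accepted independently
    with probability alpha, and that the verifying forward pass always
    contributes one token itself. This is the geometric series
    1 + alpha + alpha**2 + ... + alpha**draft_len.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the verifier's own token.
        return float(draft_len + 1)
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates and draft lengths, for illustration only.
for alpha in (0.6, 0.8, 0.9):
    for draft_len in (2, 4, 8):
        tokens = expected_tokens_per_step(alpha, draft_len)
        print(f"alpha={alpha:.1f} draft_len={draft_len} -> {tokens:.2f} tokens per pass")
```

<p>With an assumed acceptance rate of 0.8 and four drafted tokens, this gives roughly 3.4 tokens per pass, which sits inside the 2 to 6 range before any verification cost is subtracted.</p>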
]]></description><pubDate>Fri, 29 May 2026 17:32:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=48326440</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48326440</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48326440</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>it's also a coding model</p>
]]></description><pubDate>Fri, 29 May 2026 16:06:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=48325062</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48325062</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48325062</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>thank you deflator, I understand this now! much appreciated</p>
]]></description><pubDate>Fri, 29 May 2026 15:15:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=48324176</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48324176</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48324176</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>Token generation speed matters for sequential agentic workflows, like software engineering / vibe coding, where a lot of reasoning tokens, code generation, refactoring, testing, etc. happen in a loop before an actual outcome is served to the user.<p>As for model performance, we plan to support the latest frontier models (this tech preview is about the speed of the engine).</p>
]]></description><pubDate>Fri, 29 May 2026 15:13:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=48324135</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48324135</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48324135</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>will do - we are a small team and it takes time to implement and optimize a new model, whatever the size.</p>
]]></description><pubDate>Fri, 29 May 2026 15:10:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=48324079</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48324079</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48324079</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>Totally, though DTP is not required for these kinds of speeds.
Standard TP also works.<p>DTP is something we built for our roadmap in order to get to extremely high speeds (like 10k+ tokens/s). When the budget is under 10 µs per layer, any little overhead matters.<p>For 1k to 5k tokens/s, regular TP still works because we are able to optimize the inter-GPU all-reduce collectives to under 3 µs, which allows us to continue streaming model weights into shared memory, registers and caches while GPUs exchange data.</p>
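<p>A rough sketch of the arithmetic behind that per-layer budget, assuming strictly sequential decoding of a single request and a hypothetical 32-layer model (the layer count and the function name are illustrative assumptions, not figures from the article):</p>

```python
def per_layer_budget_us(tokens_per_second: float, num_layers: int) -> float:
    """Time available per transformer layer, in microseconds, for one token.

    Assumes tokens are decoded one after another for a single request, so the
    whole forward pass must fit in 1 / tokens_per_second seconds.
    """
    if tokens_per_second <= 0 or num_layers <= 0:
        raise ValueError("tokens_per_second and num_layers must be positive")
    per_token_us = 1_000_000 / tokens_per_second
    return per_token_us / num_layers


NUM_LAYERS = 32  # hypothetical layer count, for illustration only

for tps in (1_000, 5_000, 10_000):
    budget = per_layer_budget_us(tps, NUM_LAYERS)
    print(f"{tps} tokens/s -> {budget:.3f} us per layer")
```

<p>Under these assumptions, 10k tokens/s leaves only a few microseconds per layer, so an all-reduce of around 3 µs would take up most of the budget unless it is hidden or delayed.</p>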
]]></description><pubDate>Fri, 29 May 2026 15:06:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=48324024</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48324024</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48324024</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>Thanks for the comment and the question!<p>The last section of the article lays out the scaling laws that apply when porting this approach to another model. In a nutshell, DeepSeek V4 Pro with 49B active params is close to the upper bound.<p>Also worth noting that our results are currently for standard <i>datacenter</i> GPUs. On consumer hardware, though the same low-level optimization approach applies, the bandwidth limitations will cap the achievable speed.</p>
]]></description><pubDate>Fri, 29 May 2026 13:38:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=48322964</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48322964</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48322964</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>Thanks a lot! Much appreciated.<p>To answer your questions:<p>- yes, we rewrite the whole model code (while keeping the same logic) in CUDA/HIP and assembly, in order to optimize by hand for each GPU type. It's quite tedious for sure, but I guess this is the price to pay to get this kind of result.<p>- the batching question is a great one. In agentic systems, there is probably a trade-off between sequential thinking/iterations and parallel exploration of multiple solutions. Also, there could just be multiple independent tasks running in parallel, depending on the use case.<p>We plan to support a small amount of batching, but it quickly becomes a trade-off vs speed. Pick one for your use case, I guess.<p>Also to consider: because we answer requests much faster, we are also able to process lots of them without needing large batch sizes - and scaling on multiple nodes is possible.<p>- open sourcing: maybe, maybe not. I'm still undecided on this. We are a small startup and I'm told that giving our IP away might be shooting ourselves in the foot. On the other hand, I think it could be of great benefit to the community and to us... we'll see</p>
]]></description><pubDate>Fri, 29 May 2026 12:53:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=48322460</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48322460</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48322460</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>Yeah, I agree: I'm actually not expecting it to be easy, and there will certainly be several unknown unknowns we'll discover along the way.<p>Our process has been, and will continue to be, a sequence of (tedious) R&D experiments where the GPU never behaves as expected when pushed to its limits in ways no one has really tested before (I still have nightmares about the L3 cache cross-IOD bottlenecks on MI300X).<p>IMHO, we did solve the multi-GPU memory bandwidth scaling problem, and thus the linear scaling of the size of the model towards infinity.
But the main difficulties will come from keeping the speed, with steady and continuous memory streaming, while implementing the much more complex architecture of modern frontier MoEs (attention compression tricks, hash layers, routing logic, etc.)</p>
]]></description><pubDate>Fri, 29 May 2026 12:11:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=48322115</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48322115</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48322115</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>Great points, let me clarify:<p>- model size: 2B is just for this preview (it was faster to implement); our article explains how we expect to support large frontier MoEs at 1,000 to 5,000 tokens/s<p>- reaching 500 tok/s, or even up to ~1,000 tok/s, on a consumer GPU card is possible with existing inference engines like vLLM. But there is a ceiling.<p>The hard part comes when you try to be faster than that: these frameworks won't scale higher just by adding GPUs or using faster GPUs. There is a "glass ceiling" due to microseconds lost everywhere in the stack (grid syncs, inter-GPU comms, kernel launches, CPU sampling, etc.).<p>All our work at Kog is about removing these bottlenecks.</p>
]]></description><pubDate>Fri, 29 May 2026 11:45:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=48321907</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48321907</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48321907</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>I updated the article title accordingly</p>
]]></description><pubDate>Fri, 29 May 2026 11:38:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=48321851</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48321851</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48321851</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>YES - I just updated the title of our article according to your suggestion.</p>
]]></description><pubDate>Fri, 29 May 2026 11:37:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=48321846</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48321846</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48321846</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>Follow-up reading for the most technical and research-oriented people here:<p>Monokernel deep dive (GPU Engineering): <a href="http://blog.kog.ai/building-a-single-kernel-latency-optimized-llm-inference-engine-on-amd-mi300x-gpus" rel="nofollow">http://blog.kog.ai/building-a-single-kernel-latency-optimize...</a><p>Delayed Tensor Parallelism (research): <a href="http://blog.kog.ai/delayed-tensor-parallelism-for-faster-transformer-inference" rel="nofollow">http://blog.kog.ai/delayed-tensor-parallelism-for-faster-tra...</a><p>To try the speed on the playground: <a href="http://playground.kog.ai" rel="nofollow">http://playground.kog.ai</a></p>
]]></description><pubDate>Fri, 29 May 2026 11:09:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=48321626</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48321626</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48321626</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>I guessed you were thinking of consumer GPUs. We are indeed talking about <i>standard</i> datacenter GPUs.<p>Sorry for the confusion</p>
]]></description><pubDate>Fri, 29 May 2026 11:05:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=48321594</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48321594</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48321594</guid></item><item><title><![CDATA[New comment by gaeld in "Real-time LLM Inference on Standard GPUs: 3k tokens/s per request"]]></title><description><![CDATA[
<p>I guessed you were thinking of consumer GPUs.
We are indeed talking about standard <i>datacenter</i> GPUs.</p>
]]></description><pubDate>Fri, 29 May 2026 11:04:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=48321583</link><dc:creator>gaeld</dc:creator><comments>https://news.ycombinator.com/item?id=48321583</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48321583</guid></item></channel></rss>