<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: germanjoey</title><link>https://news.ycombinator.com/user?id=germanjoey</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 29 Apr 2026 09:33:03 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=germanjoey" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by germanjoey in "We Made CUDA Optimization Suck Less"]]></title><description><![CDATA[
<p>TBH, the 2x-4x improvement over a naive implementation that they're bragging about sounded kinda pathetic to me! It depends greatly on the kernel itself and the target arch, and I'm also assuming that the 2x-4x number is their best-case scenario, whereas the best case for a hand-optimized kernel can be in the tens or even hundreds of X.</p>
]]></description><pubDate>Wed, 14 May 2025 22:17:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=43989756</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=43989756</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43989756</guid></item><item><title><![CDATA[New comment by germanjoey in "The Missing Nvidia GPU Glossary"]]></title><description><![CDATA[
<p>This is really incredible, thank you!</p>
]]></description><pubDate>Tue, 14 Jan 2025 22:05:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=42704562</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=42704562</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42704562</guid></item><item><title><![CDATA[New comment by germanjoey in "Trillium TPU Is GA"]]></title><description><![CDATA[
<p>Sambanova's RDU is a dataflow processor being used for ML/AI workloads! It's amazing and actually works.</p>
]]></description><pubDate>Fri, 13 Dec 2024 10:22:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=42407405</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=42407405</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42407405</guid></item><item><title><![CDATA[New comment by germanjoey in "Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference"]]></title><description><![CDATA[
<p>Pretty amazing speed, especially considering this is bf16. But how many racks is this using? They used 4 racks for 70B, so scaling by parameter count (405/70 &#8776; 5.8x), this is, what, at least 24? A whole data center for one model?!</p>
]]></description><pubDate>Tue, 19 Nov 2024 03:07:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=42179744</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=42179744</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42179744</guid></item><item><title><![CDATA[New comment by germanjoey in "Cerebras Trains Llama Models to Leap over GPUs"]]></title><description><![CDATA[
<p>the title says "Cerebras Trains Llama Models"...</p>
]]></description><pubDate>Thu, 31 Oct 2024 03:29:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=42003207</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=42003207</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42003207</guid></item><item><title><![CDATA[New comment by germanjoey in "Cerebras Inference now 3x faster: Llama3.1-70B breaks 2,100 tokens/s"]]></title><description><![CDATA[
<p>They said in the announcement that they've implemented speculative decoding, so that might have a lot to do with it.<p>A big question is what they're using as their draft model; there are ways to do it losslessly, but they could also choose to trade off accuracy for a bigger increase in speed.<p>It seems they also support only a very short sequence length (1k tokens).</p>
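<p>For the unfamiliar, here's a minimal greedy-decoding sketch of why speculative decoding can be lossless. The callables target_next/draft_next are hypothetical stand-ins, not any vendor's API; a real implementation verifies all k draft tokens in a single batched forward pass, and the sampling variant preserves losslessness via rejection sampling over the two models' distributions.</p>
<pre><code>def speculative_step(target_next, draft_next, tokens, k=4):
    # Hypothetical callables: model(seq) -> greedy next token after seq.
    # Step 1: the cheap draft model proposes k tokens autoregressively.
    proposed, seq = [], list(tokens)
    for _ in range(k):
        t = draft_next(seq)
        proposed.append(t)
        seq.append(t)

    # Step 2: the target model checks each proposal. In a real system
    # this is ONE batched forward pass over all k positions, not k
    # serial calls -- that's where the speedup comes from.
    out = list(tokens)
    for t in proposed:
        want = target_next(out)
        if want != t:             # first disagreement:
            out.append(want)      # keep the target's token and stop
            return out
        out.append(t)             # proposal accepted "for free"
    out.append(target_next(out))  # bonus token from the verify pass
    return out
    # Output is identical to running target_next alone -- lossless.
</code></pre>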
]]></description><pubDate>Fri, 25 Oct 2024 06:25:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=41942732</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=41942732</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41942732</guid></item><item><title><![CDATA[New comment by germanjoey in "Civilization VII recommends 16 cores and 32GB RAM for 4K gameplay"]]></title><description><![CDATA[
<p>Simply increasing processing power for the AI isn't enough. Gameplay mechanics are intimately related to the capabilities of the AI.<p>For example, when they redesigned combat around the 1-Unit-Per-Tile (1UPT) mechanic for Civ 5, it crippled the AI's ability to wage war. Even if a high-difficulty AI could out-produce the player militarily, 1UPT left it logistics-limited in getting those units to the front. That means the AI can't threaten a player militarily, and thus loses its main lever for being "difficult."<p>Contrast this with Civ 4, where high-difficulty AIs were capable of completely overwhelming a player who didn't take them seriously. You couldn't just sit there teching up and use a small number of advanced units to fend off an invasion from a much larger and more aggressive neighbor. This was especially the case if you played against advanced fan-created AIs.<p>I'm hoping they get rid of 1UPT completely for Civ 7, but I have a feeling that's unlikely, because casual players (the majority of Civ buyers) actually like that 1UPT effectively removes tactical combat from the game.</p>
]]></description><pubDate>Fri, 04 Oct 2024 21:45:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=41745848</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=41745848</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41745848</guid></item><item><title><![CDATA[New comment by germanjoey in "We fine-tuned Llama 405B on AMD GPUs"]]></title><description><![CDATA[
<p>How are you verifying accuracy for your JAX port of Llama 3.1?<p>IMHO, the main reason to use pytorch is actually that the original model used pytorch. What can seem to be identical logic between different model versions may actually cause model drift when infinitesimal floating point errors accumulate due to the huge scale of the data. My experience is that debugging an accuracy mismatch like this in a big model is a torturous ordeal beyond the 10th circle of hell.</p>
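<p>A rough sketch of the usual first step, assuming both frameworks can dump per-layer activations to numpy for the same input (the layer-name dicts here are hypothetical, not anyone's actual tooling):</p>
<pre><code>import numpy as np

def compare_activations(torch_acts, jax_acts, rtol=1e-3, atol=1e-5):
    # torch_acts / jax_acts: dict of layer name -> numpy array, dumped
    # from the reference model and the port on the SAME input.
    for name, a in torch_acts.items():
        b = jax_acts[name]
        if a.shape != b.shape:
            print(f"{name}: shape mismatch {a.shape} vs {b.shape}")
            continue
        diff = np.abs(a - b)
        # Relative error guards against per-layer scale differences.
        rel = diff / (np.abs(a) + 1e-12)
        ok = np.allclose(a, b, rtol=rtol, atol=atol)
        print(f"{name}: max_abs={diff.max():.3e} "
              f"max_rel={rel.max():.3e} {'OK' if ok else 'MISMATCH'}")
</code></pre>
<p>Diffing layer by layer like this at least localizes where drift first appears, which matters because downstream layers amplify it.</p>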
]]></description><pubDate>Mon, 23 Sep 2024 23:30:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=41631662</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=41631662</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41631662</guid></item><item><title><![CDATA[New comment by germanjoey in "A post by Guido van Rossum removed for violating Python community guidelines"]]></title><description><![CDATA[
<p>Looks like some kind of power play...<p>Originally discussed here: <a href="https://news.ycombinator.com/item?id=41234180">https://news.ycombinator.com/item?id=41234180</a></p>
]]></description><pubDate>Thu, 29 Aug 2024 00:20:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=41385859</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=41385859</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41385859</guid></item><item><title><![CDATA[Sambanova breaks 1000 tokens/SEC on LLama3 8B]]></title><description><![CDATA[
<p>Article URL: <a href="https://twitter.com/ArtificialAnlys/status/1795480857404448953">https://twitter.com/ArtificialAnlys/status/1795480857404448953</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=40506135">https://news.ycombinator.com/item?id=40506135</a></p>
<p>Points: 7</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 28 May 2024 22:07:43 +0000</pubDate><link>https://twitter.com/ArtificialAnlys/status/1795480857404448953</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=40506135</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40506135</guid></item><item><title><![CDATA[New comment by germanjoey in "Model Explorer: intuitive and hierarchical visualization of model graphs"]]></title><description><![CDATA[
<p>Is there a demo of a model visualized using this somewhere? Even if it's just a short video... it's hard to tell what it's like from screenshots.</p>
]]></description><pubDate>Tue, 14 May 2024 21:25:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=40360334</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=40360334</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40360334</guid></item><item><title><![CDATA[New comment by germanjoey in "Groq CEO: 'We No Longer Sell Hardware'"]]></title><description><![CDATA[
<p>Cost-effective in what sense? Groq achieves low latency, not high efficiency, and not in a cost-effective way. Compare SambaNova achieving the same performance with 8 chips instead of 568, and at higher precision.</p>
]]></description><pubDate>Mon, 08 Apr 2024 00:18:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=39965094</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=39965094</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39965094</guid></item><item><title><![CDATA[New comment by germanjoey in "Try SambaNova chat: 1T param LLM, 500 tokens/SEC"]]></title><description><![CDATA[
<p>We're showing off our 1.05T param Composition of Experts LLM! It's 150 experts running on 1 node consisting of 8 SN40L RDU chips.<p>Each of our nodes has a huge amount of DDR attached, in addition to copious amounts of on-chip HBM and SRAM. This allows the system to switch between a variety of different models of different sizes and architectures at lightning speed. A highlight is one based on Llama2 7b, similar to the Groq demo, but executing with bf16/fp32 instead of int8. (And using only 8 chips instead of 568!)</p>
]]></description><pubDate>Fri, 29 Mar 2024 16:50:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=39866205</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=39866205</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39866205</guid></item><item><title><![CDATA[Try SambaNova chat: 1T param LLM, 500 tokens/SEC]]></title><description><![CDATA[
<p>Article URL: <a href="https://coe-1.cloud.snova.ai/">https://coe-1.cloud.snova.ai/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=39866204">https://news.ycombinator.com/item?id=39866204</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Fri, 29 Mar 2024 16:50:10 +0000</pubDate><link>https://coe-1.cloud.snova.ai/</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=39866204</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39866204</guid></item><item><title><![CDATA[New comment by germanjoey in "A hacker's guide to language models [video]"]]></title><description><![CDATA[
<p>Sambanova just launched something similar to what you're describing. It's a demo of their new chip running a 1T param MoE model: 150 7B Llama2s, each retrained to be an expert in a different topic. So one of them is a "law" expert, another a "physics" expert, etc.<p>They've got a video here [1] (scroll down slightly) that compares it against a 180B Falcon model running on GPUs on HuggingFace. The MoE results are not only just as good quality-wise, but also ridiculously fast. Like, nearly instant. A big benefit is that the experts can be swapped out and retrained with new data, which is obviously not as easy with the more monolithic 180B model.<p>[1] <a href="https://sambanova.ai/launch2023" rel="nofollow noreferrer">https://sambanova.ai/launch2023</a></p>
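<p>A toy sketch of the dispatch idea, with a keyword router standing in for what is presumably a learned classifier (all names here are illustrative, not SambaNova's API). Only one small expert runs per query, which is why individual experts are cheap to swap out and retrain:</p>
<pre><code># Toy Composition-of-Experts dispatch. EXPERT_KEYWORDS and the
# keyword router are illustrative stand-ins for a learned classifier.
EXPERT_KEYWORDS = {
    "law":     ["contract", "liability", "statute"],
    "physics": ["quantum", "momentum", "entropy"],
}

def route(prompt):
    text = prompt.lower()
    for topic, keywords in EXPERT_KEYWORDS.items():
        if any(k in text for k in keywords):
            return topic
    return "general"                     # fallback expert

def answer(prompt, experts):
    # experts: topic -> generate(prompt) callable; only the routed
    # (7B-sized) expert is ever invoked for a given query.
    return experts[route(prompt)](prompt)

# Usage with stub experts:
experts = {
    "law":     lambda p: "[law expert] ...",
    "physics": lambda p: "[physics expert] ...",
    "general": lambda p: "[general expert] ...",
}
print(answer("What is quantum entanglement?", experts))
</code></pre>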
]]></description><pubDate>Mon, 25 Sep 2023 01:07:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=37638630</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=37638630</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37638630</guid></item><item><title><![CDATA[SambaNova launches new SN40L chip; demo of 1T param CoE LLM]]></title><description><![CDATA[
<p>Article URL: <a href="https://sambanova.ai/launch2023/">https://sambanova.ai/launch2023/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=37590334">https://news.ycombinator.com/item?id=37590334</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 20 Sep 2023 21:20:02 +0000</pubDate><link>https://sambanova.ai/launch2023/</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=37590334</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37590334</guid></item><item><title><![CDATA[New comment by germanjoey in "GPT-4"]]></title><description><![CDATA[
<p>welp,<p>This report focuses on the capabilities, limitations, and safety properties of GPT-4. GPT-4 is a Transformer-style model [33] pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF) [34]. Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.</p>
]]></description><pubDate>Tue, 14 Mar 2023 19:22:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=35157016</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=35157016</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35157016</guid></item><item><title><![CDATA[New comment by germanjoey in "GPT-4"]]></title><description><![CDATA[
<p>How big is this model? (i.e., how many parameters?) I can't find this anywhere.</p>
]]></description><pubDate>Tue, 14 Mar 2023 18:54:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=35156664</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=35156664</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35156664</guid></item><item><title><![CDATA[New comment by germanjoey in "The maze is in the mouse: what ails Google"]]></title><description><![CDATA[
<p>I worked with the author for a couple of years, pre- and post-acquisition, and I have to admit that he drove me somewhat crazy sometimes too. Leaving that aside, I also had an immense amount of personal respect for him, because I could see how genuinely he cares about what he's doing: actively doing his best to do right by his customers. I think the author is 110% spot-on with his critique of Google here.</p>
]]></description><pubDate>Wed, 15 Feb 2023 17:21:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=34807244</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=34807244</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=34807244</guid></item><item><title><![CDATA[New comment by germanjoey in "Amazon to Lay Off over 17,000 Workers, More Than First Planned"]]></title><description><![CDATA[
<p>What's the new performance process?</p>
]]></description><pubDate>Thu, 05 Jan 2023 01:09:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=34254510</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=34254510</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=34254510</guid></item></channel></rss>