<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: germanjoey</title><link>https://news.ycombinator.com/user?id=germanjoey</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 29 Apr 2026 09:33:03 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=germanjoey" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by germanjoey in "We Made CUDA Optimization Suck Less"]]></title><description><![CDATA[
<p>TBH, the 2x-4x improvement over a naive implementation that they're bragging about sounded kinda pathetic to me! It depends greatly on the kernel itself and the target arch, and I'm also assuming that the 2x-4x number is their best-case scenario, whereas the best case for a hand-optimized kernel can be in the tens or even hundreds of X.</p>
]]></description><pubDate>Wed, 14 May 2025 22:17:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=43989756</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=43989756</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43989756</guid></item><item><title><![CDATA[New comment by germanjoey in "The Missing Nvidia GPU Glossary"]]></title><description><![CDATA[
<p>This is really incredible, thank you!</p>
]]></description><pubDate>Tue, 14 Jan 2025 22:05:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=42704562</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=42704562</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42704562</guid></item><item><title><![CDATA[New comment by germanjoey in "Trillium TPU Is GA"]]></title><description><![CDATA[
<p>Sambanova's RDU is a dataflow processor being used for ML/AI workloads! It's amazing and actually works.</p>
]]></description><pubDate>Fri, 13 Dec 2024 10:22:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=42407405</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=42407405</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42407405</guid></item><item><title><![CDATA[New comment by germanjoey in "Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference"]]></title><description><![CDATA[
<p>Pretty amazing speed, especially considering this is bf16. But how many racks is this using? They used 4 racks for 70B, so scaling by parameter count (405/70 &#8776; 5.8x), this is, what, at least 24? A whole data center for one model?!</p>
]]></description><pubDate>Tue, 19 Nov 2024 03:07:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=42179744</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=42179744</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42179744</guid></item><item><title><![CDATA[New comment by germanjoey in "Cerebras Trains Llama Models to Leap over GPUs"]]></title><description><![CDATA[
<p>the title says "Cerebras Trains Llama Models"...</p>
]]></description><pubDate>Thu, 31 Oct 2024 03:29:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=42003207</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=42003207</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42003207</guid></item><item><title><![CDATA[New comment by germanjoey in "Cerebras Inference now 3x faster: Llama3.1-70B breaks 2,100 tokens/s"]]></title><description><![CDATA[
<p>They said in the announcement that they've implemented speculative decoding, so that might have a lot to do with it.<p>A big question is what they're using as their draft model; there are ways to do it losslessly, but they could also choose to trade off accuracy for a bigger increase in speed.<p>It seems they also support only a very short sequence length (1k tokens).</p>
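<p>For the unfamiliar, here's a minimal greedy-decoding sketch of why speculative decoding can be lossless. The callables target_next/draft_next are hypothetical stand-ins, not any vendor's API; a real implementation verifies all k draft tokens in a single batched forward pass, and the sampling variant preserves losslessness via rejection sampling over the two models' distributions.</p>
<pre><code>def speculative_step(target_next, draft_next, tokens, k=4):
    # Hypothetical callables: model(seq) -> greedy next token after seq.
    # Step 1: the cheap draft model proposes k tokens autoregressively.
    proposed, seq = [], list(tokens)
    for _ in range(k):
        t = draft_next(seq)
        proposed.append(t)
        seq.append(t)

    # Step 2: the target model checks each proposal. In a real system
    # this is ONE batched forward pass over all k positions, not k
    # serial calls -- that's where the speedup comes from.
    out = list(tokens)
    for t in proposed:
        want = target_next(out)
        if want != t:             # first disagreement:
            out.append(want)      # keep the target's token and stop
            return out
        out.append(t)             # proposal accepted "for free"
    out.append(target_next(out))  # bonus token from the verify pass
    return out
    # Output is identical to running target_next alone -- lossless.
</code></pre>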
]]></description><pubDate>Fri, 25 Oct 2024 06:25:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=41942732</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=41942732</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41942732</guid></item><item><title><![CDATA[New comment by germanjoey in "Civilization VII recommends 16 cores and 32GB RAM for 4K gameplay"]]></title><description><![CDATA[
<p>Simply increasing processing power for the AI isn't enough. Gameplay mechanics are intimately related to the capabilities of the AI.<p>For example, when they redesigned combat around the 1-Unit-Per-Tile (1UPT) mechanic for Civ 5, it crippled the AI's ability to wage war. Even if a high-difficulty AI could out-produce the player militarily, 1UPT left it logistics-limited in getting those units to the front. That means the AI can't threaten a player militarily, and thus loses its main lever for being "difficult."<p>Contrast this with Civ 4, where high-difficulty AIs were capable of completely overwhelming a player who didn't take them seriously. You couldn't just sit there teching up and use a small number of advanced units to fend off an invasion from a much larger and more aggressive neighbor. This was especially the case if you played against advanced fan-created AIs.<p>I'm hoping they get rid of 1UPT completely for Civ 7, but I have a feeling that's unlikely, because casual players (the majority of Civ buyers) actually like that 1UPT effectively removes tactical combat from the game.</p>
]]></description><pubDate>Fri, 04 Oct 2024 21:45:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=41745848</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=41745848</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41745848</guid></item><item><title><![CDATA[New comment by germanjoey in "We fine-tuned Llama 405B on AMD GPUs"]]></title><description><![CDATA[
<p>How are you verifying accuracy for your JAX port of Llama 3.1?<p>IMHO, the main reason to use pytorch is actually that the original model used pytorch. What can seem to be identical logic between different model versions may actually cause model drift when infinitesimal floating point errors accumulate due to the huge scale of the data. My experience is that debugging an accuracy mismatch like this in a big model is a torturous ordeal beyond the 10th circle of hell.</p>
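<p>A rough sketch of the usual first step, assuming both frameworks can dump per-layer activations to numpy for the same input (the layer-name dicts here are hypothetical, not anyone's actual tooling):</p>
<pre><code>import numpy as np

def compare_activations(torch_acts, jax_acts, rtol=1e-3, atol=1e-5):
    # torch_acts / jax_acts: dict of layer name -> numpy array, dumped
    # from the reference model and the port on the SAME input.
    for name, a in torch_acts.items():
        b = jax_acts[name]
        if a.shape != b.shape:
            print(f"{name}: shape mismatch {a.shape} vs {b.shape}")
            continue
        diff = np.abs(a - b)
        # Relative error guards against per-layer scale differences.
        rel = diff / (np.abs(a) + 1e-12)
        ok = np.allclose(a, b, rtol=rtol, atol=atol)
        print(f"{name}: max_abs={diff.max():.3e} "
              f"max_rel={rel.max():.3e} {'OK' if ok else 'MISMATCH'}")
</code></pre>
<p>Diffing layer by layer like this at least localizes where drift first appears, which matters because downstream layers amplify it.</p>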
]]></description><pubDate>Mon, 23 Sep 2024 23:30:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=41631662</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=41631662</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41631662</guid></item><item><title><![CDATA[New comment by germanjoey in "A post by Guido van Rossum removed for violating Python community guidelines"]]></title><description><![CDATA[
<p>Looks like some kind of power play...<p>Originally discussed here: <a href="https://news.ycombinator.com/item?id=41234180">https://news.ycombinator.com/item?id=41234180</a></p>
]]></description><pubDate>Thu, 29 Aug 2024 00:20:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=41385859</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=41385859</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41385859</guid></item><item><title><![CDATA[Sambanova breaks 1000 tokens/SEC on LLama3 8B]]></title><description><![CDATA[
<p>Article URL: <a href="https://twitter.com/ArtificialAnlys/status/1795480857404448953">https://twitter.com/ArtificialAnlys/status/1795480857404448953</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=40506135">https://news.ycombinator.com/item?id=40506135</a></p>
<p>Points: 7</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 28 May 2024 22:07:43 +0000</pubDate><link>https://twitter.com/ArtificialAnlys/status/1795480857404448953</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=40506135</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40506135</guid></item><item><title><![CDATA[New comment by germanjoey in "Model Explorer: intuitive and hierarchical visualization of model graphs"]]></title><description><![CDATA[
<p>Is there a demo of a model visualized using this somewhere? Even if it's just a short video... it's hard to tell what it's like from screenshots.</p>
]]></description><pubDate>Tue, 14 May 2024 21:25:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=40360334</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=40360334</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40360334</guid></item><item><title><![CDATA[New comment by germanjoey in "Groq CEO: 'We No Longer Sell Hardware'"]]></title><description><![CDATA[
<p>Cost-effective in what sense? Groq achieves low latency, not high efficiency, and not in a cost-effective way. Compare SambaNova achieving the same performance with 8 chips instead of 568, and at higher precision.</p>
]]></description><pubDate>Mon, 08 Apr 2024 00:18:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=39965094</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=39965094</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39965094</guid></item><item><title><![CDATA[New comment by germanjoey in "Try SambaNova chat: 1T param LLM, 500 tokens/SEC"]]></title><description><![CDATA[
<p>We're showing off our 1.05T param Composition of Experts LLM! It's 150 experts running on 1 node consisting of 8 SN40L RDU chips.<p>Each of our nodes has a huge amount of DDR attached, in addition to copious amounts of on-chip HBM and SRAM. This allows the system to switch between a variety of different models of different sizes and architectures at lightning speed. A highlight is one based on Llama2 7b, similar to the Groq demo, but executing with bf16/fp32 instead of int8. (And using only 8 chips instead of 568!)</p>
]]></description><pubDate>Fri, 29 Mar 2024 16:50:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=39866205</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=39866205</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39866205</guid></item><item><title><![CDATA[Try SambaNova chat: 1T param LLM, 500 tokens/SEC]]></title><description><![CDATA[
<p>Article URL: <a href="https://coe-1.cloud.snova.ai/">https://coe-1.cloud.snova.ai/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=39866204">https://news.ycombinator.com/item?id=39866204</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Fri, 29 Mar 2024 16:50:10 +0000</pubDate><link>https://coe-1.cloud.snova.ai/</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=39866204</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39866204</guid></item><item><title><![CDATA[New comment by germanjoey in "A hacker's guide to language models [video]"]]></title><description><![CDATA[
<p>Sambanova just launched something similar to what you're describing. It's a demo of their new chip running a 1T param MoE model: 150 7B Llama2s, each retrained to be an expert in a different topic. So one of them is a "law" expert, another a "physics" expert, etc.<p>They've got a video here [1] (scroll down slightly) that compares it against a 180B Falcon model running on GPUs on HuggingFace. The MoE results are not only just as good quality-wise, but also ridiculously fast. Like, nearly instant. A big benefit is that the experts can be swapped out and retrained with new data, which is obviously not as easy with the more monolithic 180B model.<p>[1] <a href="https://sambanova.ai/launch2023" rel="nofollow noreferrer">https://sambanova.ai/launch2023</a></p>
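<p>A toy sketch of the dispatch idea, with a keyword router standing in for what is presumably a learned classifier (all names here are illustrative, not SambaNova's API). Only one small expert runs per query, which is why individual experts are cheap to swap out and retrain:</p>
<pre><code># Toy Composition-of-Experts dispatch. EXPERT_KEYWORDS and the
# keyword router are illustrative stand-ins for a learned classifier.
EXPERT_KEYWORDS = {
    "law":     ["contract", "liability", "statute"],
    "physics": ["quantum", "momentum", "entropy"],
}

def route(prompt):
    text = prompt.lower()
    for topic, keywords in EXPERT_KEYWORDS.items():
        if any(k in text for k in keywords):
            return topic
    return "general"                     # fallback expert

def answer(prompt, experts):
    # experts: topic -> generate(prompt) callable; only the routed
    # (7B-sized) expert is ever invoked for a given query.
    return experts[route(prompt)](prompt)

# Usage with stub experts:
experts = {
    "law":     lambda p: "[law expert] ...",
    "physics": lambda p: "[physics expert] ...",
    "general": lambda p: "[general expert] ...",
}
print(answer("What is quantum entanglement?", experts))
</code></pre>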
]]></description><pubDate>Mon, 25 Sep 2023 01:07:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=37638630</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=37638630</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37638630</guid></item><item><title><![CDATA[SambaNova launches new SN40L chip; demo of 1T param CoE LLM]]></title><description><![CDATA[
<p>Article URL: <a href="https://sambanova.ai/launch2023/">https://sambanova.ai/launch2023/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=37590334">https://news.ycombinator.com/item?id=37590334</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 20 Sep 2023 21:20:02 +0000</pubDate><link>https://sambanova.ai/launch2023/</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=37590334</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37590334</guid></item><item><title><![CDATA[New comment by germanjoey in "GPT-4"]]></title><description><![CDATA[
<p>welp,<p>This report focuses on the capabilities, limitations, and safety properties of GPT-4. GPT-4 is a Transformer-style model [33] pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF) [34]. Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.</p>
]]></description><pubDate>Tue, 14 Mar 2023 19:22:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=35157016</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=35157016</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35157016</guid></item><item><title><![CDATA[New comment by germanjoey in "GPT-4"]]></title><description><![CDATA[
<p>How big is this model? (i.e., how many parameters?) I can't find this anywhere.</p>
]]></description><pubDate>Tue, 14 Mar 2023 18:54:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=35156664</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=35156664</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35156664</guid></item><item><title><![CDATA[New comment by germanjoey in "The maze is in the mouse: what ails Google"]]></title><description><![CDATA[
<p>I worked with the author for a couple of years, pre- and post-acquisition, and I have to admit that he drove me somewhat crazy sometimes too. Leaving that aside, I also had an immense amount of personal respect for him, because I could see how genuinely he cares about what he's doing: actively doing his best to do right by his customers. I think the author is 110% spot-on with his critique of Google here.</p>
]]></description><pubDate>Wed, 15 Feb 2023 17:21:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=34807244</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=34807244</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=34807244</guid></item><item><title><![CDATA[New comment by germanjoey in "Amazon to Lay Off over 17,000 Workers, More Than First Planned"]]></title><description><![CDATA[
<p>What's the new performance process?</p>
]]></description><pubDate>Thu, 05 Jan 2023 01:09:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=34254510</link><dc:creator>germanjoey</dc:creator><comments>https://news.ycombinator.com/item?id=34254510</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=34254510</guid></item></channel></rss>