Hacker News: jwan584

Mistral Flash Answers Run on Cerebras

jwan584 — Thu, 06 Feb 2025 23:24:11 +0000

Article URL: https://cerebras.ai/blog/mistral-le-chat

Comments URL: https://news.ycombinator.com/item?id=42967522

Points: 5

# Comments: 1

New comment by jwan584 in "The impact of competition and DeepSeek on Nvidia"

jwan584 — Mon, 27 Jan 2025 08:10:24 +0000

The point about using FP32 for training is wrong. Mixed precision (FP16 multiplies, FP32 accumulates) has been use for years – the original paper came out in 2017.

New comment by jwan584 in "100x defect tolerance: How we solved the yield problem"

jwan584 — Wed, 15 Jan 2025 23:30:03 +0000

A good talk on how Cerebras does power & cooling (8min) https://www.youtube.com/watch?v=wSptSOcO6Vw&ab_channel=Appli...

100x defect tolerance: How we solved the yield problem

jwan584 — Wed, 15 Jan 2025 21:19:15 +0000

Article URL: https://cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem

Comments URL: https://news.ycombinator.com/item?id=42717165

Points: 331

# Comments: 179

New comment by jwan584 in "Cerebras Inference: AI at Instant Speed"

jwan584 — Tue, 27 Aug 2024 20:19:23 +0000

batch size by Q4 will be solid double digits (cerebras employee)

Cerebras CS-3: the fastest and most scalable AI accelerator

jwan584 — Wed, 13 Mar 2024 22:13:28 +0000

Article URL: https://www.cerebras.net/blog/cerebras-cs3

Comments URL: https://news.ycombinator.com/item?id=39698217

Points: 2

# Comments: 0

New comment by jwan584 in "GigaGPT: GPT-3 sized models in 565 lines of code"

jwan584 — Mon, 11 Dec 2023 19:52:44 +0000

when you go from 1B to 175B, the model no longer fits in memory. so in practice you have to re-factor the model using tensor/pipeline parallelism. that's why it goes from 600 to 20K LOC.

New comment by jwan584 in "GigaGPT: GPT-3 sized models in 565 lines of code"

jwan584 — Mon, 11 Dec 2023 19:50:26 +0000

Everyone knows Cerebras by their wafer scale chips. The less understood part is the 12TB of external memory. That's the real reason why large models fit by default and you don't have to chop it up in software ala megatron/deepspeed.

New comment by jwan584 in "BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model"

jwan584 — Fri, 22 Sep 2023 18:35:13 +0000

A helpful paper with the full recipe Cerebras uses to train LLMs and their process including: - Extensively deduplicated dataset (SlimPajama) - Hyperparameter search using muP - Variable sequence length training + ALiBi - Aggressive LR decay