New comment by mitchelld in "S1: A $6 R1 competitor?"

mitchelld — Wed, 05 Feb 2025 22:44:45 +0000

> This is not "just" machine learning because we have never been able to do things which we are today and this is not only the result of better hardware. Better hardware is actually a byproduct. Why build a PFLOPS GPU when there is nothing that can utilize it?

This is the line of thinking I'm referring to.

The "context" problem had already been somewhat solved. The attention mechanism existed prior to Transformers and was specifically used on RNNs. Transformers certainly improved it, but the innovation of the architecture was making it computationally efficient to train.

I'm not really following your argument. Clearly you're acknowledging that it was first the case that, with the hardware at the time, researchers demonstrated that simply scaling up training with more data yielded better models. The fact that hardware was then optimized for these architectures only reinforces this point.

All the papers discussing scaling laws point to the same thing: simply using more compute and data yields better results.
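
For reference, one commonly cited parametric form from that literature (the one fitted in the Chinchilla paper by Hoffmann et al., 2022; it is not quoted anywhere in this thread) models the loss as a function of parameter count N and training tokens D. E is the irreducible loss, and A, B, alpha, beta are positive constants fitted to training runs. Treat it as a sketch of the general shape, not a claim about any specific model:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Both correction terms shrink as N and D grow, which is the "more compute and data yields better results" point in formula form.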

> this is not only the result of better hardware

Regarding this in particular: a majority of the improvement from GPT-2 to GPT-4 came simply from training at a much larger scale. That was enabled by better hardware and lots of it.

New comment by mitchelld in "S1: The $6 R1 Competitor?"

mitchelld — Wed, 05 Feb 2025 17:03:09 +0000

This line of thinking doesn't really correspond to the reason Transformers were developed in the first place, which was to better utilize how GPUs do computation. RNNs were too slow to train at scale because you had to compute the time steps sequentially, whereas Transformers (with masking) can run the whole input through in a single pass.
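
To make the sequential-vs-parallel point concrete, here is a minimal toy sketch in pure Python. Everything in it is an assumption for illustration: scalar inputs, the hypothetical weights `w_x` and `w_h`, and attention with q = k = v = x and no learned projections. It is not how any real model is implemented. It only shows the dependency structure: each RNN step needs the previous hidden state, while each causally masked attention output depends only on the inputs, so the positions can be computed in any order (and therefore all at once on parallel hardware).

```python
import math

def rnn_forward(xs, w_x=0.5, w_h=0.9):
    """Toy scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = 0.0
    hs = []
    for x in xs:  # step t cannot start until step t-1 has produced h
        h = math.tanh(w_x * x + w_h * h)
        hs.append(h)
    return hs

def causal_attention_at(xs, t):
    """Output at position t of toy scalar self-attention with a causal mask.

    Uses q = k = v = x (an illustrative simplification). The result depends
    only on the inputs xs[0..t], never on another position's output.
    """
    scores = [xs[t] * xs[j] for j in range(t + 1)]  # causal mask: only j <= t
    m = max(scores)                                 # subtract max for stable softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum(w * xs[j] for j, w in enumerate(weights)) / z

def causal_attention(xs, order=None):
    """Compute every position's output, visiting positions in any order."""
    order = range(len(xs)) if order is None else order
    out = [None] * len(xs)
    for t in order:  # iterations are independent of each other
        out[t] = causal_attention_at(xs, t)
    return out

xs = [0.1, -0.4, 0.7, 0.2]
forward = causal_attention(xs)
backward = causal_attention(xs, order=reversed(range(len(xs))))
print(forward == backward)  # order of evaluation does not matter
print(rnn_forward(xs))      # this one has to be computed left to right
```

On a GPU the independent loop in `causal_attention` becomes a single batched matrix operation over all positions, which is the "single pass" during training.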

It is worth noting that the first "LLM" you're referring to was only 300M parameters, but even then the amount of training required (at the time) was such that training a model like that outside of a big tech company was infeasible. Obviously now we have models that are in the hundreds of billions / trillions of parameters. The ability to train these models is directly a result of better / more hardware being applied to the problem, as well as the Transformer architecture being specifically designed to better conform with parallel computation at scale.

The first GPT model came out ~8 years ago. I recall when GPT-2 came out they initially didn't want to release the weights out of concern for what the model could be used for; looking back now, that's kind of amusing. However, fundamentally, all these models are the same setup as what was used then: decoder-based Transformers. They are just substantially larger, trained on substantially more data, and trained with substantially more hardware.

Hacker News: mitchelld
