New comment by pseudollm in "Gemma 4 12B: A unified, encoder-free multimodal model"

pseudollm — Thu, 04 Jun 2026 00:01:14 +0000

No, there isn't. Read the paper. It's just 40 ms of raw audio samples, multiplied by a single matrix to produce a 3800-dimensional input vector. That's it. The next 40 ms are fed in at the next transformer input step, without any positional encoding. Repeat ad infinitum.
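To make that concrete, here's a rough sketch of the "chunk, multiply by one matrix, feed in order" idea. It is not the paper's code. The 16 kHz sample rate is my assumption (it isn't stated above), the weights are random placeholders, and every function name is made up for illustration. The toy run uses 8 output dimensions instead of 3800 so it finishes quickly in pure Python.

```python
import random

SAMPLE_RATE_HZ = 16_000  # ASSUMPTION: sample rate is not given in the comment
FRAME_MS = 40            # from the comment: 40 ms of raw audio per step
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 640 samples per frame


def make_projection(frame_len, d_model, seed=0):
    """Placeholder weights: a frame_len x d_model matrix of small random values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(frame_len)]


def frames(samples, frame_len):
    """Yield consecutive non-overlapping frames; a trailing partial frame is dropped."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]


def project(frame, weight):
    """One matrix multiply: raw samples -> one input vector."""
    d_model = len(weight[0])
    return [sum(frame[i] * weight[i][j] for i in range(len(frame)))
            for j in range(d_model)]


def audio_to_inputs(samples, weight):
    """One input vector per frame, in order. No positional encoding is added."""
    return [project(frame, weight) for frame in frames(samples, len(weight))]


if __name__ == "__main__":
    toy_weight = make_projection(FRAME_LEN, d_model=8)  # 8 stands in for 3800
    one_second = [0.0] * SAMPLE_RATE_HZ                 # 1 s of silence
    vectors = audio_to_inputs(one_second, toy_weight)
    print(len(vectors), len(vectors[0]))                # 25 8
```

Under that assumed sample rate, one second of audio becomes 25 input steps, each produced by the same matrix.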

New comment by pseudollm in "Gemma 4 12B: A unified, encoder-free multimodal model"

pseudollm — Wed, 03 Jun 2026 23:52:43 +0000

> usefulness of the RTX Spark

Not really. There's a reason the announcement didn't include ANY benchmark (!) and didn't say EXACTLY what the memory bandwidth is. It's going to be dog-slow and unusable for large models, since tok/sec is basically memory bandwidth divided by active weights. The rumoured 300 GB/s / 30 GB of active weights (a decent model) = 10 tokens per second, which is really slow.
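The back-of-envelope math as a tiny function, if you want to plug in other numbers. It's a rough upper bound for single-stream decoding that assumes every active weight is read from memory once per token and ignores KV cache traffic and compute. The 300 GB/s figure is still only a rumour, and the function name is just mine.

```python
def tokens_per_second(bandwidth_gb_s, active_weights_gb):
    """Rough decode-speed ceiling: memory bandwidth / bytes of active weights per token."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


if __name__ == "__main__":
    # Rumoured 300 GB/s, 30 GB of active weights (numbers from the comment above)
    print(tokens_per_second(300, 30))  # 10.0
```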

Hacker News: pseudollm
