Hacker News: nshm

New comment by nshm in "The past was not that cute"

nshm — Sun, 07 Dec 2025 11:28:03 +0000

> Running a family was a brutal two-person job -- and the kids had to dive in to help out the second they could lift something heavier than a couple pounds.

Orphans did struggle, but most families were not just two people. Families were big and supported by the community.

New comment by nshm in "Omnilingual ASR: Advancing automatic speech recognition for 1600 languages"

nshm — Tue, 11 Nov 2025 05:26:14 +0000

You can check out the whale sound recognition project: https://arxiv.org/abs/2104.08614

New comment by nshm in "Omnilingual ASR: Advancing automatic speech recognition for 1600 languages"

nshm — Tue, 11 Nov 2025 05:05:07 +0000

Moreover, you cannot easily tune those models for practical applications. The model was originally trained on very clean data, so the lower layers are also not very stable on diverse inputs. To fine-tune it you have to update the whole model, not just the upper layers.

New comment by nshm in "Omnilingual ASR: Advancing automatic speech recognition for 1600 languages"

nshm — Tue, 11 Nov 2025 04:21:05 +0000

This model is actually expected to be bad for popular languages. Just like the previous MMS, it is not accurate at all: it wins by supporting rare languages well, but it never had good ASR accuracy even for Swedish etc. It is more a research thing than a real tool, unlike Whisper.

New comment by nshm in "Sesame CSM: A Conversational Speech Generation Model"

nshm — Tue, 18 Mar 2025 15:33:05 +0000

It is useless, actually. It is very slow, the quality is suboptimal, and it is just the speech generation component. See the discussion here:

https://github.com/SesameAILabs/csm/issues/80

New comment by nshm in "What happened to BERT and T5?"

nshm — Fri, 19 Jul 2024 21:02:27 +0000

No, there are mathematical reasons LLMs are better. They are trained with a multi-objective loss (coding skills, translation skills, etc.), so they understand the world much better than MLMs. The original post discusses that, but with more words and points than necessary.

New comment by nshm in "Reasoning in Large Language Models: A Geometric Perspective"

nshm — Sun, 07 Jul 2024 21:11:29 +0000

It is actually pretty straightforward why those models "reason" or, to be more exact, can operate on complex concepts. By processing huge amounts of text they build an internal representation where those concepts are represented as simple nodes (neurons or groups of neurons). So they really distill knowledge. Alternatively, you can think of it as a very good principal component analysis that can extract many important aspects, or as a semantic graph built automatically.

Once the knowledge is distilled you can easily build on top of it, for example by merging concepts.

So no secret here.

New comment by nshm in "ChatTTS-Best open source TTS Model"

nshm — Wed, 29 May 2024 06:06:40 +0000

There is also a glitch in "dialogue"

New comment by nshm in "Sergey Brin on Gemini 1.5 Pro (03/02/2024) [video]"

nshm — Mon, 04 Mar 2024 19:18:38 +0000

Does anyone besides me think he doesn't look very healthy? It's strange, he is kind of slow in the video where he enters the room. Maybe some biohacking.

New comment by nshm in "BASE TTS: The largest text-to-speech model to-date"

nshm — Thu, 15 Feb 2024 21:53:13 +0000

Yes, it is one of the important aspects, in particular if you use TTS to create an audiobook or for video production.

New comment by nshm in "BASE TTS: The largest text-to-speech model to-date"

nshm — Wed, 14 Feb 2024 21:58:25 +0000

Err, I deeply respect the Amazon TTS team, but this paper and the synthesis are... You publish a paper in 2024 and include YourTTS in your baselines to look better. Come on! There is XTTS2 around!

The voice sounds robotic and plain. Most likely there are a lot of audiobooks in the training data and less conversational speech. And dropping diffusion was not a great idea: the voice is not crystal clear anymore, it is more like a telephony recording.

New comment by nshm in "BASE TTS: The largest text-to-speech model to-date"

nshm — Wed, 14 Feb 2024 21:51:57 +0000

Metavoice is one of a dozen GPT-based TTS systems around, starting from Tortoise, and honestly not that great. You can clearly hear "glass scratches" in their sound because they trained on MP3-compressed data.

There are much clearer-sounding systems around. You can listen to StyleTTS2 to compare.

New comment by nshm in "OpenAI releases Whisper v3, new generation open source ASR model"

nshm — Mon, 06 Nov 2023 19:06:49 +0000

Good improvements for many languages. The numbers are here:

https://github.com/openai/whisper/blob/main/language-breakdo...

New comment by nshm in "Goodbye, Node.js Buffer"

nshm — Tue, 24 Oct 2023 15:27:11 +0000

Ok, first we screwed up buffers by making them globally tracked instead of just a piece of memory. Now it's time to break all binary modules again.

New comment by nshm in "1400 year old gold foil figures found in Norwegian pagan temple"

nshm — Sun, 08 Oct 2023 07:29:36 +0000

Ok, but the photos look very suspicious. 1400-year-old gold right from the ground shouldn't shine like that. Compare to the coins here, for example:

https://www.smithsonianmag.com/smart-news/ancient-welsh-gold...

New comment by nshm in "LLaMa running at 5 tokens/second on a Pixel 6"

nshm — Wed, 15 Mar 2023 18:30:12 +0000

Great, thanks a lot.

So we have numbers on PTB: original perplexity 8.79, quantized 9.68, already 10% worse. And the PPL is reported per token, I suppose? Because word-level PPL for PTB must be around 20, not less than 10.

Any numbers on more complex tasks then, like QA?
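To make the arithmetic concrete, here is a rough sketch of both checks: the relative degradation, and how a per-token perplexity maps to a per-word one. The total log-likelihood of the text is the same either way, only the normalizer changes, so word PPL = token PPL raised to the power (tokens per word). The ratio of 1.4 tokens per word below is an assumption for illustration, not something measured on PTB with the LLaMA tokenizer.

```python
import math


def relative_increase(ppl_base: float, ppl_new: float) -> float:
    """Relative perplexity increase, e.g. 0.10 means 10% worse."""
    return (ppl_new - ppl_base) / ppl_base


def word_perplexity(token_ppl: float, tokens_per_word: float) -> float:
    """Convert per-token perplexity to per-word perplexity.

    The corpus log-likelihood is fixed; dividing it by the number of words
    instead of the number of tokens scales the log-perplexity by
    tokens_per_word.
    """
    return math.exp(math.log(token_ppl) * tokens_per_word)


# Numbers quoted in the thread.
print(f"degradation: {relative_increase(8.79, 9.68):.1%}")

# ASSUMPTION: about 1.4 subword tokens per word, purely illustrative.
tokens_per_word = 1.4
print(f"word-level PPL: {word_perplexity(8.79, tokens_per_word):.2f}")
```

With a ratio in that range, a per-token 8.79 lands close to the word-level figure of about 20, which is why the two numbers should not be compared directly.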

New comment by nshm in "LLaMa running at 5 tokens/second on a Pixel 6"

nshm — Wed, 15 Mar 2023 17:54:49 +0000

Do you have the numbers? I suspect it is way worse. The original llama.cpp authors never measured any numbers either.

New comment by nshm in "LLaMa running at 5 tokens/second on a Pixel 6"

nshm — Wed, 15 Mar 2023 17:50:10 +0000

It is not really LLaMA, it is LLaMA quantized to 4 bits. It does not even have the quality of the original 7B. I could also quantize it to 1 bit and claim it runs on my RPi 3.
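A toy sketch of why the bit width matters: round-trip some random "weights" through a uniform quantizer and look at the error. This is a deliberately naive min/max scheme with a single scale for the whole tensor, not what llama.cpp does (real schemes typically quantize in small blocks with per-block scales), and the Gaussian weights are an assumption. It only shows the trend.

```python
import random


def quantize_dequantize(values, bits):
    """Round each value to the nearest of 2**bits evenly spaced levels
    between min(values) and max(values), then map back to floats."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    steps = 2 ** bits - 1
    scale = (hi - lo) / steps
    return [lo + round((v - lo) / scale) * scale for v in values]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# ASSUMPTION: weights are roughly Gaussian; real layers differ.
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]

for bits in (8, 4, 1):
    err = mean_abs_error(weights, quantize_dequantize(weights, bits))
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

Weight error is only a proxy. The number that matters is perplexity or task accuracy after quantization, which is what has to be measured.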

New comment by nshm in "Tell HN: Please let me just buy stuff without having to “Contact Sales”"

nshm — Mon, 13 Feb 2023 22:43:27 +0000

In an area as actively developed as TTS/ASR, there is a high chance that a custom solution would fit your needs much better. The feature set of TTS is actually pretty large and hard to combine in a single ML model. No free lunch, you know.

For example, if you are looking for a singing voice, they might suggest an adapted model that is good specifically for singing.

The testing process is also not very straightforward: you need to understand what to test and how to test it properly. For example, some of their voices might be better for questions, some for news.

You'd better talk to them.

New comment by nshm in "Show HN: Self-host Whisper As a Service with GUI and queueing"

nshm — Mon, 13 Feb 2023 20:11:21 +0000

Vosk

https://alphacephei.com/vosk/lm

You can restrict the vocabulary any way you like. For example, here is a chess app built with Vosk:

https://www.chessvis.com/
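A minimal sketch of restricting the vocabulary in the Python bindings: `KaldiRecognizer` accepts a JSON list of allowed phrases as its third argument, with "[unk]" catching everything else. The phrase list, the `build_grammar` and `make_recognizer` helpers, and the model directory are made up for illustration and have nothing to do with how the chess app above is actually built.

```python
import json


def build_grammar(phrases, allow_unknown=True):
    """Build the JSON phrase list that KaldiRecognizer takes as a grammar."""
    items = [p.strip().lower() for p in phrases if p.strip()]
    if allow_unknown:
        items.append("[unk]")  # bucket for out-of-grammar speech
    return json.dumps(items)


def make_recognizer(model_dir, phrases, sample_rate=16000):
    """Create a recognizer limited to the given phrases.

    Needs `pip install vosk` and a downloaded model unpacked at model_dir
    (hypothetical path). The import is local so the grammar helper works
    without vosk installed.
    """
    from vosk import Model, KaldiRecognizer

    return KaldiRecognizer(Model(model_dir), sample_rate, build_grammar(phrases))


# Hypothetical chess-style command set, spelled out as words.
CHESS_PHRASES = ["knight to f three", "pawn to e four", "castle kingside"]
print(build_grammar(CHESS_PHRASES))
```

Usage would be something like `rec = make_recognizer("model", CHESS_PHRASES)`, then feeding 16 kHz mono PCM chunks to `rec.AcceptWaveform(data)` and reading `rec.Result()`.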