<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: tbalsam</title><link>https://news.ycombinator.com/user?id=tbalsam</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 06 May 2026 21:42:05 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=tbalsam" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by tbalsam in "NanoChat – The best ChatGPT that $100 can buy"]]></title><description><![CDATA[
<p>This is the common belief, but not quite correct! The Muon update was proposed by Bernstein as the result of a theoretical paper suggesting concrete realizations of the theory, and Keller implemented it and added the practical pieces needed to make it work well (input/output AdamW, aggressive coefficients, post-Nesterov, etc.).<p>I feel both share equal credit (along with the paper's co-authors!); both put in a lot of hard work on it, though I tend to bring up Bernstein since he tends to be pretty quiet about it himself.<p>(Source: am an experienced speedrunner who's been in these circles for a decent amount of time)</p>
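<p>For readers who haven't seen it: the core of the Muon update (after the momentum average) is an approximate orthogonalization of the gradient matrix via a Newton-Schulz-style odd-polynomial iteration. A rough numpy sketch; the function name is mine, and the polynomial coefficients follow the publicly posted Muon implementation, so treat the exact values as an assumption:

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a gradient matrix: push every
    singular value of the normalized input toward 1 using an odd
    polynomial iteration (coefficients from the public Muon code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # Frobenius-normalize so SVs <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                        # iterate on the short side
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x
```

The point of the iteration is that the update ends up with a roughly uniform "scale" in every direction, regardless of how skewed the raw gradient's singular values were.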
]]></description><pubDate>Mon, 13 Oct 2025 21:07:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=45573348</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=45573348</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45573348</guid></item><item><title><![CDATA[New comment by tbalsam in "Fp8 runs ~100 tflops faster when the kernel name has "cutlass" in it"]]></title><description><![CDATA[
<p><i>shocked quack</i></p>
]]></description><pubDate>Fri, 03 Oct 2025 13:12:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=45462611</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=45462611</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45462611</guid></item><item><title><![CDATA[New comment by tbalsam in "Problem solving using Markov chains (2007) [pdf]"]]></title><description><![CDATA[
<p>I'm not entirely sure, to be honest. If you look at the linked video, they state that it's often not in the private equity group's moneymaking interest to announce that a channel has been sold to them.<p>How that plays out in practice, I'm not sure, and with some sleuthing it would probably be possible to find out at least some of it. But beyond that, I'm honestly not sure.</p>
]]></description><pubDate>Thu, 31 Jul 2025 16:36:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=44747344</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=44747344</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44747344</guid></item><item><title><![CDATA[New comment by tbalsam in "Problem solving using Markov chains (2007) [pdf]"]]></title><description><![CDATA[
<p>They unfortunately recently (within the last few years) sold out to private equity (which tends to gloss over fundamentals and pump out massive amounts of content, trading on the brand's previous quality for credence), so beware of the quality of more recent vids:<p><a href="https://youtu.be/hJ-rRXWhElI?si=Zdsj9i_raNLnajzi" rel="nofollow">https://youtu.be/hJ-rRXWhElI?si=Zdsj9i_raNLnajzi</a></p>
]]></description><pubDate>Wed, 30 Jul 2025 16:48:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=44736515</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=44736515</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44736515</guid></item><item><title><![CDATA[New comment by tbalsam in "Large language models are improving exponentially?"]]></title><description><![CDATA[
<p>There are versions of this kind of benchmark with a higher threshold; however, that only seems to shift the timetable by a linear amount, so you're only buying a year or two depending on what you want that % success rate to be.</p>
]]></description><pubDate>Sat, 05 Jul 2025 12:47:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=44472457</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=44472457</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44472457</guid></item><item><title><![CDATA[New comment by tbalsam in "Large language models are improving exponentially"]]></title><description><![CDATA[
<p>For those curious: <a href="https://en.m.wikipedia.org/wiki/Zombo.com" rel="nofollow">https://en.m.wikipedia.org/wiki/Zombo.com</a></p>
]]></description><pubDate>Sat, 05 Jul 2025 12:46:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=44472452</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=44472452</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44472452</guid></item><item><title><![CDATA[New comment by tbalsam in "Large language models are improving exponentially?"]]></title><description><![CDATA[
<p>The only limit is yourself<p>Source: One of the most classic internet websites, zombo.com (sound on)</p>
]]></description><pubDate>Sat, 05 Jul 2025 12:42:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=44472423</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=44472423</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44472423</guid></item><item><title><![CDATA[New comment by tbalsam in "Show HN: TokenDagger – A tokenizer faster than OpenAI's Tiktoken"]]></title><description><![CDATA[
<p>No! This is not good.<p>Iteration speed trumps all in research; most of what Python does is launch GPU operations, so if you're seeing slowdowns from Python-land, you're doing something terribly wrong.<p>Python is an excellent (and yes, fast!) language for orchestrating and calling ML stuff. If C++ code is needed, call it as a module.</p>
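<p>A quick way to see this for yourself (a toy sketch; numpy's compiled BLAS stands in for GPU kernel launches):

```python
import time
import numpy as np

def python_dispatch_overhead(n_calls: int) -> float:
    """Time n_calls of pure Python-level loop iterations (the
    'orchestration' cost that stays in the interpreter)."""
    t0 = time.perf_counter()
    for _ in range(n_calls):
        pass
    return time.perf_counter() - t0

def backend_compute(a: np.ndarray, n_calls: int) -> float:
    """Time n_calls of a heavy op executed in compiled code
    (BLAS here, standing in for a launched GPU kernel)."""
    t0 = time.perf_counter()
    for _ in range(n_calls):
        a @ a
    return time.perf_counter() - t0

a = np.random.rand(1024, 1024)
loop_t = python_dispatch_overhead(100)
work_t = backend_compute(a, 100)
print(f"interpreter overhead: {loop_t:.6f}s, compiled compute: {work_t:.3f}s")
```

On any reasonable machine the interpreter-side loop is orders of magnitude cheaper than the work it dispatches, which is why "Python is slow" rarely matters for this pattern.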
]]></description><pubDate>Mon, 30 Jun 2025 15:38:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=44424608</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=44424608</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44424608</guid></item><item><title><![CDATA[New comment by tbalsam in "Look Ma, No Bubbles: Designing a Low-Latency Megakernel for Llama-1B"]]></title><description><![CDATA[
<p>This is (and was) the dream of Cerebras, and I am very glad to see it embraced, even if only in small part, on a GPU. Wild to see how much performance is left on the table for these things; it's crazy to think how much can be done by a few bold individuals when it comes to pushing the SOTA of these kinds of things (not just in kernels either -- in other areas as well!)<p>My experience has been that getting over the daunting factor of a big, wide world full of noise and marketing, and simply committing to a problem, learning it, and slowly bootstrapping it over time, tends to yield phenomenal results in the long run for most applications. And, if not, then there's often an applicable adjacent field that can be pivoted to for still making immense progress.<p>The big players may have the advantage of scale, but there is so, so much that can still be done if you look around and keep a good feel for it. <3 :)</p>
]]></description><pubDate>Wed, 28 May 2025 01:01:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=44111955</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=44111955</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44111955</guid></item><item><title><![CDATA[New comment by tbalsam in "The Speed of VITs and CNNs"]]></title><description><![CDATA[
<p>Not bad frustrations at all. That said -- IoU is just how the final box scores are calculated; it doesn't change how you do feature aggregation, and it will show up in basically any technique you use.<p>Modern SSD/YOLO-style detectors use efficient feature pyramids; you need them to know where to propose things in the image.<p>This sounds a lot like going back to old-school object detection techniques, which tend to be far less compute-efficient in general.</p>
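<p>For concreteness, the IoU used for final box scoring is just this (a quick sketch; boxes as hypothetical [x1, y1, x2, y2] lists):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned [x1, y1, x2, y2] boxes."""
    # intersection rectangle (clamped to zero area if the boxes don't overlap)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

The point being that it's purely a scoring/matching criterion on the final boxes; it says nothing about how features were aggregated to produce them.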
]]></description><pubDate>Mon, 05 May 2025 15:07:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=43895903</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=43895903</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43895903</guid></item><item><title><![CDATA[New comment by tbalsam in "The Speed of VITs and CNNs"]]></title><description><![CDATA[
<p>> The MSE here is not intended to be a training loss, but as a means to demonstrate that both approaches lead to almost the same result except for some rounding error.<p>Ah, gotcha<p>> I don't think that max pooling the last feature maps would be a good idea here, because it would cut off about 98 % of the gradients and training would take much longer. (The shape of the input feature layer is (1, 768, 7, 7), pooled to (1, 768, 1, 1).)<p>MaxPooling generally only helps if you train the network with it, but in most cases it ends up performing better. That sparsity actually ends up being a good thing; you generally need to suppress all of those unused activations! It ends up being quite a wide gap in practice (and, if you have convolutions beforehand, using avgpooling2d is a bit of wasted computation blurring the input).<p>> Could you elaborate on that?<p>Variable-sized inputs don't batch easily, as the input dims need to match. You can go down the padding route, but that has its own particularly hellacious costs that end up taking compute away from other useful things.</p>
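<p>To make the comparison concrete, here's a rough numpy sketch of the two global-pooling choices on a feature map of that shape (shapes taken from the numbers above; the data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.random((1, 768, 7, 7))  # (N, C, H, W) final feature map

# Global max pool: one winner per channel; in training, gradient flows
# only to that winning position (the ~1/49 "sparsity" discussed above).
gmax = feats.max(axis=(2, 3), keepdims=True)   # -> (1, 768, 1, 1)

# Global average pool: gradient is spread evenly over all 49 positions.
gavg = feats.mean(axis=(2, 3), keepdims=True)  # -> (1, 768, 1, 1)
```

Both collapse (1, 768, 7, 7) to (1, 768, 1, 1); the difference is entirely in which spatial positions receive gradient during training.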
]]></description><pubDate>Mon, 05 May 2025 15:03:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=43895868</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=43895868</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43895868</guid></item><item><title><![CDATA[New comment by tbalsam in "The Speed of VITs and CNNs"]]></title><description><![CDATA[
<p>As someone who's done a fair bit of architecture work: both are important! Making it either/or is very silly; each is the limiting factor for the other, and there are no two ways about it.<p>Also, for classification, MaxPooling is often far superior; you can learn an average smoothing filter in your convolutions beforehand, in a data-dependent manner, so that the Nyquist sampling requirements are properly preserved.<p>Also, please use smoothed crossentropy for image classification (generally speaking, unless the data is hilariously large); MSE won't nearly cut it!<p>But that being said, adaptive stuff certainly is great when doing classification. Something to note is that batching does become an issue at a certain point, as do certain other fine-grained details, if you're simply going to average it all down to one single vector (IIUC).</p>
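<p>A minimal sketch of what I mean by smoothed crossentropy (one common convention; some implementations instead spread eps over all classes including the true one):

```python
import numpy as np

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a label-smoothed target distribution:
    the true class gets probability 1 - eps, the others share eps."""
    n = logits.shape[0]
    log_probs = logits - logits.max()                    # numerical stability
    log_probs = log_probs - np.log(np.exp(log_probs).sum())
    smooth = np.full(n, eps / (n - 1))                   # spread eps uniformly
    smooth[target] = 1.0 - eps
    return float(-(smooth * log_probs).sum())
```

With eps=0 this reduces to plain crossentropy; with eps&gt;0 the loss stops rewarding the network for driving the true-class probability all the way to 1, which is the regularizing effect you want on image class targets.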
]]></description><pubDate>Sun, 04 May 2025 21:06:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=43889585</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=43889585</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43889585</guid></item><item><title><![CDATA[New comment by tbalsam in "The Speed of VITs and CNNs"]]></title><description><![CDATA[
<p>As someone who has worked in computer vision ML for nearly a decade: this sounds like a terrible idea.<p>You don't remotely need RL for this usecase. Image resolution pyramids are pretty normal, though, and handling them well/efficiently is the big thing. Using RL for this would be like trying to use graphene to make a computer screen because it's new and flashy and everyone's talking about it. RL is inherently very sample-inefficient; it's there to approximate a learning signal when you don't have certain well-defined, informative components, which we have in computer vision in spades. Crossentropy losses (and the like) are (generally, IME/IMO) what RL losses try to approximate, only on a much larger (and more poorly-defined) scale.<p>Please mark speculation as such; I've seen people take confident statements like this and spend a lot of time/manhours on them (because they seem plausible). It's not a bad idea from a creativity standpoint, but practically it is most certainly not the way to go about it.<p>(That being said, you can try dynamic sparsity stuff; it has some painful tradeoffs that generally don't scale, but no way in Illinois do you need RL for that)</p>
]]></description><pubDate>Sun, 04 May 2025 17:50:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=43888202</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=43888202</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43888202</guid></item><item><title><![CDATA[New comment by tbalsam in "Show HN: I built a synthesizer based on 3D physics"]]></title><description><![CDATA[
<p>McCormick is a popular brand of seasonings hahaha<p><a href="https://i5.walmartimages.com/seo/McCormick-Pure-Ground-Black-Pepper-6-oz-Can_c22c4c20-7dc8-4fe4-ae99-657dedfee4da.b874212a745ecac28cfc4fafb2ca65d3.jpeg" rel="nofollow">https://i5.walmartimages.com/seo/McCormick-Pure-Ground-Black...</a></p>
]]></description><pubDate>Sat, 03 May 2025 01:15:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=43876102</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=43876102</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43876102</guid></item><item><title><![CDATA[New comment by tbalsam in "Show HN: I built a synthesizer based on 3D physics"]]></title><description><![CDATA[
<p>Yes, it's a synthesizer -- you may know it inside and out, but having demo videos showing what it can do will help people with no context get that quick "ahhh, that makes sense" moment from things. :)</p>
]]></description><pubDate>Fri, 02 May 2025 21:50:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=43874902</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=43874902</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43874902</guid></item><item><title><![CDATA[New comment by tbalsam in "The Impossible Contradictions of Mark Twain"]]></title><description><![CDATA[
<p>If you would like the original link, not heisted through the gwern domain, I'd encourage you to read it from UPenn (the university where the professor who wrote this works): <a href="https://web.english.upenn.edu/~cavitch/pdf-library/Cavitch_Twain_Talking_Cure_Literary_Form.pdf" rel="nofollow">https://web.english.upenn.edu/~cavitch/pdf-library/Cavitch_T...</a></p>
]]></description><pubDate>Fri, 02 May 2025 20:53:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=43874470</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=43874470</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43874470</guid></item><item><title><![CDATA[New comment by tbalsam in "Hacker News Hug of Deaf"]]></title><description><![CDATA[
<p>I did my part and manually reloaded the page about once a second for 5 minutes so that Andrew could get their dev-validation beep quota in for the day (unless it's not naive hits but unique-user based, in which case this has been a fantastically hilarious waste of time).</p>
]]></description><pubDate>Thu, 10 Apr 2025 16:26:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=43645507</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=43645507</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43645507</guid></item><item><title><![CDATA[New comment by tbalsam in "Japanese scientists create new plastic that dissolves in saltwater overnight"]]></title><description><![CDATA[
<p>I don't think they really did? A single scratch causing it to break down doesn't seem like a scalable property for any kind of mass-produced material like this. It would cause chaos if any individual container in a shipment went bad, so I feel it's not really addressed. OP's concerns still stand.</p>
]]></description><pubDate>Fri, 28 Mar 2025 15:31:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=43506703</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=43506703</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43506703</guid></item><item><title><![CDATA[New comment by tbalsam in "Speedrunners are vulnerability researchers, they just don't know it yet"]]></title><description><![CDATA[
<p>I'm a speedrunner, and I'm pretty sure this is well known, and accepted as standard practice in some categories (to the point that the headline reads as almost a mild offense!).<p>In the gaming world, undefined software behavior is critical to this sort of thing; we see this especially in games like Ocarina of Time, with its legendary speedrun exploits.<p>I mean, in Super Mario World, SethBling did code injection to manually run a version of Flappy Bird (how ironic, given the origin of the pipes!) inside the game. By hand. No savestates. It took forever, and the run-through is really and truly fascinating: <a href="https://youtu.be/hB6eY73sLV0?si=nIP07o_fa6O9rauW" rel="nofollow">https://youtu.be/hB6eY73sLV0?si=nIP07o_fa6O9rauW</a><p>I speedrun things other than games as well, so the generalization is not just that we are security researchers; we are people who fundamentally learn the "shape" of a thing very, very well, and the ways that shape can be used to get from one state to another.<p>In conclusion: yes, it can be something as simple as security research! But the joy and the beauty of speedrunning is something much bigger than that, though security research certainly is one outcome that can be had!</p>
]]></description><pubDate>Sun, 02 Mar 2025 20:24:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=43234646</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=43234646</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43234646</guid></item><item><title><![CDATA[New comment by tbalsam in "Generating an infinite world with the Wave Function Collapse algorithm"]]></title><description><![CDATA[
<p>Yes, this is the point that violates the WFC algorithm and makes it no longer WFC.<p>It is now just a procedural algorithm, which is faster than WFC but loses some of the magic of what makes WFC _so good_.<p>You can tell by looking at the renders too, the before-and-after of both methods. The difference is stark.<p>That being said, it is cool as a runtime-optimized, non-WFC, WFC-approximating algorithm.</p>
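<p>For anyone who hasn't implemented it: the "magic" being lost is the minimum-entropy collapse plus constraint propagation loop. A toy 1-D sketch (tile names and adjacency rules are made up purely for illustration):

```python
import random

# Allowed neighbor pairs (symmetric): sea never touches land directly.
ALLOWED = {("sea", "sea"), ("sea", "coast"), ("coast", "sea"),
           ("coast", "coast"), ("coast", "land"), ("land", "coast"),
           ("land", "land")}
TILES = ["sea", "coast", "land"]

def compatible(a_opts, b_opts):
    """Tiles in b_opts still allowed next to at least one tile in a_opts."""
    return {b for b in b_opts if any((a, b) in ALLOWED for a in a_opts)}

def collapse(n, seed=0):
    rng = random.Random(seed)
    cells = [set(TILES) for _ in range(n)]   # each cell starts fully undecided
    while any(len(c) > 1 for c in cells):
        # WFC's key step: collapse the undecided cell with the fewest options
        i = min((idx for idx, c in enumerate(cells) if len(c) > 1),
                key=lambda idx: len(cells[idx]))
        cells[i] = {rng.choice(sorted(cells[i]))}
        # propagate adjacency constraints outward until nothing changes
        stack = [i]
        while stack:
            j = stack.pop()
            for k in (j - 1, j + 1):
                if 0 <= k < n:
                    reduced = compatible(cells[j], cells[k])
                    if reduced != cells[k]:
                        cells[k] = reduced
                        stack.append(k)
    return [next(iter(c)) for c in cells]
```

Dropping the minimum-entropy selection and the propagation (and just filling cells in scan order from local rules) is roughly the move that turns this into a plain procedural generator.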
]]></description><pubDate>Thu, 23 Jan 2025 15:55:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=42805153</link><dc:creator>tbalsam</dc:creator><comments>https://news.ycombinator.com/item?id=42805153</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42805153</guid></item></channel></rss>