Hacker News: formalsystem

New comment by formalsystem in "Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels"

formalsystem — Wed, 03 Sep 2025 22:18:37 +0000

I work on PyTorch and there are many things that make me suspicious about these results. My TL;DR is that unless we get a zip file of all the kernels along with how they're benchmarked, results like this are almost impossible to verify.

1. I don't have an M4 but I have an M1 Pro, and I tried running the VisionAttention example with the claimed 18x speedup and I get close to identical runtimes. This example has more issues: the main optimization the LLM is doing is a fusion, so not comparing to torch.compile is a bit sus. The numerics are off as well and I suspect the atols were way too big. Finally, MultiHeadAttention is a deprecated API, so using neither SDPA nor torch.compile is a weird choice.

2. In general 18x (and even some 100x speedups claimed near the end) are just a smell that some kernel is incorrect. The typical way you get speedups like this is that you don't warm up or you forget to synchronize. PyTorch has a lot of benchmarking footguns, which is why sharing the exact eval scripts is helpful.
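To make the warmup and synchronize points concrete, here is a minimal timing harness sketch in plain Python. It is not the benchmark from the post: `benchmark` and its parameters are hypothetical names, and `sync` is an optional hook where you would pass something like `torch.cuda.synchronize` or `torch.mps.synchronize` when timing device work.

```python
import time
from statistics import median


def benchmark(fn, *, warmup=10, iters=50, sync=None):
    """Return the median wall-clock time of fn() in seconds.

    sync is an optional callable that blocks until queued device work has
    finished (for example torch.cuda.synchronize or torch.mps.synchronize).
    """
    def wait():
        if sync is not None:
            sync()

    # Warmup: one-time costs (compilation, caches, allocator growth)
    # happen here, outside the timed region.
    for _ in range(warmup):
        fn()
    wait()

    times = []
    for _ in range(iters):
        wait()                      # nothing left over from the previous run
        start = time.perf_counter()
        fn()
        wait()                      # the work must actually be finished
        times.append(time.perf_counter() - start)
    return median(times)


# CPU-only stand-in workload so the sketch runs anywhere.
elapsed = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median: {elapsed * 1e6:.1f} us")
```

Dropping either `wait()` around the timed call on an asynchronous device mostly measures launch time rather than the kernel itself, which is one way to end up with an implausibly large speedup.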

3. Speaking of footguns, the shapes I saw in the examples were tiny. In that regime you're more often measuring noise, as the primary bottleneck is not compute or memory but overhead.

4. Generating many random shapes is also not so safe, since some input distributions can make certain kernels trivial. For example, torch.randn() generates samples from a normal distribution with mean 0 and variance 1, so if you take the mean of a large vector you're almost guaranteed to just get 0, especially if your tolerance is too high.
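A small stdlib-only sketch of that failure mode, assuming an absolute tolerance of 1e-2 (my choice, not a value from the post). `bogus_mean` is a made-up "kernel" that does no work, and `random.gauss(0, 1)` stands in for torch.randn().

```python
import math
import random


def reference_mean(xs):
    return sum(xs) / len(xs)


def bogus_mean(xs):
    # A "kernel" that does no work at all and always returns 0.
    return 0.0


def passes(kernel, xs, atol=1e-2):
    # Absolute-tolerance check only; atol=1e-2 is an assumed, loose value.
    return math.isclose(kernel(xs), reference_mean(xs), rel_tol=0.0, abs_tol=atol)


rng = random.Random(0)
n = 1_000_000
standard = [rng.gauss(0.0, 1.0) for _ in range(n)]  # mean 0, variance 1
shifted = [x + 3.0 for x in standard]               # mean is now about 3

bogus_passes_standard = passes(bogus_mean, standard)
bogus_passes_shifted = passes(bogus_mean, shifted)
print(bogus_passes_standard, bogus_passes_shifted)
```

The do-nothing kernel passes on the zero-mean input and only gets caught once the input distribution is shifted, so a correctness check is only as good as the inputs it is run on.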

5. KernelBench levels measure vastly different things. If you want to compare to PyTorch operators you want to focus on Level 1. Level 2 is fusions, so the right baseline is torch.compile, which is more reliable on nightlies. The Mamba 2 example (which I didn't run) also acknowledges that the primary thing it does is fusions, which, assuming everything is correct, would still be strange to baseline vs eager.

So please, for everyone's sanity, if you find a kernel that's 10-100x faster, share the exact code and benchmarking methodology with your smartest performance friends. You should be extremely skeptical of such results; often you can discard some numbers based on a simple speed of light analysis. We all desperately want faster kernels, but to get them we have to be really fanatical about correctness.
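For reference, a speed of light check can be as simple as the sketch below: a kernel cannot finish faster than its FLOPs divided by peak compute, or its memory traffic divided by peak bandwidth. The device numbers are hypothetical placeholders, not specs for any real chip, and the function names are made up for illustration.

```python
def speed_of_light_seconds(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Lower bound on runtime: limited by compute or by memory traffic."""
    return max(flops / peak_flops, bytes_moved / peak_bandwidth)


# Hypothetical device numbers, for illustration only.
PEAK_FLOPS = 4e12        # 4 TFLOP/s
PEAK_BANDWIDTH = 200e9   # 200 GB/s

# Elementwise add of two fp32 vectors with n elements:
# n FLOPs, and 2 reads plus 1 write of 4 bytes per element.
n = 1 << 24
flops = n
bytes_moved = 3 * 4 * n

bound = speed_of_light_seconds(flops, bytes_moved, PEAK_FLOPS, PEAK_BANDWIDTH)


def is_plausible(measured_seconds):
    # A measurement below the bound means the benchmark is wrong
    # (or the assumed peak numbers are).
    return measured_seconds >= bound


print(f"lower bound: {bound * 1e3:.3f} ms")
```

Any reported time below `bound` can be discarded without even reading the kernel.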

Torch.load flipping default to weights_only=True

formalsystem — Mon, 04 Nov 2024 17:33:14 +0000

Article URL: https://dev-discuss.pytorch.org/t/bc-breaking-change-torch-load-is-being-flipped-to-use-weights-only-true-by-default-in-the-nightlies-after-137602/2573

Comments URL: https://news.ycombinator.com/item?id=42043921

Points: 2

# Comments: 0

New comment by formalsystem in "ThunderKittens: Simple, fast, and adorable AI kernels"

formalsystem — Wed, 30 Oct 2024 23:47:02 +0000

The project is very much focused on maxing out tensor cores and since older GPUs don’t have them it’s not where the project shines best

New comment by formalsystem in "Quantized Llama models with increased speed and a reduced memory footprint"

formalsystem — Fri, 25 Oct 2024 21:07:25 +0000

Please ignore my previous comments - I double checked with the model developers and here's the correction. Vanilla PTQ means no fancy quantization algorithm like SpinQuant, AWQ, etc. was applied. It just applied the same quantization scheme mentioned in the post (4bit per-group with g_size=32 symmetric weight, 8bit dynamic per token activation).
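As a rough illustration of the weight half of that scheme (4-bit, symmetric, one scale per group of 32), here is a plain-Python sketch. The integer range of -8..7 and the `max_abs / 7` scale are my assumptions, not details confirmed by the model developers, and the dynamic per-token activation quantization is not shown.

```python
def quantize_group_symmetric(weights, bits=4):
    """Quantize one group of floats to signed integers with a shared scale."""
    qmax = 2 ** (bits - 1) - 1     # 7 for 4 bits
    qmin = -(2 ** (bits - 1))      # -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    # Symmetric: no zero point, the scale maps max_abs onto qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def quantize_per_group(weights, group_size=32, bits=4):
    """Split a flat weight list into groups and quantize each separately."""
    assert len(weights) % group_size == 0
    return [
        quantize_group_symmetric(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def dequantize_per_group(groups):
    return [qi * scale for q, scale in groups for qi in q]


weights = [(i - 32) / 10 for i in range(64)]   # toy data, two groups of 32
groups = quantize_per_group(weights)
restored = dequantize_per_group(groups)
print(len(groups), "groups, first scale:", groups[0][1])
```

Smaller groups mean each scale covers a narrower range of values, at the cost of storing more scales.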

New comment by formalsystem in "Quantized Llama models with increased speed and a reduced memory footprint"

formalsystem — Fri, 25 Oct 2024 16:48:34 +0000

You can estimate context length impact by doing back of the envelope calculations on KV cache size: 2 * layers * attention heads * head_dim * byte_per_element * batch_size * sequence_length

Some pretty charts here https://github.com/pytorch/ao/issues/539
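The back-of-the-envelope formula above, written out as a small Python sketch. The model configuration is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(layers, heads, head_dim, bytes_per_element,
                   batch_size, sequence_length):
    # The leading 2 accounts for storing both keys and values.
    return (2 * layers * heads * head_dim * bytes_per_element
            * batch_size * sequence_length)


# Hypothetical configuration, for illustration only.
config = dict(layers=16, heads=8, head_dim=64, bytes_per_element=2, batch_size=1)

for seq_len in (2048, 8192, 32768):
    size = kv_cache_bytes(sequence_length=seq_len, **config)
    print(f"{seq_len:>6} tokens -> {size / 2**20:.0f} MiB")
```

The size grows linearly with sequence length, so going from 2048 to 32768 tokens costs 16x the cache. For models that share key/value heads across query heads, `heads` should be the number of heads whose keys and values are actually cached.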

New comment by formalsystem in "Quantized Llama models with increased speed and a reduced memory footprint"

formalsystem — Fri, 25 Oct 2024 05:56:34 +0000

The issue here is that memory in PyTorch is byte addressable, and that's a limitation we can't solve without making a lot more changes to PyTorch. But in your specific case, if you'd like to pack more data into `values` you can use a combination of clever bit shifting, torch.cat and other bit-twiddling PyTorch ops. It's a trick we use quite heavily in torchao.

New comment by formalsystem in "Quantized Llama models with increased speed and a reduced memory footprint"

formalsystem — Fri, 25 Oct 2024 05:53:03 +0000

Not that I know of for this study. At least for the specific scope of torchao, we want to make it easier for researchers to create new quantization algorithms in Python and have those algorithms run fast. You can see a lot of those algorithms here https://github.com/pytorch/ao/tree/main/torchao/prototype

So for example for AWQ and GPTQ we can accelerate them by using a fast int4 kernel called tinygemm

New comment by formalsystem in "Quantized Llama models with increased speed and a reduced memory footprint"

formalsystem — Fri, 25 Oct 2024 04:12:17 +0000

The naming is unfortunate but in this blog QLoRA is referring to Quantization-Aware Training with LoRA adaptor

New comment by formalsystem in "Quantized Llama models with increased speed and a reduced memory footprint"

formalsystem — Fri, 25 Oct 2024 03:40:50 +0000

My wife calls it torch AAAW

New comment by formalsystem in "Quantized Llama models with increased speed and a reduced memory footprint"

formalsystem — Thu, 24 Oct 2024 23:05:58 +0000

So this should be referring to w8a8 (weights and activations in 8 bit)

So this is gonna be 8 bit weights, 8 bit activations, group size of 256, symmetric quantization. Not sure how to map this to the GGUF variants because they don't mention whether or how they do activation quantization

New comment by formalsystem in "Quantized Llama models with increased speed and a reduced memory footprint"

formalsystem — Thu, 24 Oct 2024 22:40:01 +0000

It's particularly useful in memory bound workflows like batch size = 1 LLM inference where you're bottlenecked by how quickly you can send weights to your GPU. This is why at least in torchao we strongly recommend people try out int4 quantization.

At larger batch sizes you become compute bound so quantization matters less and you have to rely on hardware support to accelerate smaller dtypes like fp8
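A quick way to see the memory-bound argument: at batch size 1 every generated token has to read all the weights once, so memory bandwidth divided by weight bytes gives an upper bound on tokens per second. The parameter count and bandwidth below are hypothetical round numbers, and the bound ignores KV cache reads and any compute cost.

```python
def max_tokens_per_second(num_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on batch-size-1 decode speed when weight reads dominate."""
    weight_bytes = num_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / weight_bytes


NUM_PARAMS = 8e9    # hypothetical 8B-parameter model
BANDWIDTH = 1e12    # hypothetical 1 TB/s of memory bandwidth

for name, bits in (("fp16", 16), ("int8", 8), ("int4", 4)):
    bound = max_tokens_per_second(NUM_PARAMS, bits, BANDWIDTH)
    print(f"{name}: <= {bound:.1f} tokens/s")
```

Under these assumptions the bound scales inversely with bits per weight, which is the whole appeal of int4 in this regime. Once the batch is large enough that compute is the limit, this bound stops being the one that matters.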

New comment by formalsystem in "Quantized Llama models with increased speed and a reduced memory footprint"

formalsystem — Thu, 24 Oct 2024 22:37:44 +0000

Hi I'm Mark I work on torchao which was used for the quantization aware training and ARM kernels in this blog. If you have any questions about quantization or performance more generally feel free to let me know!

New comment by formalsystem in "PyTorch Native Architecture Optimization: Torchao"

formalsystem — Tue, 01 Oct 2024 16:10:15 +0000

It's a great question! Int4 is an easy one to understand. PyTorch supports int8 but not int4, so what you can do is "pack" 2 int4 values into a single int8 value. You still get speedups even without hardware support because you're sending less data to the GPU, and workloads like small batch size LLM inference are memory bandwidth bound and not compute bound. So indeed your intuition is correct: you pack the values, and before doing a matmul you "unpack" them back into int8 and then upcast to fp16 to do the matmul.
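Here is the pack/unpack idea as a plain-Python sketch of the bit arithmetic. It is not torchao's implementation: the function names are made up, the values are unsigned (0..15), and putting the first value in the low nibble is an arbitrary choice. On tensors the same shifts and masks would be applied elementwise to uint8 data.

```python
def pack_int4(values):
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first."""
    assert len(values) % 2 == 0
    assert all(0 <= v <= 15 for v in values)
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed):
    """Inverse of pack_int4: split every byte back into two 4-bit values."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)   # low nibble
        out.append(byte >> 4)     # high nibble
    return out


values = [1, 15, 0, 7, 9, 3]
packed = pack_int4(values)
assert unpack_int4(packed) == values
print(len(values), "values in", len(packed), "bytes")
```

The packed form is half the size, which is where the bandwidth saving comes from. Signed int4 values would additionally need an offset or sign extension on unpack.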

New comment by formalsystem in "PyTorch Native Architecture Optimization: Torchao"

formalsystem — Tue, 01 Oct 2024 16:07:19 +0000

we have experimental support for float4 training with the mx formats https://github.com/pytorch/ao/tree/main/torchao/prototype/mx...

But that's waiting for Blackwell to be released so we get the hardware support. So the recommendation for now would be to use either fp8 training or int8 training

New comment by formalsystem in "PyTorch Native Architecture Optimization: Torchao"

formalsystem — Tue, 01 Oct 2024 16:06:22 +0000

yeah indeed choice of language might not be ideal, it seems like 2x language is clearest to folks? I can make some quick edits to the article

New comment by formalsystem in "PyTorch Native Architecture Optimization: Torchao"

formalsystem — Tue, 01 Oct 2024 06:39:26 +0000

It's both! For this blog we decided to discuss our best end user facing numbers to keep things simple. We briefly hint at our contributor guide here https://github.com/pytorch/ao/issues/391 which does a tour of the APIs we provide developers implementing new algorithms

But we have had quantization algorithm developers such as HQQ or Autoround merge their code in to get composability and serialization for free. We view quantization algorithms as the top layer and going down you have quantized tensors, quant primitives like dequant/quant and finally basic dtypes like uint1-7 and float3-8. Personally why I spent so much time on AO was I was hoping we could make it easier for people to express their quantization algorithms in easy to read PyTorch code and if they must use custom kernels we also have some tutorials for how to integrate custom cuda and triton ops.

Most of those discussions have been happening on #torchao on discord.gg/gpumode so if you need to chat back and forth feel free to reach out to the team there otherwise Github also works.

New comment by formalsystem in "PyTorch Native Architecture Optimization: Torchao"

formalsystem — Tue, 01 Oct 2024 03:15:58 +0000

There's different tradeoffs, spinning up a separate repo is what we call "out of core" vs having everything in PyTorch "in core"

Basically PyTorch is a large library where CI takes a long time to run which means merging code is hard and adding new dependencies is challenging and there are stringent constraints on BC breaking changes

Instead what torchao did and many other repos like torchtune, torchchat, torchtitan did was move out of core and it helps keep the core PyTorch library leaner with a smaller binary size and it really lets the team "out of core" focus on optimizing for their needs

Unfortunately the argument for which is better changes over time. For example, torch.compile was initially built out of core, in a new repo called torchdynamo, to move fast, but it was eventually merged back because everyone wanted to use it. torch.compile dev velocity is still quite fast, so now we have to tell people to use nightlies instead of official stable releases, and some people have asked me why we don't move torch.compile out of core

My 2c is the ecosystem will be much stronger and teams can move faster if they develop out of core, so that's the tradeoff we picked for torchao. We managed to, for example, merge a few custom CPP kernels like fp6 or Marlin that would have been challenging to motivate in core since those are still quite experimental and need to stand the test of time.

New comment by formalsystem in "PyTorch Native Architecture Optimization: Torchao"

formalsystem — Tue, 01 Oct 2024 02:44:27 +0000

Mostly comes down to what's fastest to develop, it's faster to write a few custom kernels than it is to develop a new compiler backend

Granted after more upfront effort compilers are just such a significant UX boost that indeed you are making me question why I don't spend more time working on this myself lol

New comment by formalsystem in "PyTorch Native Architecture Optimization: Torchao"

formalsystem — Tue, 01 Oct 2024 02:21:08 +0000

There's a bunch of overhead associated with PTQ - but TL;DR is that much of that overhead goes away when you're using `torch.compile()` and `torchao.autoquant()`

Essentially the latency overhead comes from quantizing and dequantizing weights and activations. For large layers this overhead is small because by quantizing your weights for example you reduce memory bandwidth pressure but for small layers the overhead of potentially looking up a table, reading scaling factors, quantization/dequantization and finally handling zero points might not be worth it.

However, even if such overhead exists you can still quantize your model and get it to be smaller; the problem is that it might not be faster. We solve the speed problem in 2 ways: `torch.compile()` will fuse operations like a dequant and matmul into a single kernel, and `torchao.autoquant()` will do kernel level profiling to see whether a layer is actually made faster when quantizing, and if not it skips quantizing that layer.

New comment by formalsystem in "PyTorch Native Architecture Optimization: Torchao"

formalsystem — Tue, 01 Oct 2024 02:15:11 +0000

Most of our performance relies on leveraging torch.compile which generates Triton kernels which run fast on CPU and GPU but not MPS, since Triton does not support generating Metal kernels. So you lose the nice story of writing low-bit code in pure PyTorch while still getting it to run fast.

In these cases the only path forward we have is writing custom Metal kernels and plugging those in. That work is still ongoing and we'll hopefully have more to share soon.