Hacker News: dsharlet

New comment by dsharlet in "Can the stockmarket swallow Anthropic, SpaceX and OpenAI?"

dsharlet — Tue, 02 Jun 2026 04:37:28 +0000

I can agree with the skeptics that LLM generated code is usually crap. I rarely accept its output without significant edits unless it's truly boilerplate, and I want to avoid the need for that kind of code in the first place.

For me, the killer use case is debugging. I hate wasting time debugging something that should work except for mistakes, and now I do that probably 75% less than I used to because AI does it for me.

I don't know if it makes me that much more productive, but I certainly enjoy my work more not having to do as much tedious debugging, and it feels like I waste a lot less time doing it.

New comment by dsharlet in "Demystifying ARM SME to Optimize General Matrix Multiplications"

dsharlet — Mon, 02 Feb 2026 17:54:03 +0000

Yes I am, you can reach me at dsharlet@gmail.com

New comment by dsharlet in "Demystifying ARM SME to Optimize General Matrix Multiplications"

dsharlet — Sat, 31 Jan 2026 21:02:50 +0000

BLIS doesn't appear to support SME: https://github.com/search?q=repo%3Aflame%2Fblis+mopa&type=co...

Maybe you want a comparison anyways, but it won't be competitive. On Apple CPUs, SME is ~8x faster than a single regular CPU core with a good BLAS library.

New comment by dsharlet in "ML needs a new programming language – Interview with Chris Lattner"

dsharlet — Fri, 05 Sep 2025 18:14:00 +0000

The problem I've seen is this: in order to get good performance, no matter what language you use, you need to understand the hardware and how to use the instructions you want to use. It's not enough to know that you want to use tensor cores or whatever, you also need to understand the myriad low level requirements they have.

Most people that know this kind of thing don't get much value out of using a high level language to do it, and it's a huge risk because if the language fails to generate something that you want, you're stuck until a compiler team fixes and ships a patch which could take weeks or months. Even extremely fast bug fixes are still extremely slow on the timescales people want to work on.

I've spent a lot of my career trying to make high level languages for performance work well, and I've basically decided that the sweet spot for me is C++ templates: I can get the compiler to generate a lot of good code concisely, and when it fails the escape hatch of just writing some architecture specific intrinsics is right there whenever it is needed.
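To make the "escape hatch" idea concrete, here is a minimal sketch, not taken from any real project. The function name `sum` is hypothetical, and the choice of SSE is just an assumption for illustration. The generic template works for any arithmetic type, and a `float` specialization using intrinsics is swapped in only when the target supports them.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Generic version (hypothetical name): the compiler generates the code
// for any arithmetic type T.
template <typename T>
T sum(const T* x, std::size_t n) {
  T result = T(0);
  for (std::size_t i = 0; i < n; ++i) result += x[i];
  return result;
}

#if defined(__SSE__)
// Escape hatch: a hand-written SSE specialization for float, compiled only
// when SSE is available. Other targets fall back to the generic version.
template <>
inline float sum<float>(const float* x, std::size_t n) {
  __m128 acc = _mm_setzero_ps();
  std::size_t i = 0;
  // Accumulate four floats at a time.
  for (; i + 4 <= n; i += 4) acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));
  // Reduce the four lanes to a scalar.
  float lanes[4];
  _mm_storeu_ps(lanes, acc);
  float result = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
  // Handle the remaining 0 to 3 elements.
  for (; i < n; ++i) result += x[i];
  return result;
}
#endif
```

Callers just write `sum(data, n)` either way. Note that the SSE version adds in a different order than the generic loop, so results can differ by rounding for general floating point inputs.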

New comment by dsharlet in "Einsum for Tensor Manipulation"

dsharlet — Sun, 28 Apr 2024 02:30:08 +0000

I wrote a library in C++ (I know, probably a non-starter for most reading this) that I think does most of what you want, as well as some other requests in this thread (generalized to more than just multiply-add): https://github.com/dsharlet/array?tab=readme-ov-file#einstei....

A matrix multiply written with this looks like this:

    enum { i = 2, j = 0, k = 1 };
    auto C = make_ein_sum(ein(A) * ein(B));

Where A and B are 2D arrays. This is strongly typed all the way through, so you get a lot of feedback at compile time, and C is a 2D array object at compile time. It is possible to make C++ template errors reasonable with enable_if and the like; this works well-ish on clang, but not so well in GCC (YMMV).

This library is a lot less automated than most other einsum implementations. You have to explicitly control the loop ordering (in the example above, the `j` loop is innermost because it is loop 0). If you build a good loop order for your shapes, the compiler will probably autovectorize your inner loop, and you'll get pretty good performance. Control over the loop ordering is in general a useful tool, but it's probably a lot lower level than most users want.
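For readers who do not know the library, the loop order described above corresponds roughly to the following plain C++ loop nest. This is a sketch under assumptions, not the code the library generates: the name `matmul_ikj` is hypothetical, and densely packed row-major storage is assumed.

```cpp
#include <cstddef>

// C (M x N) += A (M x K) * B (K x N), all row-major and densely packed
// (an assumption made for this sketch).
// Loop order follows the enum above: i is outermost (loop 2), k is in the
// middle (loop 1), and j is innermost (loop 0).
void matmul_ikj(const float* A, const float* B, float* C,
                std::size_t M, std::size_t K, std::size_t N) {
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t k = 0; k < K; ++k) {
      // A[i][k] is invariant in the inner loop, so it is loaded once.
      const float a = A[i * K + k];
      // The inner loop is stride 1 in both B and C, which is the shape
      // autovectorizers handle best.
      for (std::size_t j = 0; j < N; ++j) {
        C[i * N + j] += a * B[k * N + j];
      }
    }
  }
}
```

With `j` innermost, every memory access in the inner loop is contiguous. Swapping `k` and `j` would turn the inner loop into a strided reduction, which is usually much harder for the compiler to vectorize well.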

New comment by dsharlet in "Benchmarking 20 programming languages on N-queens and matrix multiplication"

dsharlet — Thu, 04 Jan 2024 22:25:30 +0000

I dunno man. My claim was that for specific cases with unique properties, it's not hard to beat BLAS, without getting too exotic with your code. BLAS doesn't have routines for multiplies with non-contiguous data, various patterns of sparsity, mixed precision inputs/outputs, etc. The example I gave is for a specific case close-ish to the case I cared about.

You're changing it to a very different case, presumably one that you cared about, although 4096x4096 is oddly square and a very clean power of 2... I said right at the beginning of this long digression that what is hard about reproducing BLAS is its generality.

New comment by dsharlet in "Benchmarking 20 programming languages on N-queens and matrix multiplication"

dsharlet — Thu, 04 Jan 2024 18:30:47 +0000

BLAS is getting almost exactly 100% of the theoretical peak performance of my machine (CPU frequency * 2 fmadd/cycle * 8 lanes * 2 ops/lane), so it's not slow. I mean, just look at the profiler output...

You're probably now comparing parallel code to single threaded code.
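The peak formula above is simple enough to write down as code. This is only a sketch of the arithmetic: the function name is hypothetical, and the 4.0 GHz clock in the usage note is an assumed value, not the actual frequency of the machine in question.

```cpp
// Theoretical single-core peak in GFLOP/s:
//   clock (GHz) * FMA instructions per cycle * SIMD lanes * ops per FMA.
// An FMA counts as 2 ops (one multiply and one add) per lane.
constexpr double peak_gflops(double clock_ghz, int fma_per_cycle,
                             int lanes, int ops_per_lane) {
  return clock_ghz * fma_per_cycle * lanes * ops_per_lane;
}
```

For example, `peak_gflops(4.0, 2, 8, 2)` gives 128 GFLOP/s for a hypothetical 4.0 GHz core with two 8-lane FMA units. Dividing a measured GFLOP/s figure by this number gives the fraction of peak.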

New comment by dsharlet in "Benchmarking 20 programming languages on N-queens and matrix multiplication"

dsharlet — Thu, 04 Jan 2024 16:30:43 +0000

The makefile asks for -O2 with clang. I find that -O3 almost never helps in clang. (In gcc it does.)

Here's what I see:

   $ clang++ --version
   clang version 18.0.0

   $ time make bin/matrix
   mkdir -p bin
   clang++ -I../../include -I../ -o bin/matrix matrix.cpp  -O2 -march=native -ffast-math -fstrict-aliasing -fno-exceptions -DNDEBUG -DBLAS  -std=c++14 -Wall -lstdc++ -lm -lblas
   1.25user 0.29system 0:02.74elapsed 56%CPU (0avgtext+0avgdata 126996maxresident)k
   159608inputs+120outputs (961major+25661minor)pagefaults 0swaps

   $ bin/matrix
   ...
   reduce_tiles_z_order time: 3.86099 ms, 117.323 GFLOP/s
   blas time: 0.533486 ms, 849.103 GFLOP/s

   $ OMP_NUM_THREADS=1 bin/matrix
   ...
   reduce_tiles_z_order time: 3.89488 ms, 116.303 GFLOP/s
   blas time: 3.49714 ms, 129.53 GFLOP/s

My inner loop in perf: https://gist.github.com/dsharlet/5f51a632d92869d144fc3d6ed6b...

BLAS inner loop in perf (a chunk of it, it is unrolled massively): https://gist.github.com/dsharlet/5b2184a285a798e0f0c6274dc42...

Despite being on a current-ish version of clang, I've been getting similar results from clang for years now.

Anyways, I'm not going to debate any further. It works for me :) If you want to keep writing code the way you have, go for it.

New comment by dsharlet in "Benchmarking 20 programming languages on N-queens and matrix multiplication"

dsharlet — Thu, 04 Jan 2024 01:59:38 +0000

I should have mentioned somewhere, I disabled threading for OpenBLAS, so it is comparing one thread to one thread. Parallelism would be easy to add, but I tend to want the thread parallelism outside code like this anyways.

As for the inner loop not being well optimized... the disassembly looks like the same basic thing as OpenBLAS. There's disassembly in the comments of that file to show what code it generates, I'd love to know what you think is lacking! The only difference between the one I linked and this is prefetching and outer loop ordering: https://github.com/dsharlet/array/blob/master/examples/linea...

New comment by dsharlet in "Benchmarking 20 programming languages on N-queens and matrix multiplication"

dsharlet — Wed, 03 Jan 2024 23:35:01 +0000

This gets to 90% of BLAS: https://github.com/dsharlet/array/blob/38f8ce332fc4e26af0832...

The less involved versions still get ~70%.

But this is also quite general. I’m claiming you can beat BLAS if you have some unique knowledge of the problem that you can exploit. For example, some kinds of sparsity can be implemented within the above example code yet still far outperform the more general sparsity supported by MKL and similar.

New comment by dsharlet in "Benchmarking 20 programming languages on N-queens and matrix multiplication"

dsharlet — Wed, 03 Jan 2024 06:24:06 +0000

This is really overstating how hard it is to compete with matrix multiply libraries. The main reason those libraries are so big and have had so much work invested in them is their generality: they're reasonably fast for almost any kind of inputs.

If you have a specific problem with constraints you can exploit (e.g. known fixed dimensions, sparsity patterns, data layouts, type conversions, etc.), it's not hard at all to beat MKL, etc... if you are using a language like C++. If you are using Python, you have no chance.

It isn't even necessarily that different from a few nested loops. Clang is pretty damn good at autovectorizing, you just have to be a little careful about how you write the code.
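As an illustration of the "known fixed dimensions" case, here is a minimal sketch where the matrix shapes are template parameters. The name `matmul_fixed` is hypothetical, and whether this beats a tuned library for a given shape is something you would have to measure. The point is only that the compiler sees constant trip counts and can fully unroll and vectorize the nest.

```cpp
#include <cstddef>

// C (M x N) += A (M x K) * B (K x N), with all dimensions known at compile
// time. Constant trip counts let the compiler unroll, vectorize, and keep
// small tiles of C in registers without any runtime shape handling.
template <std::size_t M, std::size_t K, std::size_t N>
void matmul_fixed(const float (&A)[M][K], const float (&B)[K][N],
                  float (&C)[M][N]) {
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t k = 0; k < K; ++k) {
      // Inner loop over j is stride 1 in both B and C.
      for (std::size_t j = 0; j < N; ++j) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
}
```

The dimensions are deduced from the array types at the call site, so a call is just `matmul_fixed(A, B, C)`. Building with something like `-O2 -march=native` and checking the disassembly is the way to confirm what the compiler actually did.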

New comment by dsharlet in "Mojo is now available on Mac"

dsharlet — Thu, 19 Oct 2023 18:09:50 +0000

I think for a compiler it makes sense to focus on small matrix multiplies, which are a building block of larger matrix multiplies anyways. Small matrix multiplies emphasize the compiler/code generation quality. Even vanilla python overhead might be insignificant when gluing small-ish matrix multiplies together to do a big multiply.

New comment by dsharlet in "Mojo is now available on Mac"

dsharlet — Thu, 19 Oct 2023 16:54:34 +0000

I tried it out and compared it to C++ at the last release. Here was what I found: https://github.com/dsharlet/mojo_comments

Some of the issues I pointed out there are pretty low hanging fruit for LLVM, so they may have improved a lot in more current releases.

New comment by dsharlet in "Qualcomm’s Hexagon DSP, and Now, NPU"

dsharlet — Thu, 05 Oct 2023 03:58:16 +0000

I don't think it's true at all, you can just write C/C++ with SIMD intrinsics, just like you can on ARM or x86, and the instruction set is mostly awesome. OpenCL would just be an extra layer to get in the way. If you want to just run some existing OpenCL code, I guess that would be nice, but I doubt any OpenCL code written for a GPU would actually run well on Hexagon anyways.

The article also complains about VLIW in the same paragraph, but I don't think VLIW makes things harder, it just makes problems more obvious. If you write ARM or x86 code that has dependencies between every instruction, that's going to suck too, you just won't know it until you run it, but VLIW will make it obvious if you just look at the generated code. For the kinds of programs that make sense to run on a processor like Hexagon, VLIW is fine.
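A small sketch of the dependency problem, in plain C++ rather than anything Hexagon specific (both function names are hypothetical). The first version has a serial dependency between every add. The second has four independent chains, which gives any processor, VLIW or out-of-order, more independent work to schedule.

```cpp
#include <cstddef>

// Every add depends on the result of the previous add, so the loop runs
// at the latency of one add per element no matter how wide the machine is.
float sum_chained(const float* x, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i) acc += x[i];
  return acc;
}

// Four independent accumulators: the adds within one iteration do not
// depend on each other, so they can be issued together.
float sum_independent(const float* x, std::size_t n) {
  float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    a0 += x[i];
    a1 += x[i + 1];
    a2 += x[i + 2];
    a3 += x[i + 3];
  }
  float result = (a0 + a1) + (a2 + a3);
  // Handle the remaining 0 to 3 elements.
  for (; i < n; ++i) result += x[i];
  return result;
}
```

The two versions add in a different order, so they can differ by rounding on general floating point data. That is also why a compiler will not make this transformation for you unless you allow it with something like `-ffast-math`.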

The whole Hexagon environment is just so much better than any of the other similar DSPs I'm aware of: you can use open source LLVM to compile code for it (so you aren't stuck with an old version of GCC), and the OS is much closer to standard (e.g. thread synchronization is just pthreads).

I did a bunch of work on Hexagon and I like it a lot. It is my favorite in its class.

New comment by dsharlet in "SiFive’s P870 Takes RISC-V Further"

dsharlet — Mon, 04 Sep 2023 22:08:50 +0000

> The only case I can currently fore see where using LMUL=1 and manually unrolling instead will likely be always beneficial is vrgather operations that don't need to cross between registers in a register group (e.g. byte swapping).

What about algorithms where register pressure is an issue?

I think the problem with LMUL is it assumes that you always want to unroll the innermost dimension (where the vector loads are stride 1). That's usually the last dimension I try to unroll, if there are any registers left over. If there is any sharing of data across any other dimension in the algorithm, it's better to tile/unroll those first.

Of course, for a simple algorithm, there will be registers left over. But I think more interesting algorithms will struggle on RVV if you must use LMUL > 1 for performance.

New comment by dsharlet in "A basic introduction to NumPy's einsum"

dsharlet — Sun, 10 Apr 2022 01:54:08 +0000

If you are looking for something like this in C++, here's my attempt at implementing it: https://github.com/dsharlet/array#einstein-reductions

It doesn't do any automatic optimization of the loops like some of the projects linked in this thread, but, it provides all the tools needed for humans to express the code in a way that a good compiler can turn it into really good code.

New comment by dsharlet in "A basic introduction to NumPy's einsum"

dsharlet — Sun, 10 Apr 2022 01:51:49 +0000

Compilers can be pretty good if you help them out a bit. Here's my implementation of Einstein reductions (including summations) in C++, which generate pretty close to ideal code until you start getting into processor architecture specific optimizations: https://github.com/dsharlet/array#einstein-reductions

New comment by dsharlet in "Deep Learning for Guitar Effect Emulation"

dsharlet — Tue, 12 May 2020 02:52:42 +0000

Funny you mention SPICE to VST compilation... It was on my list for this (my) side project but I never got around to it: http://livespice.org/

edit: And a Tubescreamer is one of the examples!

New comment by dsharlet in "A half century ago, better transistors revolutionized computer power supplies"

dsharlet — Mon, 10 Feb 2020 04:51:49 +0000

> which introduces some frequency/phase distortion

I agree with everything you said here, except this. Can you explain this? Why does oversampling introduce distortion?

New comment by dsharlet in "MacOS may lose data on APFS-formatted disk images"

dsharlet — Sun, 18 Feb 2018 00:44:44 +0000

The rest of the post after the question mark you stopped reading at:

> Filesystem corruption is frequently silent, and every-time it happens customers don't get on the phone and send the disks to apple so that they can root cause the problem. Its quite possible this bug has happened an untold number of times before it happened to someone who went through the effort to reproduce and isolate it.