<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: dsharlet</title><link>https://news.ycombinator.com/user?id=dsharlet</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 13 Jun 2026 06:27:43 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=dsharlet" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by dsharlet in "Can the stockmarket swallow Anthropic, SpaceX and OpenAI?"]]></title><description><![CDATA[
<p>I can agree with the skeptics that LLM generated code is usually crap. I rarely accept its output without significant edits unless it's truly boilerplate, and I want to avoid the need for that kind of code in the first place.<p>For me, the killer use case is debugging. I hate wasting time debugging something that should work except for mistakes, and now I do that probably 75% less than I used to because AI does it for me.<p>I don't know if it makes me that much more productive, but I certainly enjoy my work more not having to do as much tedious debugging, and it feels like I waste a lot less time doing it.</p>
]]></description><pubDate>Tue, 02 Jun 2026 04:37:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=48366115</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=48366115</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48366115</guid></item><item><title><![CDATA[New comment by dsharlet in "Demystifying ARM SME to Optimize General Matrix Multiplications"]]></title><description><![CDATA[
<p>Yes I am, you can reach me at dsharlet@gmail.com</p>
]]></description><pubDate>Mon, 02 Feb 2026 17:54:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=46858946</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=46858946</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46858946</guid></item><item><title><![CDATA[New comment by dsharlet in "Demystifying ARM SME to Optimize General Matrix Multiplications"]]></title><description><![CDATA[
<p>BLIS doesn't appear to support SME: <a href="https://github.com/search?q=repo%3Aflame%2Fblis+mopa&type=code" rel="nofollow">https://github.com/search?q=repo%3Aflame%2Fblis+mopa&type=co...</a><p>Maybe you want a comparison anyways, but it won't be competitive. On Apple CPUs, SME is ~8x faster than a single regular CPU core with a good BLAS library.</p>
]]></description><pubDate>Sat, 31 Jan 2026 21:02:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=46840824</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=46840824</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46840824</guid></item><item><title><![CDATA[New comment by dsharlet in "ML needs a new programming language – Interview with Chris Lattner"]]></title><description><![CDATA[
<p>The problem I've seen is this: in order to get good performance, no matter what language you use, you need to understand the hardware and how to use the instructions you want to use. It's not enough to know that you want to use tensor cores or whatever, you also need to understand the myriad low level requirements they have.<p>Most people that know this kind of thing don't get much value out of using a high level language to do it, and it's a huge risk because if the language fails to generate something that you want, you're stuck until a compiler team fixes and ships a patch which could take weeks or months. Even extremely fast bug fixes are still extremely slow on the timescales people want to work on.<p>I've spent a lot of my career trying to make high level languages for performance work well, and I've basically decided that the sweet spot for me is C++ templates: I can get the compiler to generate a lot of good code concisely, and when it fails the escape hatch of just writing some architecture specific intrinsics is right there whenever it is needed.</p>
]]></description><pubDate>Fri, 05 Sep 2025 18:14:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=45141743</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=45141743</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45141743</guid></item><item><title><![CDATA[New comment by dsharlet in "Einsum for Tensor Manipulation"]]></title><description><![CDATA[
<p>I wrote a library in C++ (I know, probably a non-starter for most reading this) that I think does most of what you want, as well as some other requests in this thread (generalized to more than just multiply-add): <a href="https://github.com/dsharlet/array?tab=readme-ov-file#einstein-reductions">https://github.com/dsharlet/array?tab=readme-ov-file#einstei...</a>.<p>A matrix multiply written with this looks like this:<p><pre><code>    enum { i = 2, j = 0, k = 1 };
    auto C = make_ein_sum<float, i, j>(ein<i, k>(A) * ein<k, j>(B));
</code></pre>
Where A and B are 2D arrays. This is strongly typed all the way through, so you get a lot of feedback at compile time, and C is known to be a 2D array object at compile time. It is <i>possible</i> to make C++ template errors reasonable with enable_if and the like; this works well-ish on clang, but not so well in GCC (YMMV).<p>This library is a lot less automated than most other einsum implementations. You have to explicitly control the loop ordering (in the example above, the `j` loop is innermost because it is loop 0). If you build a good loop order for your shapes, the compiler will probably autovectorize your inner loop, and you'll get pretty good performance. Control over the loop ordering is in general a useful tool, but it's probably a lot lower level than most users want.</p>
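<p>To make the loop ordering concrete, here is a minimal sketch in plain C++ of the loop nest that ordering describes. It does not use the library's API. The function name <code>matmul_ikj</code>, the flat row-major <code>std::vector</code> storage, and the reading that higher loop numbers are further out (so <code>i</code> is outermost and <code>k</code> is in the middle) are all assumptions made for illustration.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of any library: C = A * B for row-major
// matrices stored in flat vectors. A is M x K, B is K x N, C is M x N.
// Loop order is i (outermost), k, j (innermost).
std::vector<float> matmul_ikj(const std::vector<float>& A,
                              const std::vector<float>& B,
                              std::size_t M, std::size_t K, std::size_t N) {
  assert(A.size() == M * K);
  assert(B.size() == K * N);
  std::vector<float> C(M * N, 0.0f);
  for (std::size_t i = 0; i < M; ++i) {      // assumed to correspond to loop 2
    for (std::size_t k = 0; k < K; ++k) {    // assumed to correspond to loop 1
      const float a = A[i * K + k];          // invariant across the inner loop
      for (std::size_t j = 0; j < N; ++j) {  // loop 0: innermost
        // B and C are both read with stride 1 in j, which is the access
        // pattern a compiler can usually autovectorize.
        C[i * N + j] += a * B[k * N + j];
      }
    }
  }
  return C;
}
```

<p>Putting <code>j</code> innermost keeps the inner loop free of a reduction and makes every access contiguous. With <code>k</code> innermost instead, the inner loop would be a dot product that reads B with stride N.</p>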
]]></description><pubDate>Sun, 28 Apr 2024 02:30:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=40185504</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=40185504</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40185504</guid></item><item><title><![CDATA[New comment by dsharlet in "Benchmarking 20 programming languages on N-queens and matrix multiplication"]]></title><description><![CDATA[
<p>I dunno man. My claim was that for <i>specific cases</i> with unique properties, it's not hard to beat BLAS, without getting too exotic with your code. BLAS doesn't have routines for multiplies with non-contiguous data, various patterns of sparsity, mixed precision inputs/outputs, etc. The example I gave is for a specific case close-ish to the case I cared about.<p>You're changing it to a very different case, presumably one that you cared about, although 4096x4096 is oddly square and a very clean power of 2... I said right at the beginning of this long digression that what is hard about reproducing BLAS is its generality.</p>
]]></description><pubDate>Thu, 04 Jan 2024 22:25:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=38873144</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=38873144</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38873144</guid></item><item><title><![CDATA[New comment by dsharlet in "Benchmarking 20 programming languages on N-queens and matrix multiplication"]]></title><description><![CDATA[
<p>BLAS is getting almost exactly 100% of the theoretical peak performance of my machine (CPU frequency * 2 fmadd/cycle * 8 lanes * 2 ops/lane), so it's not slow. I mean, just look at the profiler output...<p>You're probably now comparing parallel code to single threaded code.</p>
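<p>The peak formula above can be written out as a small calculation. This is only a sketch: the function names are made up for illustration, and the 4.0 GHz clock in the usage note is a hypothetical value, since the comment does not state the machine's frequency.</p>

```cpp
// Theoretical single-core peak in GFLOP/s:
//   frequency (GHz) * FMA instructions per cycle * SIMD lanes * 2 ops per lane
// (each fused multiply-add counts as one multiply plus one add).
constexpr double peak_gflops(double freq_ghz, int fma_per_cycle, int lanes) {
  return freq_ghz * fma_per_cycle * lanes * 2.0;
}

// Fraction of the theoretical peak reached by a measured throughput.
constexpr double fraction_of_peak(double measured_gflops, double peak) {
  return measured_gflops / peak;
}
```

<p>For example, with an assumed 4.0 GHz clock, <code>peak_gflops(4.0, 2, 8)</code> evaluates to 128 GFLOP/s for one core.</p>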
]]></description><pubDate>Thu, 04 Jan 2024 18:30:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=38870521</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=38870521</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38870521</guid></item><item><title><![CDATA[New comment by dsharlet in "Benchmarking 20 programming languages on N-queens and matrix multiplication"]]></title><description><![CDATA[
<p>The makefile asks for -O2 with clang. I find that -O3 almost never helps in clang. (In gcc it does.)<p>Here's what I see:<p><pre><code>   $ clang++ --version
   clang version 18.0.0

   $ time make bin/matrix
   mkdir -p bin
   clang++ -I../../include -I../ -o bin/matrix matrix.cpp  -O2 -march=native -ffast-math -fstrict-aliasing -fno-exceptions -DNDEBUG -DBLAS  -std=c++14 -Wall -lstdc++ -lm -lblas
   1.25user 0.29system 0:02.74elapsed 56%CPU (0avgtext+0avgdata 126996maxresident)k
   159608inputs+120outputs (961major+25661minor)pagefaults 0swaps

   $ bin/matrix
   ...
   reduce_tiles_z_order time: 3.86099 ms, 117.323 GFLOP/s
   blas time: 0.533486 ms, 849.103 GFLOP/s

   $ OMP_NUM_THREADS=1 bin/matrix
   ...
   reduce_tiles_z_order time: 3.89488 ms, 116.303 GFLOP/s
   blas time: 3.49714 ms, 129.53 GFLOP/s
</code></pre>
My inner loop in perf: <a href="https://gist.github.com/dsharlet/5f51a632d92869d144fc3d6ed6b0fdad" rel="nofollow">https://gist.github.com/dsharlet/5f51a632d92869d144fc3d6ed6b...</a>
BLAS inner loop in perf (a chunk of it, it is unrolled massively): <a href="https://gist.github.com/dsharlet/5b2184a285a798e0f0c6274dc422392d" rel="nofollow">https://gist.github.com/dsharlet/5b2184a285a798e0f0c6274dc42...</a><p>Despite being on a current-ish version of clang, I've been getting similar results from clang for years now.<p>Anyways, I'm not going to debate any further. It works for me :) If you want to keep writing code the way you have, go for it.</p>
]]></description><pubDate>Thu, 04 Jan 2024 16:30:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=38868948</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=38868948</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38868948</guid></item><item><title><![CDATA[New comment by dsharlet in "Benchmarking 20 programming languages on N-queens and matrix multiplication"]]></title><description><![CDATA[
<p>I should have mentioned somewhere, I disabled threading for OpenBLAS, so it is comparing one thread to one thread. Parallelism would be easy to add, but I tend to want the thread parallelism outside code like this anyways.<p>As for the inner loop not being well optimized... the disassembly looks like the same basic thing as OpenBLAS. There's disassembly in the comments of that file to show what code it generates, I'd love to know what you think is lacking! The only difference between the one I linked and this is prefetching and outer loop ordering: <a href="https://github.com/dsharlet/array/blob/master/examples/linear_algebra/matrix.cpp#L133">https://github.com/dsharlet/array/blob/master/examples/linea...</a></p>
]]></description><pubDate>Thu, 04 Jan 2024 01:59:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=38862207</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=38862207</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38862207</guid></item><item><title><![CDATA[New comment by dsharlet in "Benchmarking 20 programming languages on N-queens and matrix multiplication"]]></title><description><![CDATA[
<p>This gets to 90% of BLAS: <a href="https://github.com/dsharlet/array/blob/38f8ce332fc4e26af08325ad0654c8516a445e8c/examples/linear_algebra/matrix.cpp#L271">https://github.com/dsharlet/array/blob/38f8ce332fc4e26af0832...</a><p>The less involved versions still get ~70%.<p>But this is also quite general. I’m claiming you can beat BLAS if you have some unique knowledge of the problem that you can exploit. For example, some kinds of sparsity can be implemented within the above example code yet still far outperform the more general sparsity supported by MKL and similar.</p>
]]></description><pubDate>Wed, 03 Jan 2024 23:35:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=38861265</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=38861265</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38861265</guid></item><item><title><![CDATA[New comment by dsharlet in "Benchmarking 20 programming languages on N-queens and matrix multiplication"]]></title><description><![CDATA[
<p>This is really overstating how hard it is to compete with matrix multiply libraries. The main reason those libraries are so big and have had so much work invested in them is their generality: they're reasonably fast for almost any kind of input.<p>If you have a specific problem with constraints you can exploit (e.g. known fixed dimensions, sparsity patterns, data layouts, type conversions, etc.), it's not hard at all to beat MKL, etc... <i>if</i> you are using a language like C++. If you are using python, you have no chance.<p>It isn't even necessarily that different from a few nested loops. Clang is pretty damn good at autovectorizing, you just have to be a little careful about how you write the code.</p>
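<p>As one illustration of exploiting known fixed dimensions, here is a minimal sketch of a matrix multiply whose sizes are template parameters. The name <code>matmul_fixed</code> and the <code>std::array</code> row-major storage are assumptions for the example, not code from any of the libraries mentioned. It makes no performance claim; it only shows where the compile-time information enters.</p>

```cpp
#include <array>
#include <cstddef>

// Hypothetical kernel: C += A * B where every dimension is a compile-time
// constant. A is M x K, B is K x N, C is M x N, all row-major.
// Because the trip counts are constants, the compiler is free to fully
// unroll the loops and vectorize without emitting remainder handling.
template <std::size_t M, std::size_t K, std::size_t N>
void matmul_fixed(const std::array<float, M * K>& A,
                  const std::array<float, K * N>& B,
                  std::array<float, M * N>& C) {
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t k = 0; k < K; ++k) {
      const float a = A[i * K + k];
      for (std::size_t j = 0; j < N; ++j) {
        C[i * N + j] += a * B[k * N + j];  // stride-1 inner loop
      }
    }
  }
}
```

<p>The dimensions cannot be deduced from the array sizes, so a call names them explicitly, e.g. <code>matmul_fixed&lt;2, 2, 2&gt;(A, B, C)</code>. Note that the kernel accumulates into C, so C should start zeroed if a plain product is wanted.</p>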
]]></description><pubDate>Wed, 03 Jan 2024 06:24:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=38851344</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=38851344</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38851344</guid></item><item><title><![CDATA[New comment by dsharlet in "Mojo is now available on Mac"]]></title><description><![CDATA[
<p>I think for a compiler it makes sense to focus on small matrix multiplies, which are a building block of larger matrix multiplies anyways. Small matrix multiplies emphasize the compiler/code generation quality. Even vanilla python overhead might be insignificant when gluing small-ish matrix multiplies together to do a big multiply.</p>
]]></description><pubDate>Thu, 19 Oct 2023 18:09:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=37946363</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=37946363</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37946363</guid></item><item><title><![CDATA[New comment by dsharlet in "Mojo is now available on Mac"]]></title><description><![CDATA[
<p>I tried it out and compared it to C++ at the last release. Here was what I found: <a href="https://github.com/dsharlet/mojo_comments">https://github.com/dsharlet/mojo_comments</a><p>Some of the issues I pointed out there are pretty low hanging fruit for LLVM, so they may have improved a lot in more current releases.</p>
]]></description><pubDate>Thu, 19 Oct 2023 16:54:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=37945346</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=37945346</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37945346</guid></item><item><title><![CDATA[New comment by dsharlet in "Qualcomm’s Hexagon DSP, and Now, NPU"]]></title><description><![CDATA[
<p>I don't think it's true at all, you can just write C/C++ with SIMD intrinsics, just like you can on ARM or x86, and the instruction set is mostly awesome. OpenCL would just be an extra layer to get in the way. If you want to just run some existing OpenCL code, I guess that would be nice, but I doubt any OpenCL code written for a GPU would actually run well on Hexagon anyways.<p>The article also complains about VLIW in the same paragraph, but I don't think VLIW makes things harder, it just makes problems more obvious. If you write ARM or x86 code that has dependencies between every instruction, that's going to suck too, you just won't know it until you run it, but VLIW will make it obvious if you just look at the generated code. For the kinds of programs that make sense to run on a processor like Hexagon, VLIW is fine.<p>The whole Hexagon environment is just so much better than any of the other similar DSPs I'm aware of: you can use open source LLVM to compile code for it (so you aren't stuck with an old version of GCC), and the OS is much closer to standard (e.g. thread synchronization is just pthreads).<p>I did a bunch of work on Hexagon and I like it a lot. It is my favorite in its class.</p>
]]></description><pubDate>Thu, 05 Oct 2023 03:58:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=37774776</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=37774776</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37774776</guid></item><item><title><![CDATA[New comment by dsharlet in "SiFive’s P870 Takes RISC-V Further"]]></title><description><![CDATA[
<p>> The only case I can currently fore see where using LMUL=1 and manually unrolling instead will likely be always beneficial is vrgather operations that don't need to cross between registers in a register group (e.g. byte swapping).<p>What about algorithms where register pressure is an issue?<p>I think the problem with LMUL is it assumes that you always want to unroll the innermost dimension (where the vector loads are stride 1). That's usually the <i>last</i> dimension I try to unroll, if there are any registers left over. If there is any sharing of data across any other dimension in the algorithm, it's better to tile/unroll those first.<p>Of course, for a simple algorithm, there will be registers left over. But I think more interesting algorithms will struggle on RVV if you <i>must</i> use LMUL > 1 for performance.</p>
]]></description><pubDate>Mon, 04 Sep 2023 22:08:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=37385602</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=37385602</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37385602</guid></item><item><title><![CDATA[New comment by dsharlet in "A basic introduction to NumPy's einsum"]]></title><description><![CDATA[
<p>If you are looking for something like this in C++, here's my attempt at implementing it: <a href="https://github.com/dsharlet/array#einstein-reductions" rel="nofollow">https://github.com/dsharlet/array#einstein-reductions</a><p>It doesn't do any automatic optimization of the loops like some of the projects linked in this thread, but, it provides all the tools needed for humans to express the code in a way that a good compiler can turn it into really good code.</p>
]]></description><pubDate>Sun, 10 Apr 2022 01:54:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=30973806</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=30973806</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=30973806</guid></item><item><title><![CDATA[New comment by dsharlet in "A basic introduction to NumPy's einsum"]]></title><description><![CDATA[
<p>Compilers can be pretty good if you help them out a bit. Here's my implementation of Einstein reductions (including summations) in C++, which generate pretty close to ideal code until you start getting into processor architecture specific optimizations: <a href="https://github.com/dsharlet/array#einstein-reductions" rel="nofollow">https://github.com/dsharlet/array#einstein-reductions</a></p>
]]></description><pubDate>Sun, 10 Apr 2022 01:51:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=30973795</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=30973795</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=30973795</guid></item><item><title><![CDATA[New comment by dsharlet in "Deep Learning for Guitar Effect Emulation"]]></title><description><![CDATA[
<p>Funny you mention SPICE to VST compilation... It was on my list for this (my) side project but I never got around to it: <a href="http://livespice.org/" rel="nofollow">http://livespice.org/</a><p>edit: And a Tubescreamer is one of the examples!</p>
]]></description><pubDate>Tue, 12 May 2020 02:52:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=23150106</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=23150106</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=23150106</guid></item><item><title><![CDATA[New comment by dsharlet in "A half century ago, better transistors revolutionized computer power supplies"]]></title><description><![CDATA[
<p>> which introduces some frequency/phase distortion<p>I agree with everything you said here, except this. Can you explain this? Why does oversampling introduce distortion?</p>
]]></description><pubDate>Mon, 10 Feb 2020 04:51:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=22286501</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=22286501</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=22286501</guid></item><item><title><![CDATA[New comment by dsharlet in "MacOS may lose data on APFS-formatted disk images"]]></title><description><![CDATA[
<p>The rest of the post after the question mark you stopped reading at:<p>> Filesystem corruption is frequently silent, and every-time it happens customers don't get on the phone and send the disks to apple so that they can root cause the problem. Its quite possible this bug has happened an untold number of times before it happened to someone who went through the effort to reproduce and isolate it.</p>
]]></description><pubDate>Sun, 18 Feb 2018 00:44:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=16403739</link><dc:creator>dsharlet</dc:creator><comments>https://news.ycombinator.com/item?id=16403739</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=16403739</guid></item></channel></rss>