<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: Remnant44</title><link>https://news.ycombinator.com/user?id=Remnant44</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 04 May 2026 20:56:57 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=Remnant44" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by Remnant44 in "The happiest I've ever been"]]></title><description><![CDATA[
<p>There's a whole lot of us out there. I don't know if there's still a future in the thing that I love, which is where all the malaise comes from.</p>
]]></description><pubDate>Sat, 28 Feb 2026 20:02:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=47199580</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=47199580</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47199580</guid></item><item><title><![CDATA[New comment by Remnant44 in "AMD64 Bit Matrix Multiply and Bit Reversal Instructions"]]></title><description><![CDATA[
<p>I love me some ISA extensions; I'd love to know what these are intended and useful for, though. 1-bit inference? I hear they could be useful in crypto as well, but that's out of my field.</p>
]]></description><pubDate>Sun, 01 Feb 2026 03:55:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=46843471</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=46843471</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46843471</guid></item><item><title><![CDATA[New comment by Remnant44 in "Tell HN: Bending Spoons laid off almost everybody at Vimeo yesterday"]]></title><description><![CDATA[
<p>You may be correct in the sense that, in a public acquisition statement, people should be inferring enormous context and not taking anything said at face value.<p>It's simultaneously true that this is the farthest thing from effective, honest, and clear communication. Reading between the lines here is required precisely because we all know that any acquisition statements made are, at best, heavily coded, if not complete fluff.<p>You can recognize that and still get angry that it's par for the course for such things to be not just devoid of useful information, but often actively deceptive.</p>
]]></description><pubDate>Wed, 21 Jan 2026 22:06:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=46712268</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=46712268</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46712268</guid></item><item><title><![CDATA[New comment by Remnant44 in "AVX-512: First Impressions on Performance and Programmability"]]></title><description><![CDATA[
<p>For what it's worth, I had the exact same experience you did when I started writing SIMD code explicitly with intrinsics.<p>I avoided it for a long time because, well, it was so damn ugly and verbose to do simple things. However, in actual practice it's not nearly as painful as it looks, and you get used to it quickly.</p>
]]></description><pubDate>Mon, 19 Jan 2026 19:05:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=46683059</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=46683059</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46683059</guid></item><item><title><![CDATA[New comment by Remnant44 in "AVX-512: First Impressions on Performance and Programmability"]]></title><description><![CDATA[
<p>AVX doesn't require alignment of any memory operands, with the exception of the specific aligned-load instructions. So you/the compiler are free to use the reg,mem form interchangeably with unaligned data.<p>The penalty on modern machines is an extra cycle of latency and, when crossing a cacheline, half the throughput (an unaligned AVX512 access always crosses a cacheline, since the registers are cacheline-sized!). These are pretty mild penalties given what you gain. So while it's true that peak L1 cache performance comes when everything is aligned, the bottleneck is elsewhere for most real code.</p>
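To make the cacheline-crossing rule concrete, here's a minimal portable C check (my own illustration, names are not from any API):

```c
#include <stdint.h>
#include <stddef.h>

/* Returns 1 if a load of `width` bytes starting at byte address `addr`
   crosses a 64-byte cacheline boundary (the half-throughput case). */
static int crosses_cacheline(uintptr_t addr, size_t width) {
    return (addr / 64) != ((addr + width - 1) / 64);
}
```

A 64-byte (512-bit) access crosses unless addr % 64 == 0, which is why the unaligned case always pays the split-line cost for full-width AVX512 loads.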
]]></description><pubDate>Mon, 19 Jan 2026 18:46:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=46682856</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=46682856</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46682856</guid></item><item><title><![CDATA[New comment by Remnant44 in "AVX-512: First Impressions on Performance and Programmability"]]></title><description><![CDATA[
<p>There are many situations where your data is essentially _majority_ unaligned. Considerable effort by the hardware guys has gone into making that situation work well.<p>A great example would be convolution-kernel style code - with AVX512 you are using 64 bytes at a time (a whole cacheline), and sampling a ±N element neighborhood around a pixel. By definition most of those reads will be unaligned!<p>A lot of other great use cases for SIMD don't let you dictate the buffer alignment. If the code is constrained by bandwidth over compute, I have found it worth doing a head/body/tail arrangement where you do one misaligned iteration before doing the bulk of the work aligned. But honestly, for that to pay off you have to be working almost completely out of L1 cache, which is rare... otherwise you're going to be slowed to L2 or memory speed anyway, at which point the half-rate penalty doesn't really matter.<p>The early SSE-style instructions often favored making two aligned reads and then extracting your sliding window from that, but there's just no point doing that on modern hardware - it will be slower.</p>
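A minimal sketch of the head/body/tail pattern in plain C (my own illustration: the 64-byte vector operation is modeled with memcpy for portability, and overlap between head/tail and body is fine because a copy is idempotent):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Head/body/tail copy: one possibly-misaligned 64-byte head, then
   destination-aligned 64-byte body chunks, then an overlapping tail. */
static void copy64_hbt(uint8_t* dst, const uint8_t* src, size_t n) {
    if (n < 64) {                           /* too small for the pattern */
        memcpy(dst, src, n);
        return;
    }
    memcpy(dst, src, 64);                   /* head: one misaligned iteration */
    size_t i = 64 - ((uintptr_t)dst % 64);  /* first 64-aligned dst offset    */
    for (; i + 64 <= n; i += 64)
        memcpy(dst + i, src + i, 64);       /* body: aligned chunks           */
    memcpy(dst + n - 64, src + n - 64, 64); /* tail: last (overlapping) chunk */
}
```

With real SIMD you'd swap the memcpy calls for unaligned loads/stores in the head/tail and aligned ones in the body.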
]]></description><pubDate>Mon, 19 Jan 2026 07:17:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=46675870</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=46675870</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46675870</guid></item><item><title><![CDATA[New comment by Remnant44 in "AVX-512: First Impressions on Performance and Programmability"]]></title><description><![CDATA[
<p>Which, honestly, shouldn't be necessary today with AVX512. There's essentially no reason to prefer the aligned load/store instructions over the unaligned ones - if the actual pointer is unaligned they still function correctly at half the throughput, while if it _is_ aligned you get the same performance as the aligned-only load.<p>There's no reason for the compiler to balk at vectorizing unaligned data these days.</p>
]]></description><pubDate>Mon, 19 Jan 2026 04:53:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=46675181</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=46675181</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46675181</guid></item><item><title><![CDATA[New comment by Remnant44 in "The state of SIMD in Rust in 2025"]]></title><description><![CDATA[
<p>In practical SIMD use: various min/max operations. On Intel at least, they propagate NaN or not based on operand order.</p>
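To spell that out, here's a scalar model of the x86 minps selection rule (my own sketch: the instruction returns the second source operand whenever either input is NaN, because all comparisons against NaN are false):

```c
#include <math.h>

/* Scalar model of per-lane x86 minps:
   returns src1 only when src1 < src2; any comparison involving NaN is
   false, so a NaN in either operand yields src2. */
static float minps_model(float src1, float src2) {
    return (src1 < src2) ? src1 : src2;
}
```

So min(x, NaN) hands the NaN through while min(NaN, x) quietly drops it; flipping the operand order flips whether NaN propagates.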
]]></description><pubDate>Thu, 06 Nov 2025 09:04:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=45833049</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=45833049</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45833049</guid></item><item><title><![CDATA[New comment by Remnant44 in "Better SRGB to Greyscale Conversion"]]></title><description><![CDATA[
<p>I've run into this as well. The problem is that linear RGB is most definitely not a perceptually uniform space, so blending in it frequently does something different than you want. Use linear for physically based light and mixing, but if you are modeling an operation based on human perception, it is going to be completely wrong.<p>The dark irony, then, is that sRGB with its gamma curve applied models luminance better (closer to human perception) for blending than linear does. If you can afford to do the blend in a perceptually uniform space like oklab, even better, of course.</p>
]]></description><pubDate>Sun, 19 Oct 2025 23:04:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=45638826</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=45638826</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45638826</guid></item><item><title><![CDATA[New comment by Remnant44 in "Apple M5 chip"]]></title><description><![CDATA[
<p>Essentially every other use case for a computer.<p>Whether you're playing games, or editing videos, or doing 3D work, or trying to digest the latest bloated react mess on some website.. ;)</p>
]]></description><pubDate>Wed, 15 Oct 2025 18:33:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=45596675</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=45596675</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45596675</guid></item><item><title><![CDATA[New comment by Remnant44 in "Daniel Kahneman opted for assisted suicide in Switzerland"]]></title><description><![CDATA[
<p>I've had just the smallest touch of this caring for my elderly parents, and you have my deep empathy. It's exhausting and really really hard.</p>
]]></description><pubDate>Sun, 12 Oct 2025 09:05:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=45556658</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=45556658</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45556658</guid></item><item><title><![CDATA[New comment by Remnant44 in "Why we need SIMD"]]></title><description><![CDATA[
<p>Totally - especially given how bandwidth-constrained CPUs still are, going wider than 512 doesn't make much sense. 512 itself was a stretch for quite a long time (and all the negative press on the original implementations was a consequence of being not-quite-ready for primetime), but for current hardware I think it's perfect.<p>But 128-bit is just ancient. If you're going to go to significant trouble to rewrite your code in SIMD, you want to at least get a decent perf return on investment!</p>
]]></description><pubDate>Wed, 08 Oct 2025 23:15:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=45521710</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=45521710</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45521710</guid></item><item><title><![CDATA[New comment by Remnant44 in "Why we need SIMD"]]></title><description><![CDATA[
<p>Sure.. in detail and abstracted slightly, the byte table problem:<p>Maybe you're remapping RGB values [0..255] with a tone curve in graphics, or doing a mapping lookup of IDs to indexes in a set, or a permutation table, or .. well, there's a lot of use cases, right? This is essentially an arbitrary function lookup where the domain and range are bytes.<p>It looks like this in scalar code:<p><pre><code>  void transform_lut(uint8_t* dest, const uint8_t* src, int size, const uint8_t* lut) {
    for (int i = 0; i < size; i++) {
      dest[i] = lut[src[i]];
    }
  }
</code></pre>The function above is basically load/store limited - it does negligible arithmetic: load a byte from the source, use it to index a load from the table, and store the result to the destination. So two loads and a store per element. Zen5 has 4 load pipes and 2 store pipes, so our CPU can do two elements per cycle in scalar code. (Zen4 has only 1 store pipe, so 1 per cycle there.)<p>Here's a snippet of the AVX512 version.<p>You load the 256-byte lookup table into 4 registers outside the loop:<p><pre><code>  __m512i p0, p1, p2, p3;
  p0 = _mm512_loadu_si512(lut);
  p1 = _mm512_loadu_si512(lut + 64);
  p2 = _mm512_loadu_si512(lut + 128);
  p3 = _mm512_loadu_si512(lut + 192);
</code></pre>
Then, for each SIMD vector of 64 elements, use each lane's value as an index into the lookup table, just like the scalar version. Since each two-register shuffle can only index 128 bytes, we DO have to do it twice - once for the lower half of the table and again for the upper half - and use a mask to choose between them on a per-element basis (x here is the source vector):<p><pre><code>  auto tLow  = _mm512_permutex2var_epi8(p0, x, p1);
  auto tHigh = _mm512_permutex2var_epi8(p2, x, p3);
</code></pre>
You can use _mm512_movepi8_mask to build the mask register. That instruction sets a lane active if the high bit of its byte is set, which perfectly matches our split of the table at 128. You can use the mask register directly on the second shuffle instruction or on a later blend instruction; it doesn't really matter.<p>For every 64 elements, the AVX512 version does one load, one store, and two permutes; Zen5 can do 2 permutes a cycle, so that's 64 elements per cycle.<p>So our theoretical speedup here is ~32x over the scalar code! You could pull tricks like this with SSE and pshufb, but there the lookup table is too small to really be useful. Being able to do an arbitrary, super-fast byte-to-byte transform is incredibly useful.</p>
]]></description><pubDate>Wed, 08 Oct 2025 22:53:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=45521529</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=45521529</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45521529</guid></item><item><title><![CDATA[New comment by Remnant44 in "Why we need SIMD"]]></title><description><![CDATA[
<p>Yes and no. I think NEON is undersized for today at 128-bit registers -- if you're working with doubles, for example, that's only two values per register, which is pretty anemic. Shuffles and other tricky bit-ops benefit from wider widths as well (see my other reply).</p>
]]></description><pubDate>Wed, 08 Oct 2025 19:59:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=45519965</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=45519965</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45519965</guid></item><item><title><![CDATA[New comment by Remnant44 in "Why we need SIMD"]]></title><description><![CDATA[
<p>I'm just happy that finally, with the popularity of Zen4 and Zen5 chips, AVX512 is at around ~20% of the running hardware in the Steam hardware survey. It's going to be a long while before it gets to a majority - Intel still isn't shipping its own instruction set in consumer CPUs - but it's going in the right direction.<p>Compared to the weird, lumpy lego set of AVX1/2, AVX512 is quite enjoyable to write with, and still has some fun instructions that deliver more than just twice the width.<p>Personal example: the double-width byte shuffle (_mm512_permutex2var_epi8) that takes 128 bytes of input in two registers. I had a critical inner loop that uses a 256-byte lookup table; running an upper/lower double-shuffle and blending them essentially pops out 64 answers a cycle from the lookup table on Zen5 (which has two shuffle units), which is pretty incredible, and on its own produced a global 4x speedup for the kernel as a whole.</p>
]]></description><pubDate>Wed, 08 Oct 2025 19:43:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=45519829</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=45519829</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45519829</guid></item><item><title><![CDATA[New comment by Remnant44 in "AMD's EPYC 9355P: Inside a 32 Core Zen 5 Server Chip"]]></title><description><![CDATA[
<p>I think the most interesting thing here is the near-lack of NUMA effects on memory access, making high memory bandwidth fairly easy to achieve.<p>Combined with the double-width fabric links, it's an interesting part because it gives a glimpse of some of the directions AMD is expected to take for Zen6 (faster interconnect, dual memory controllers).</p>
]]></description><pubDate>Wed, 01 Oct 2025 17:37:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=45440586</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=45440586</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45440586</guid></item><item><title><![CDATA[New comment by Remnant44 in "Ask HN: Why hasn't x86 caught up with Apple M series?"]]></title><description><![CDATA[
<p>For sure.. for what it's worth though, I have run across several references to ARM designs also implementing uop caches as a power optimization versus just running the decoders, so I'm inclined to say that whatever its cost, it pays for itself. I am not a chip designer though!</p>
]]></description><pubDate>Tue, 26 Aug 2025 07:47:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=45023500</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=45023500</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45023500</guid></item><item><title><![CDATA[New comment by Remnant44 in "Ask HN: Why hasn't x86 caught up with Apple M series?"]]></title><description><![CDATA[
<p>I was going to mention this as well.<p>Source: chipsandcheese.com memory latency graphs</p>
]]></description><pubDate>Tue, 26 Aug 2025 07:43:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=45023472</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=45023472</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45023472</guid></item><item><title><![CDATA[New comment by Remnant44 in "Ask HN: Why hasn't x86 caught up with Apple M series?"]]></title><description><![CDATA[
<p>ARM instructions are fixed-size, while x86 instructions are variable-length. This makes a wide decoder fairly trivial for ARM, while it is complex and difficult for x86.<p>However, this doesn't really hold up as the cause of the difference. The Zen4/5 chips, for example, source the vast majority of their instructions from their uop trace cache, where the instructions have already been decoded. This also saves power - even on ARM, decoders take power.<p>People have been trying to figure out the "secret sauce" since the M chips were introduced. In my opinion, it's a combination of:<p>1) The Apple engineers did a superb job creating a well-balanced architecture.<p>2) Being close to the memory subsystem, with lots of bandwidth and buffers deep enough to use it, is great. For example, my old M2 Pro macbook has more than twice the memory bandwidth of the current best desktop CPU, the Zen5 9950x. That's absurd, but here we are...<p>3) AMD and Intel heavily bias toward the costly side of the watts-vs-performance curve. Even the compact Zen cores are optimized more for area than wattage. I'm curious what a true low-power Zen core (akin to the Apple E cores) would do.</p>
]]></description><pubDate>Mon, 25 Aug 2025 23:58:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=45020647</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=45020647</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45020647</guid></item><item><title><![CDATA[New comment by Remnant44 in "GitHub Copilot Coding Agent"]]></title><description><![CDATA[
<p>Man, I miss Joel's blog. So much developer wisdom that is still relevant even if aged now.</p>
]]></description><pubDate>Wed, 21 May 2025 18:02:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=44054289</link><dc:creator>Remnant44</dc:creator><comments>https://news.ycombinator.com/item?id=44054289</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44054289</guid></item></channel></rss>