Hacker News: hairtuq

New comment by hairtuq in "When Compiler Optimizations Hurt Performance"

hairtuq — Tue, 21 Oct 2025 19:15:54 +0000

The mapping after we have the leading 1's count can be done in 3 instructions (in 32-bit math) on either x86 or ARM:

    t = 0x2020c6 >> x
    return (t << (t & 31)) >> 27

New comment by hairtuq in "How we made JSON.stringify more than twice as fast"

hairtuq — Wed, 06 Aug 2025 21:19:45 +0000

If you want to optimize it a little more, you can combine isLess(s, 0x20) and isChar('"') into isLess(s ^ (kOnes * 0x02), 0x21) (this works since '"' is 0x22).

New comment by hairtuq in "A leap year check in three instructions"

hairtuq — Fri, 16 May 2025 05:37:22 +0000

Thanks, fixed.

New comment by hairtuq in "Optimizing with Novel Calendrical Algorithms"

hairtuq — Mon, 03 Feb 2025 22:51:24 +0000

The article leaves out optimizing is_leap_year. Here is an interesting version that is correct (in the Proleptic Gregorian calendar) for 0 <= year <= 102499 and takes only 3 instructions:

    ((year * 1073750999) & 3221352463) <= 126976

How this works is surprisingly complex, I am thinking of writing a whole blog post about it...

New comment by hairtuq in "Optimizing uint64_t Digit Counting: A Method that Beats Lemire's by up to 143%"

hairtuq — Wed, 08 Jan 2025 06:56:52 +0000

Here is a cute variant that doesn't need lzcnt nor tables, but only works up to 99,999:

    (((x + 393206) & (x + 524188)) ^ ((x + 916504) & (x + 514288))) >> 17

This is for integer log 10, but could be adapted for number of digits. It needs a wrapper for 64 bit to invoke it multiple times, but most numbers in a JSON are small, so it might even be competitive; it needs only 4 cycles with enough instruction level parallelism.

I gathered this idea from the output of a superoptimizer, it was fun to figure out how it works. For spoilers, see [1].

[1] https://github.com/rust-lang/rust/blob/master/library/core/s...

New comment by hairtuq in "Bitbanging 1D Reversible Automata"

hairtuq — Thu, 12 Dec 2024 23:58:07 +0000

The author wonders:

> In theory at least, the compiler can see that rule only has 256 values and create a reduced version of ca1d_rule_apply for each value. Whether it actually does is not of much practical concern when the rendering code is the bottle neck. However it’s interesting to see if the compiler can deduce the best solution or whether anything trips it up.

The compiler is unlikely to get the optimal result here. The core of this is finding the best instruction sequence for a ternary boolean operation encoded in 8 bits; it's the same job needed for emulating the AVX512F "vpternlog" instruction. This can always be done in at most 5 instructions (or 4 if you have andnot/ornot/xornot), but it's not straightforward to do this. Here is some code that calculates optimal instruction sequences (by letting z3 do the heavy lifting): https://github.com/falk-hueffner/ternary-logic-optimization

New comment by hairtuq in "Decoding UTF8 with parallel extract"

hairtuq — Mon, 06 May 2024 18:17:53 +0000

Similarly, the eexpect table can be done in 3 instructions with with t = 0x3c783023f + i; return ((t >> (t & 63)) & 0xf0808088.

New comment by hairtuq in "Show HN: Kuboble.com – Minimalistic sliding pieces puzzle game"

hairtuq — Wed, 08 Feb 2023 18:46:36 +0000

I wrote a solver for Atomix (https://github.com/falk-hueffner/atomixer), a similar game. Technically, the only difference is that in Atomix the solution location is not predefined, although probably based on level design this game actually feels quite different. Since the solver just tries all solution location anyway, it would be quite simple to adapt it to this game.

New comment by hairtuq in "A neat XOR trick"

hairtuq — Tue, 13 Dec 2022 18:48:36 +0000

On a related note, almost any task around shuffling or shifting bits can be improved with xor. For example the straightforward way to remove bit i is

    mask = -1 << i;
    return (x & ~mask) | ((x >> 1) & mask);

but with xor we can do it in one less instruction:

    mask = -1 << i;
    return ((x ^ (x >> 1)) & mask) ^ x;

New comment by hairtuq in "After Xinjiang Revelations, Germany's Ties to China Are Under the Microscope"

hairtuq — Wed, 01 Jun 2022 19:19:49 +0000

It seems you're shifting the goal posts now. The fact that consensus criticisms are not as harsh as you'd like doesn't mean that Israel is "totally off limits". Discussions and criticisms are happening; and the fact that public opinion doesn't match yours can not only be explained by that opinion being suppressed, but also by it being wrong.

New comment by hairtuq in "After Xinjiang Revelations, Germany's Ties to China Are Under the Microscope"

hairtuq — Tue, 31 May 2022 13:10:27 +0000

I don't see that. For example, it has been the official position of the German government for at least 20 years to strongly criticize the Israeli settlements. There's regular critical reports in the press, endless TV discussions and so on. Can you name someone who has actually had to suffer negative consequences for reasonable criticism of Israel?

New comment by hairtuq in "Show HN: PyIng – Ingredient parser"

hairtuq — Tue, 29 Mar 2022 10:38:17 +0000

I wrote an ingredient parser that is basically just one giant regular expression: https://github.com/falk-hueffner/metric-cooking/blob/master/...

The use case is a bit different (the task includes finding ingredients mentioned in a longer text, and it'll rather not parse something rather than parsing it wrong), but it works fairly OK and can even parse things like "3/4 cup plus 2 tablespoons packed light-brown sugar".

New comment by hairtuq in "Solving NP-hard puzzles with the oldest trick in the book"

hairtuq — Thu, 14 Oct 2021 15:03:04 +0000

For the quite similar puzzle Atomix, it also seems like the branching factor would be much higher for backward search because upper bounds are weaker, but you can show that on average the branching factor is actually the same [1]. I wonder if the same argument would work here.

[1] http://hueffner.de/falk/hueffner-studienarbeit-atomix.pdf Section 5.5

New comment by hairtuq in "Impact of Undefined Behavior on Performance"

hairtuq — Fri, 09 Jul 2021 18:12:59 +0000

You don't necessarily need a wider type (which might be slower to work with), you can just calculate (n|1) * ((n+1)/2). Clang does something much more complicated, probably because for it is just a special case of some much more general optimization.

New comment by hairtuq in "Counting the number of matching characters in two ASCII strings"

hairtuq — Sat, 22 May 2021 08:30:36 +0000

Slightly simpler:

    c = x ^ y;
    return popcount(((c >> 1) | 0x8080808080808080) - c) & 0x8080808080808080);

New comment by hairtuq in "Counting the number of matching characters in two ASCII strings"

hairtuq — Sat, 22 May 2021 07:42:33 +0000

Actually, e is the inverse of the xor, so (e & c1) is guaranteed to be c1, and you still need popcount(c1 & ((e & c2) + c3)).

New comment by hairtuq in "Bit Twiddling Hacks"

hairtuq — Fri, 04 Dec 2020 10:58:26 +0000

The name SWAR is sometimes used (https://en.wikipedia.org/wiki/SWAR).

Indeed, even 8-bit fields can be added in parallel, using the fact that ^ is like a + that does not produce a carry:

    uint64_t signmask = 0x8080808080808080;
    uint64_t sum_without_sign_bits = ((x & ~signmask) + (y & ~signmask));
    uint64_t sum_of_sign_bits = (x ^ y) & signmask;
    return sum_without_sign_bits ^ sum_of_sign_bits;

New comment by hairtuq in "Programming with a Read-Eval-Synth Loop [pdf]"

hairtuq — Tue, 20 Oct 2020 11:53:28 +0000

There is some work on automatically generating data structure implementations from high-level specifications: https://cozy.uwplse.org/

New comment by hairtuq in "Big O, Little N"

hairtuq — Sat, 03 Oct 2020 07:18:47 +0000

Since the shift count is implicitly modulo 32, you don't even need the subtraction:

    bool isvowel(char c) {
        return (1u << (c&31)) & 0x208222;
    }

New comment by hairtuq in "Turn recipe websites into plain text"

hairtuq — Fri, 26 Jun 2020 08:12:18 +0000

I wrote a browser extension that does this (although it will of course occasionally get something wrong): https://github.com/falk-hueffner/metric-cooking