<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: pbsd</title><link>https://news.ycombinator.com/user?id=pbsd</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 12 Apr 2026 08:52:24 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=pbsd" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by pbsd in "How many branches can your CPU predict?"]]></title><description><![CDATA[
<p>I mean, he's comparing the 2024 Zen 5 and M4 against Intel's Raptor Lake, a 2022 part two generations behind. Lion Cove should be roughly on par with the M4 on this test.</p>
]]></description><pubDate>Thu, 19 Mar 2026 18:35:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=47443830</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=47443830</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47443830</guid></item><item><title><![CDATA[New comment by pbsd in "You can't fool the optimizer"]]></title><description><![CDATA[
<p>Because the function is not quite correct. It should be<p><pre><code>    return n ? (1u + popcount(n & (n - 1u))) : 0u;
</code></pre>
which both Clang and GCC promptly optimize to a single popcnt.</p>
]]></description><pubDate>Wed, 03 Dec 2025 20:54:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=46140001</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=46140001</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46140001</guid></item><item><title><![CDATA[New comment by pbsd in "Solving Fizz Buzz with Cosines"]]></title><description><![CDATA[
<p>This can be translated to the discrete domain pretty easily, just like the NTT. Pick a sufficiently large prime whose multiplicative group order is a multiple of 15, say p = 2^61-1. 37 generates the whole multiplicative group, and 37^((2^61-2)/3) and 37^((2^61-2)/5) are appropriate roots of unity. Putting it all together yields<p><pre><code>    f(n) = 5226577487551039623 + 1537228672809129301*(1669582390241348315^n + 636260618972345635^n) + 3689348814741910322*(725554454131936870^n + 194643636704778390^n + 1781303817082419751^n + 1910184110508252890^n) mod (2^61-1).
</code></pre>
This involves 6 exponentiations by n with constant bases. Because in fizzbuzz the inputs are sequential, one can further precompute c^(2^i) and c^(-2^i) and, given c^n, advance to c^(n+1) with on average 2 modular multiplications, by multiplying in the appropriate powers c^(+-2^i) corresponding to the flipped bits.</p>
]]></description><pubDate>Sat, 22 Nov 2025 02:30:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=46011529</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=46011529</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46011529</guid></item><item><title><![CDATA[New comment by pbsd in "C++26: range support for std:optional"]]></title><description><![CDATA[
<p>No, the generated code seems to be mostly the same as the manual version: <a href="https://gcc.godbolt.org/z/aK8orbKE8" rel="nofollow">https://gcc.godbolt.org/z/aK8orbKE8</a><p>The main difference seems to be that GCC treats the if() as unlikely to be taken and the for() as likely.</p>
]]></description><pubDate>Wed, 15 Oct 2025 23:58:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=45599873</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=45599873</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45599873</guid></item><item><title><![CDATA[New comment by pbsd in "NSA and IETF: Can an attacker purchase standardization of weakened cryptography?"]]></title><description><![CDATA[
<p>The SIKE comparison is not particularly inconsistent, since Bernstein has been banging the drum for years now that structured lattices may not be as secure as thought.<p>Currently the best attacks on NTRU, Kyber, etc., are essentially the same generic attacks that work on something like Frodo, which is built on unstructured lattices. And while resistance to these unstructured attacks is pretty well studied at this point, it is not unreasonable to suspect that the algebraic structure in the more efficient lattice schemes could lead to more efficient attacks. How much more efficient? Who knows.</p>
]]></description><pubDate>Sun, 05 Oct 2025 14:58:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=45481985</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=45481985</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45481985</guid></item><item><title><![CDATA[New comment by pbsd in "Scream cipher"]]></title><description><![CDATA[
<p>I thought this was gonna be about the actual Scream stream cipher: <a href="https://eprint.iacr.org/2002/019" rel="nofollow">https://eprint.iacr.org/2002/019</a></p>
]]></description><pubDate>Sat, 20 Sep 2025 19:46:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=45316698</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=45316698</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45316698</guid></item><item><title><![CDATA[New comment by pbsd in "The staff ate it later"]]></title><description><![CDATA[
<p>Cimino's Heaven's Gate (1980) is usually pointed to as the movie that caused the "no animals were harmed" disclaimer to be added to subsequent movies.</p>
]]></description><pubDate>Wed, 03 Sep 2025 01:45:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=45111404</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=45111404</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45111404</guid></item><item><title><![CDATA[New comment by pbsd in "Test Results for AMD Zen 5"]]></title><description><![CDATA[
<p>Vector ALU instruction latencies are understandably listed as 2 and higher, but this is not strictly the case. From AMD's Zen 5 optimization manual [1], we have<p><pre><code>    The floating point schedulers have a slow region, in the oldest entries of a scheduler and only when the scheduler is full. If an operation is in the slow region and it is dependent on a 1-cycle latency operation, it will see a 1 cycle latency penalty.
    There is no penalty for operations in the slow region that depend on longer latency operations or loads.
    There is no penalty for any operations in the fast region.
    To write a latency test that does not see this penalty, the test needs to keep the FP schedulers from filling up.
    The latency test could interleave NOPs to prevent the scheduler from filling up.
</code></pre>
Basically, short vector code sequences that don't fill up the scheduler will have better latency.<p>[1] <a href="https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/software-optimization-guides/58455.zip" rel="nofollow">https://www.amd.com/content/dam/amd/en/documents/processor-t...</a></p>
]]></description><pubDate>Sat, 26 Jul 2025 20:33:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=44696747</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=44696747</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44696747</guid></item><item><title><![CDATA[New comment by pbsd in "C++: Zero-cost static initialization"]]></title><description><![CDATA[
<p>>Even after the static variable has been initialised, the overhead of accessing it is still considerable: a function call to __cxa_guard_acquire(), plus atomic_load_explicit(&__b_guard, memory_order::acquire) in __cxa_guard_acquire().<p>No. The lock calls are only done during initialization, in case two threads run the initialization concurrently while the guard variable is 0. Once the variable is initialized, this will always be skipped by "je      .L3".</p>
]]></description><pubDate>Sat, 19 Jul 2025 01:14:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=44611640</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=44611640</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44611640</guid></item><item><title><![CDATA[New comment by pbsd in "The ITTAGE indirect branch predictor"]]></title><description><![CDATA[
<p>The Pentium 4 had branch hints in the form of taken/not-taken prefixes. They were not found to be useful and were basically ignored by every subsequent Intel microarchitecture, until Redwood Cove brought back the branch-taken prefix in 2023.</p>
]]></description><pubDate>Sat, 05 Jul 2025 03:00:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=44469816</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=44469816</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44469816</guid></item><item><title><![CDATA[New comment by pbsd in "Debunking NIST's calculation of the Kyber-512 security level (2023)"]]></title><description><![CDATA[
<p>This circuit [1] puts it at <=135k bit operations. Bitcoin uses SHA-256, not SHA-1.<p>[1] <a href="https://nigelsmart.github.io/MPC-Circuits/sha256.txt" rel="nofollow">https://nigelsmart.github.io/MPC-Circuits/sha256.txt</a></p>
]]></description><pubDate>Sun, 22 Jun 2025 03:44:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=44343350</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=44343350</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44343350</guid></item><item><title><![CDATA[New comment by pbsd in "X X^t can be faster"]]></title><description><![CDATA[
<p>Karatsuba is definitely faster than schoolbook multiplication at practical sizes. You presumably mean Strassen.</p>
]]></description><pubDate>Fri, 16 May 2025 17:52:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=44008124</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=44008124</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44008124</guid></item><item><title><![CDATA[New comment by pbsd in "Chunking Attacks on File Backup Services Using Content-Defined Chunking [pdf]"]]></title><description><![CDATA[
<p>On page 10, should the ring R be GF(2)[X]/(X^32-1) and the map p be from {0,1}^{32} to R?</p>
]]></description><pubDate>Fri, 21 Mar 2025 20:59:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=43440767</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=43440767</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43440767</guid></item><item><title><![CDATA[New comment by pbsd in "Learn How to Break AES"]]></title><description><![CDATA[
<p>Interestingly enough, the Square attack (otherwise more generally known as integral cryptanalysis) is much more powerful than regular linear or differential cryptanalysis when applied to the AES.</p>
]]></description><pubDate>Tue, 04 Mar 2025 21:49:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=43260150</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=43260150</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43260150</guid></item><item><title><![CDATA[New comment by pbsd in "Why Quantum Cryptanalysis is Bollocks [pdf]"]]></title><description><![CDATA[
<p>Antoine Joux was on the side of classical cryptanalysis in a 2014 bet. This was right after the small-characteristic discrete log advances, so it might no longer be the same bet if it were made today.<p><a href="https://x.com/hashbreaker/status/494867301435318273" rel="nofollow">https://x.com/hashbreaker/status/494867301435318273</a></p>
]]></description><pubDate>Wed, 19 Feb 2025 22:31:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=43108600</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=43108600</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43108600</guid></item><item><title><![CDATA[New comment by pbsd in "How do modern compilers choose which variables to put in registers?"]]></title><description><![CDATA[
<p>Jasmin [1] is something like this. It is essentially a high-level assembler: it handles register allocation (but not spills) for you and has some basic control-flow primitives that map 1-to-1 to assembly instructions. There is also an optional formal verification component to prove that a function is equivalent to its reference, is side-channel free, etc.<p>[1] <a href="https://github.com/jasmin-lang/jasmin/wiki">https://github.com/jasmin-lang/jasmin/wiki</a></p>
]]></description><pubDate>Mon, 17 Feb 2025 20:22:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=43082958</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=43082958</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43082958</guid></item><item><title><![CDATA[New comment by pbsd in "New speculative attacks on Apple CPUs"]]></title><description><![CDATA[
<p>It goes way back; check the work of the likes of Thorsten Holz or Christof Paar. TU Graz is another one.</p>
]]></description><pubDate>Tue, 28 Jan 2025 20:21:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=42857454</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=42857454</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42857454</guid></item><item><title><![CDATA[New comment by pbsd in "The Alder Lake anomaly, explained"]]></title><description><![CDATA[
<p>Trying some perf events confirms that there is no extra inserted uop. Going back to the SHLX R[i], R[i], RCX loop, we have:<p>No anomaly:<p><pre><code>     2,190,954,207      cpu_core/cycles:u/                                                      ( +-  0.14% )
     4,412,790,656      cpu_core/uops_issued.any:u/                                             ( +-  0.11% )
        39,386,389      cpu_core/exe_activity.1_ports_util:u/                                        ( +- 11.57% )
     2,121,401,346      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.11% )
         6,015,432      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  8.87% )
       593,599,670      cpu_core/uops_retired.stalls:u/                                         ( +-  0.85% )
</code></pre>
Anomaly:<p><pre><code>     4,357,567,336      cpu_core/cycles:u/                                                      ( +-  0.15% )
     4,448,899,140      cpu_core/uops_issued.any:u/                                             ( +-  0.26% )
     2,107,051,688      cpu_core/exe_activity.1_ports_util:u/                                        ( +-  0.14% )
     1,106,699,503      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.13% )
     1,129,497,409      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  0.42% )
     2,502,226,997      cpu_core/uops_retired.stalls:u/                                         ( +-  0.38% )
</code></pre>
Noise from the surrounding code aside, we see the same number of uops issued. However, in the anomaly case, ~1/4 of the cycles are spent with no uops executing, ~1/2 with only 1 uop executing, and around 1/4 with 2 uops executing. I expected 0 and 2 to split 50/50, consistent with a one-cycle stall, but if the uops are desynced and issued one cycle apart, that would also explain the 1 being so prominent.<p>To confirm this I add an LFENCE at the start of each loop iteration to serialize the pipeline and try to ensure that each SHLX pair is issued in the same cycle. And it works:<p><pre><code>     4,581,269,346      cpu_core/cycles:u/                                                      ( +-  0.10% )
     4,556,347,404      cpu_core/uops_issued.any:u/                                             ( +-  0.12% )
       133,363,872      cpu_core/exe_activity.1_ports_util:u/                                        ( +-  7.73% )
     2,082,838,530      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.24% )
     2,165,817,614      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  0.06% )
     3,090,362,239      cpu_core/uops_retired.stalls:u/                                         ( +-  0.16% )
</code></pre>
Now the uops are split between 0 and 2 executed per cycle, as theorized.</p>
]]></description><pubDate>Wed, 08 Jan 2025 21:34:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=42638695</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=42638695</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42638695</guid></item><item><title><![CDATA[New comment by pbsd in "The Alder Lake anomaly, explained"]]></title><description><![CDATA[
<p>Interleaving CQO and SHLX results in ~1.33 throughput with the anomaly, ~2.0 without. This ratio is more or less constant whether it's 1:1 or 2:2 or 4:4 or 8:8 (with 1:1 it's slightly lower, at ~1.28).<p>This may or may not be consistent with one CQO uop executing once a cycle as expected, and one SHLX uop taking a spot (stalling for one cycle?) for 2 cycles, resulting in a runtime of (x/2 * 1 + x/2 * 2)/2 ~ x/1.33 cycles.</p>
]]></description><pubDate>Tue, 07 Jan 2025 20:23:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=42626996</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=42626996</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42626996</guid></item><item><title><![CDATA[New comment by pbsd in "The Alder Lake anomaly, explained"]]></title><description><![CDATA[
<p>Same framework but instead of, say, SHLX RAX, RAX, RCX I do SHLX R[i], R[i], RCX for 8 consecutive registers. Yes, it still does go to both ports.</p>
]]></description><pubDate>Mon, 06 Jan 2025 13:45:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=42610492</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=42610492</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42610492</guid></item></channel></rss>