<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: pbsd</title><link>https://news.ycombinator.com/user?id=pbsd</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 12 Apr 2026 08:52:24 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=pbsd" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by pbsd in "How many branches can your CPU predict?"]]></title><description><![CDATA[
<p>I mean, he's comparing the 2024 Zen 5 and M4 against Intel's Raptor Lake, a 2022 part two generations behind. Lion Cove should be roughly on par with the M4 on this test.</p>
]]></description><pubDate>Thu, 19 Mar 2026 18:35:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=47443830</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=47443830</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47443830</guid></item><item><title><![CDATA[New comment by pbsd in "You can't fool the optimizer"]]></title><description><![CDATA[
<p>Because the function is not quite correct. It should be<p><pre><code>    return n ? (1u + popcount(n & (n - 1u))) : 0u;
</code></pre>
which both Clang and GCC promptly optimize to a single popcnt.</p>
]]></description><pubDate>Wed, 03 Dec 2025 20:54:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=46140001</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=46140001</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46140001</guid></item><item><title><![CDATA[New comment by pbsd in "Solving Fizz Buzz with Cosines"]]></title><description><![CDATA[
<p>This can be translated to the discrete domain pretty easily, just like the NTT. Pick a sufficiently large prime whose multiplicative group order is a multiple of 15, say p = 2^61-1. 37 generates the whole multiplicative group, and 37^((2^61-2)/3) and 37^((2^61-2)/5) are appropriate roots of unity. Putting it all together yields<p><pre><code>    f(n) = 5226577487551039623 + 1537228672809129301*(1669582390241348315^n + 636260618972345635^n) + 3689348814741910322*(725554454131936870^n + 194643636704778390^n + 1781303817082419751^n + 1910184110508252890^n) mod (2^61-1).
</code></pre>
This involves 6 exponentiations by n with constant bases. Because in fizzbuzz the inputs are sequential, one can further precompute c^(2^i) and c^(-2^i) and, given c^n, advance to c^(n+1) with on average 2 modular multiplications, by multiplying in the appropriate powers c^(+-2^i) corresponding to the flipped bits.</p>
]]></description><pubDate>Sat, 22 Nov 2025 02:30:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=46011529</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=46011529</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46011529</guid></item><item><title><![CDATA[New comment by pbsd in "C++26: range support for std:optional"]]></title><description><![CDATA[
<p>No, the generated code seems to be mostly the same as the manual version: <a href="https://gcc.godbolt.org/z/aK8orbKE8" rel="nofollow">https://gcc.godbolt.org/z/aK8orbKE8</a><p>The main difference seems to be that GCC treats the if() as unlikely to be taken and the for() as likely.</p>
]]></description><pubDate>Wed, 15 Oct 2025 23:58:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=45599873</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=45599873</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45599873</guid></item><item><title><![CDATA[New comment by pbsd in "NSA and IETF: Can an attacker purchase standardization of weakened cryptography?"]]></title><description><![CDATA[
<p>The SIKE comparison is not particularly inconsistent, since Bernstein has been banging the drum for years now that structured lattices may not be as secure as thought.<p>Currently the best attacks on NTRU, Kyber, etc., are essentially the same generic attacks that work on something like Frodo, which is built on unstructured lattices. And while resistance to these unstructured attacks is pretty well studied at this point, it is not unreasonable to suspect that the algebraic structure in the more efficient lattice schemes could lead to more efficient attacks. How much more efficient? Who knows.</p>
]]></description><pubDate>Sun, 05 Oct 2025 14:58:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=45481985</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=45481985</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45481985</guid></item><item><title><![CDATA[New comment by pbsd in "Scream cipher"]]></title><description><![CDATA[
<p>I thought this was gonna be about the actual Scream stream cipher: <a href="https://eprint.iacr.org/2002/019" rel="nofollow">https://eprint.iacr.org/2002/019</a></p>
]]></description><pubDate>Sat, 20 Sep 2025 19:46:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=45316698</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=45316698</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45316698</guid></item><item><title><![CDATA[New comment by pbsd in "The staff ate it later"]]></title><description><![CDATA[
<p>Cimino's Heaven's Gate (1980) is usually pointed to as the movie that caused the "no animals were harmed" disclaimer to be added to subsequent movies.</p>
]]></description><pubDate>Wed, 03 Sep 2025 01:45:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=45111404</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=45111404</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45111404</guid></item><item><title><![CDATA[New comment by pbsd in "Test Results for AMD Zen 5"]]></title><description><![CDATA[
<p>Vector ALU instruction latencies are understandably listed as 2 and higher, but this is not strictly the case. From AMD's Zen 5 optimization manual [1], we have<p><pre><code>    The floating point schedulers have a slow region, in the oldest entries of a scheduler and only when the scheduler is full. If an operation is in the slow region and it is dependent on a 1-cycle latency operation, it will see a 1 cycle latency penalty.
    There is no penalty for operations in the slow region that depend on longer latency operations or loads.
    There is no penalty for any operations in the fast region.
    To write a latency test that does not see this penalty, the test needs to keep the FP schedulers from filling up.
    The latency test could interleave NOPs to prevent the scheduler from filling up.
</code></pre>
Basically, short vector code sequences that don't fill up the scheduler will have better latency.<p>[1] <a href="https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/software-optimization-guides/58455.zip" rel="nofollow">https://www.amd.com/content/dam/amd/en/documents/processor-t...</a></p>
]]></description><pubDate>Sat, 26 Jul 2025 20:33:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=44696747</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=44696747</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44696747</guid></item><item><title><![CDATA[New comment by pbsd in "C++: Zero-cost static initialization"]]></title><description><![CDATA[
<p>>Even after the static variable has been initialised, the overhead of accessing it is still considerable: a function call to __cxa_guard_acquire(), plus atomic_load_explicit(&__b_guard, memory_order::acquire) in __cxa_guard_acquire().<p>No. The lock calls are only done during initialization, in case two threads run the initialization concurrently while the guard variable is 0. Once the variable is initialized, this will always be skipped by "je      .L3".</p>
]]></description><pubDate>Sat, 19 Jul 2025 01:14:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=44611640</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=44611640</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44611640</guid></item><item><title><![CDATA[New comment by pbsd in "The ITTAGE indirect branch predictor"]]></title><description><![CDATA[
<p>The Pentium 4 had branch hints in the form of taken/not-taken prefixes. They were not found to be useful and were basically ignored by every subsequent Intel microarchitecture, until Redwood Cove brought back the branch-taken prefix in 2023.</p>
]]></description><pubDate>Sat, 05 Jul 2025 03:00:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=44469816</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=44469816</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44469816</guid></item><item><title><![CDATA[New comment by pbsd in "Debunking NIST's calculation of the Kyber-512 security level (2023)"]]></title><description><![CDATA[
<p>This circuit [1] puts it at <=135k bit operations. Bitcoin uses SHA-256, not SHA-1.<p>[1] <a href="https://nigelsmart.github.io/MPC-Circuits/sha256.txt" rel="nofollow">https://nigelsmart.github.io/MPC-Circuits/sha256.txt</a></p>
]]></description><pubDate>Sun, 22 Jun 2025 03:44:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=44343350</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=44343350</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44343350</guid></item><item><title><![CDATA[New comment by pbsd in "X X^t can be faster"]]></title><description><![CDATA[
<p>Karatsuba is definitely faster than schoolbook multiplication at practical sizes. You presumably mean Strassen.</p>
]]></description><pubDate>Fri, 16 May 2025 17:52:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=44008124</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=44008124</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44008124</guid></item><item><title><![CDATA[New comment by pbsd in "Chunking Attacks on File Backup Services Using Content-Defined Chunking [pdf]"]]></title><description><![CDATA[
<p>On page 10, should the ring R be GF(2)[X]/(X^32-1) and the map p be from {0,1}^{32} to R?</p>
]]></description><pubDate>Fri, 21 Mar 2025 20:59:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=43440767</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=43440767</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43440767</guid></item><item><title><![CDATA[New comment by pbsd in "Learn How to Break AES"]]></title><description><![CDATA[
<p>Interestingly enough, the Square attack (otherwise more generally known as integral cryptanalysis) is much more powerful than regular linear or differential cryptanalysis when applied to the AES.</p>
]]></description><pubDate>Tue, 04 Mar 2025 21:49:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=43260150</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=43260150</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43260150</guid></item><item><title><![CDATA[New comment by pbsd in "Why Quantum Cryptanalysis is Bollocks [pdf]"]]></title><description><![CDATA[
<p>Antoine Joux was on the side of classical cryptanalysis in a 2014 bet. This was right after the small-characteristic discrete log advances, so it might no longer be the same bet if it were made today.<p><a href="https://x.com/hashbreaker/status/494867301435318273" rel="nofollow">https://x.com/hashbreaker/status/494867301435318273</a></p>
]]></description><pubDate>Wed, 19 Feb 2025 22:31:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=43108600</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=43108600</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43108600</guid></item><item><title><![CDATA[New comment by pbsd in "How do modern compilers choose which variables to put in registers?"]]></title><description><![CDATA[
<p>Jasmin [1] is something like this. It is essentially a high-level assembler: it handles register allocation (but not spills) for you and has some basic control-flow primitives that map 1-to-1 to assembly instructions. There is also an optional formal verification component to prove that a function is equivalent to its reference, is side-channel free, etc.<p>[1] <a href="https://github.com/jasmin-lang/jasmin/wiki">https://github.com/jasmin-lang/jasmin/wiki</a></p>
]]></description><pubDate>Mon, 17 Feb 2025 20:22:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=43082958</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=43082958</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43082958</guid></item><item><title><![CDATA[New comment by pbsd in "New speculative attacks on Apple CPUs"]]></title><description><![CDATA[
<p>It goes way back; check the work of the likes of Thorsten Holz or Christof Paar. TU Graz is another one.</p>
]]></description><pubDate>Tue, 28 Jan 2025 20:21:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=42857454</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=42857454</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42857454</guid></item><item><title><![CDATA[New comment by pbsd in "The Alder Lake anomaly, explained"]]></title><description><![CDATA[
<p>Trying some perf events confirms that there is no extra inserted uop. Going back to the SHLX R[i], R[i], RCX loop, we have:<p>No anomaly:<p><pre><code>     2,190,954,207      cpu_core/cycles:u/                                                      ( +-  0.14% )
     4,412,790,656      cpu_core/uops_issued.any:u/                                             ( +-  0.11% )
        39,386,389      cpu_core/exe_activity.1_ports_util:u/                                        ( +- 11.57% )
     2,121,401,346      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.11% )
         6,015,432      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  8.87% )
       593,599,670      cpu_core/uops_retired.stalls:u/                                         ( +-  0.85% )
</code></pre>
Anomaly:<p><pre><code>     4,357,567,336      cpu_core/cycles:u/                                                      ( +-  0.15% )
     4,448,899,140      cpu_core/uops_issued.any:u/                                             ( +-  0.26% )
     2,107,051,688      cpu_core/exe_activity.1_ports_util:u/                                        ( +-  0.14% )
     1,106,699,503      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.13% )
     1,129,497,409      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  0.42% )
     2,502,226,997      cpu_core/uops_retired.stalls:u/                                         ( +-  0.38% )
</code></pre>
Noise from the surrounding code aside, we see the same number of uops issued. However, in the anomaly case, ~1/4 of the cycles are spent with no uops executing, ~1/2 with only 1 uop executing, and around 1/4 with 2 uops executing. I expected 0 and 2 to split 50/50, consistent with a one-cycle stall, but if the uops are desynced and issued one cycle apart, that would also explain the 1 being so prominent.<p>To confirm this I add an LFENCE at the start of each loop iteration to serialize the pipeline and try to ensure that each SHLX pair is issued in the same cycle. And it works:<p><pre><code>     4,581,269,346      cpu_core/cycles:u/                                                      ( +-  0.10% )
     4,556,347,404      cpu_core/uops_issued.any:u/                                             ( +-  0.12% )
       133,363,872      cpu_core/exe_activity.1_ports_util:u/                                        ( +-  7.73% )
     2,082,838,530      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.24% )
     2,165,817,614      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  0.06% )
     3,090,362,239      cpu_core/uops_retired.stalls:u/                                         ( +-  0.16% )
</code></pre>
Now the uops are split between 0 and 2 executed per cycle, as theorized.</p>
]]></description><pubDate>Wed, 08 Jan 2025 21:34:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=42638695</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=42638695</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42638695</guid></item><item><title><![CDATA[New comment by pbsd in "The Alder Lake anomaly, explained"]]></title><description><![CDATA[
<p>Interleaving CQO and SHLX results in ~1.33 throughput with the anomaly, ~2.0 without. This ratio is more or less constant whether it's 1:1 or 2:2 or 4:4 or 8:8 (with 1:1 it's slightly lower, at ~1.28).<p>This may or may not be consistent with one CQO uop executing once a cycle as expected, and one SHLX uop taking a spot (stalling for one cycle?) for 2 cycles, resulting in a runtime of (x/2 * 1 + x/2 * 2)/2 ~ x/1.33 cycles.</p>
]]></description><pubDate>Tue, 07 Jan 2025 20:23:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=42626996</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=42626996</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42626996</guid></item><item><title><![CDATA[New comment by pbsd in "The Alder Lake anomaly, explained"]]></title><description><![CDATA[
<p>Same framework but instead of, say, SHLX RAX, RAX, RCX I do SHLX R[i], R[i], RCX for 8 consecutive registers. Yes, it still does go to both ports.</p>
]]></description><pubDate>Mon, 06 Jan 2025 13:45:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=42610492</link><dc:creator>pbsd</dc:creator><comments>https://news.ycombinator.com/item?id=42610492</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42610492</guid></item></channel></rss>