<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: janwas</title><link>https://news.ycombinator.com/user?id=janwas</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 06 Apr 2026 05:41:16 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=janwas" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>Highway TL here. I agree with the main points, with a few clarifications:<p>> tag-dispatched free functions like hn::Mul(d, a, b)<p>We only require tags for certain ops, mainly memory ops, casts and reductions, not arithmetic. Operator overloading is supported, but until recently compilers didn't allow it for SVE vectors.<p>> It’s a Google project with Google-scale maintenance, but the bus factor is real — the core development is driven by a small team<p>We have 101 contributors, including 14 current or former Googlers across several teams.<p>> being length-agnostic means you can’t easily express fixed-width algorithms that depend on knowing the vector size at compile time, which is common in cryptography and codec work<p>We explicitly support fixed-length 128-bit vectors, acknowledging that these are common and important.</p>
]]></description><pubDate>Mon, 23 Mar 2026 06:46:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=47486150</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=47486150</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47486150</guid></item><item><title><![CDATA[New comment by janwas in "RISC-V Is Sloooow"]]></title><description><![CDATA[
<p>Looks like the ratification plan for Zvzip is November. So maybe 3y until HW is actually usable?
That's a neat trick with wmacc, congrats. But still, half the speed for quite a fundamental operation that has been heavily used in other ISAs for 20+ years :(<p>Great that you did a gap analysis [1]. I'm curious if one of the inputs for that was the list of Highway ops [2]?<p>[1]: <a href="https://gist.github.com/camel-cdr/99a41367d6529f390d25e36ca3e4b626" rel="nofollow">https://gist.github.com/camel-cdr/99a41367d6529f390d25e36ca3...</a>
[2]: <a href="https://github.com/google/highway/blob/master/g3doc/quick_reference.md" rel="nofollow">https://github.com/google/highway/blob/master/g3doc/quick_re...</a></p>
]]></description><pubDate>Sat, 14 Mar 2026 08:34:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=47374562</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=47374562</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47374562</guid></item><item><title><![CDATA[New comment by janwas in "RISC-V Is Sloooow"]]></title><description><![CDATA[
<p>(Personal opinion)
I get the impression that RISC-V-related discussions often lack awareness of prior work and alternatives. A large amount of (x86) software actually uses our Highway library to run on whatever size vectors <i>and instructions</i> the CPU offers.<p>This works quite well in practice. As to leaving performance on the table, it seems RVV has some egregious performance differences/cliffs. For example, should we use vrgather (with what LMUL), or interesting workarounds such as widening+slide1, to implement a basic operation such as interleaving two vectors?</p>
]]></description><pubDate>Fri, 13 Mar 2026 19:54:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=47368968</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=47368968</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47368968</guid></item><item><title><![CDATA[New comment by janwas in "I have written gemma3 inference in pure C"]]></title><description><![CDATA[
<p>:D Your code was nicely written and it was a pleasure to port to SIMD because it was already very data-parallel.</p>
]]></description><pubDate>Wed, 28 Jan 2026 20:28:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=46801069</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=46801069</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46801069</guid></item><item><title><![CDATA[New comment by janwas in "CES 2026: Taking the Lids Off AMD's Venice and MI400 SoCs"]]></title><description><![CDATA[
<p>Gemma.cpp has nested thread pools, one per chiplet, and one across all chiplets.
With such core counts it is quite important to minimize any kind of sharing, even RMW atomics.</p>
]]></description><pubDate>Sat, 10 Jan 2026 18:07:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=46568265</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=46568265</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46568265</guid></item><item><title><![CDATA[New comment by janwas in "The state of SIMD in Rust in 2025"]]></title><description><![CDATA[
<p>> performance of general solutions without using SIMD, is good enough too, since all of which will eventually dump right down to the uops anyway, with deep pipeline, branch predictor, superscalar and speculative execution doing their magics altogether<p>A quick comment on this one point (personal opinion): from a hyperscaler perspective, scalar code is most certainly not good enough. The energy cost of scheduling a MUL instruction is something like 10x that of the actual multiplication. It is important to amortize that cost over many elements (i.e. SIMD).</p>
]]></description><pubDate>Thu, 06 Nov 2025 09:00:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=45833030</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=45833030</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45833030</guid></item><item><title><![CDATA[New comment by janwas in "The Dragon Hatchling: The missing link between the transformer and brain models"]]></title><description><![CDATA[
<p>Wow, that number requires STRONG caveats, lest it be called out as completely false.
Take away the tensor cores (unless you only do matmuls?), and an H100 has roughly 2x as many f32 flops as a Zen5 CPU, which is considerably cheaper. I suspect brute force HW/algorithms are not going to age well: <a href="https://www.sigarch.org/dont-put-all-your-tensors-in-one-basket-hardware-lottery/" rel="nofollow">https://www.sigarch.org/dont-put-all-your-tensors-in-one-bas...</a>
(/personal opinion)</p>
]]></description><pubDate>Sun, 26 Oct 2025 15:15:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=45712546</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=45712546</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45712546</guid></item><item><title><![CDATA[New comment by janwas in "Java 25's new CPU-Time Profiler"]]></title><description><![CDATA[
<p>CPU-time would over-emphasize regions where many threads are running, right?
I find wall-time useful for finding serial regions that aren't yet parallelized.<p>More detail here: <a href="https://github.com/dvyukov/perf-load" rel="nofollow">https://github.com/dvyukov/perf-load</a>. We recently implemented the same idea without requiring context-switch events: <a href="https://github.com/google/highway/blob/master/hwy/profiler.h#L283" rel="nofollow">https://github.com/google/highway/blob/master/hwy/profiler.h...</a></p>
]]></description><pubDate>Sun, 14 Sep 2025 08:49:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=45238478</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=45238478</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45238478</guid></item><item><title><![CDATA[New comment by janwas in "SIMD Perlin Noise: Beating the Compiler with SSE (2014)"]]></title><description><![CDATA[
<p>:)
Yes indeed, it's about 500 LOC in <a href="https://github.com/google/highway/blob/master/hwy/ops/generic_ops-inl.h#L4877">https://github.com/google/highway/blob/master/hwy/ops/generi...</a>.</p>
]]></description><pubDate>Fri, 25 Jul 2025 17:57:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=44686140</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=44686140</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44686140</guid></item><item><title><![CDATA[New comment by janwas in "SIMD Perlin Noise: Beating the Compiler with SSE (2014)"]]></title><description><![CDATA[
<p>I appreciate your efforts to nudge readers towards SoA data structures and varying SIMD widths. FWIW I have observed that advice is more effective if communicated with some kindness.</p>
]]></description><pubDate>Fri, 25 Jul 2025 07:57:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=44680775</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=44680775</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44680775</guid></item><item><title><![CDATA[New comment by janwas in "SIMD Perlin Noise: Beating the Compiler with SSE (2014)"]]></title><description><![CDATA[
<p>Thanks for expanding on your viewpoint.<p>> Why would writing an optimizing compiler qualify as territory for directly writing SIMD code, but anything else is off the table?<p>I understood "directly writing" to mean assembly or even intrinsics. In general, I would advise not touching intrinsics directly, because the intrinsic definitions themselves have in several cases had bugs. Here's one AVX-512 example: <a href="https://github.com/google/highway/commit/7da2b760c012db04103d96a930e9c1cb445667ca">https://github.com/google/highway/commit/7da2b760c012db04103...</a>.<p>When using a wrapper library, these can be fixed in one spot, but every direct user of intrinsics has to deal with it themselves.<p>> it's extremely easy to write your own wrappers (like I did) and not take a dependency. A good trade IMO<p>I understand wanting to reduce dependencies. The tradeoff is a bit more complex: for example many readers would be familiar with Highway terminology. We have also made efforts to be a lightweight dependency :)<p>> doing it yourself as a learning exercise has value.<p>Understandable :) Though it's a bit regrettable to tie your user code to the library prototype - if used elsewhere, it would probably have to be ported.<p>> The fact is, the library code is super fucking boring.<p>True for many ops. However, emulating AES or other complex ops is nontrivial. And it is easy to underestimate the sheer toil of keeping things working across compiler versions and their many bugs. We recently hit the 3000 commit mark in Highway :)</p>
]]></description><pubDate>Fri, 25 Jul 2025 07:54:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=44680752</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=44680752</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44680752</guid></item><item><title><![CDATA[New comment by janwas in "Algorithms for Modern Processor Architectures"]]></title><description><![CDATA[
<p>You also get the automatic support for newer instructions (and multiversioning) with a wrapper library such as our Highway :)</p>
]]></description><pubDate>Thu, 24 Jul 2025 08:24:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=44668364</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=44668364</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44668364</guid></item><item><title><![CDATA[New comment by janwas in "SIMD Perlin Noise: Beating the Compiler with SSE (2014)"]]></title><description><![CDATA[
<p>Highway author here :) I'm curious what you disagree with, because it all sounds very sensible to me?</p>
]]></description><pubDate>Thu, 24 Jul 2025 07:50:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=44668114</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=44668114</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44668114</guid></item><item><title><![CDATA[New comment by janwas in "The messy reality of SIMD (vector) functions"]]></title><description><![CDATA[
<p>Makes sense :) Generic or fallback versions are also useful for correctness testing and benchmarking.</p>
]]></description><pubDate>Mon, 07 Jul 2025 18:09:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=44493097</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=44493097</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44493097</guid></item><item><title><![CDATA[New comment by janwas in "The messy reality of SIMD (vector) functions"]]></title><description><![CDATA[
<p>I made the same argument a while ago but a coworker changed my mind.<p>Can you afford to write <i>and maintain</i> a codepath per ISA (knowing that more keep coming, including RVV, LASX and HVX), to squeeze out the last X%? Is there no higher-impact use of developer time? If so, great.<p>If not, what's the alternative - scalar code? I'd think decent portable SIMD code is still better than nothing, and nothing (scalar) is all we have for new ISAs which have not yet been hand-optimized. So it seems we should anyway have a generic SIMD path, in addition to any hand-optimized specializations.<p>BTW, Highway indeed provides decent emulations of LD2..4, and at least 2-table lookups. Note that some Arm microarchitectures are anyway slow with LD3 and LD4.</p>
]]></description><pubDate>Mon, 07 Jul 2025 14:23:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=44490698</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=44490698</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44490698</guid></item><item><title><![CDATA[New comment by janwas in "The messy reality of SIMD (vector) functions"]]></title><description><![CDATA[
<p>Mature wrappers of this kind do exist, for example our Highway library :)</p>
]]></description><pubDate>Sun, 06 Jul 2025 11:32:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=44479798</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=44479798</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44479798</guid></item><item><title><![CDATA[New comment by janwas in "The messy reality of SIMD (vector) functions"]]></title><description><![CDATA[
<p>If SIMT is so obviously the right path, why have just about all GPU vendors and standards reinvented SIMD, calling it subgroups (Vulkan), __shfl_sync (CUDA), work group/sub-group (OpenCL), wave intrinsics (HLSL), I think also simdgroup (Metal)?</p>
]]></description><pubDate>Sun, 06 Jul 2025 11:31:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=44479787</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=44479787</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44479787</guid></item><item><title><![CDATA[New comment by janwas in "The messy reality of SIMD (vector) functions"]]></title><description><![CDATA[
<p>I don't understand why it helps to "avoid them" entirely. For the (in my experience) >90% of shared code, we can gain the convenience of the wrapper library. For the rest, Highway allows target-specific specializations amidst your otherwise portable code: `#if HWY_TARGET == HWY_AVX2 ...`</p>
]]></description><pubDate>Sun, 06 Jul 2025 11:24:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=44479759</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=44479759</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44479759</guid></item><item><title><![CDATA[New comment by janwas in "The messy reality of SIMD (vector) functions"]]></title><description><![CDATA[
<p>I know dzaima is aware, but for all the other posters who might not be, our Highway library provides all these missing instructions, via emulation if required.<p>I do not understand why folks are still making do with direct use of intrinsics or compiler builtins. Having a library centralize workarounds (such as an MSAN compiler change that hit us last week) seems like an obvious win.</p>
]]></description><pubDate>Sun, 06 Jul 2025 11:20:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=44479729</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=44479729</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44479729</guid></item><item><title><![CDATA[New comment by janwas in "Fundamental flaws of SIMD ISAs (2021)"]]></title><description><![CDATA[
<p>Nice work :) Clang x86 indeed unrolls, which is good. But setting the CC and AA mask constants looks fairly expensive compared to fixed-pattern shuffles.<p>Yes, the 2D aspect of the sorting network complicates things. Transposing is already harder to make VLA and fusing it with the other shuffles certainly doesn't help.</p>
]]></description><pubDate>Sat, 26 Apr 2025 17:29:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=43805494</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=43805494</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43805494</guid></item></channel></rss>