<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: janwas</title><link>https://news.ycombinator.com/user?id=janwas</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 25 May 2026 20:28:09 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=janwas" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>Oops, the final T got cut off somehow, sorry about that.<p><a href="https://gcc.godbolt.org/z/KM3ben7ET" rel="nofollow">https://gcc.godbolt.org/z/KM3ben7ET</a></p>
]]></description><pubDate>Mon, 25 May 2026 14:26:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=48267245</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48267245</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48267245</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>Any suggestions for improvement? We went through >5 iterations of the dispatching and I am fairly confident this is about as good as it gets in current C++.
I suppose "macro hell" is a matter of taste. Objectively, we have six dispatch related macros in the example: <a href="https://gcc.godbolt.org/z/KM3ben7E" rel="nofollow">https://gcc.godbolt.org/z/KM3ben7E</a>
The ~two dozen lines of boilerplate are generally copied from an example.
But why multi-file?</p>
]]></description><pubDate>Wed, 20 May 2026 09:16:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=48205076</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48205076</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48205076</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>Working on one together with fastcode.org :)</p>
]]></description><pubDate>Mon, 18 May 2026 12:45:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=48178973</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48178973</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48178973</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>To be clear, "better abstractions" here seems to mean macros for assembly language. To each their own.<p>What bothers me is advocating for this, or denigrating more generally useful alternatives, without mentioning the very narrow niche where this sits.<p>Video codecs only change every few years. This makes it more worthwhile/feasible to spend eng time on a few kernels.<p>Even then, not supporting SVE (you don't, right?) gives less incentive for the Arm CPU ecosystem to invest in it, helping to keep us stuck in the NEON local minimum. Not ideal :/</p>
]]></description><pubDate>Mon, 18 May 2026 07:58:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=48176670</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48176670</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48176670</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>Correction (typo): Z13 lacks fp32.</p>
]]></description><pubDate>Mon, 18 May 2026 06:24:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=48176178</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48176178</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48176178</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>Oh, interesting :) I meant Fastcode.org.</p>
]]></description><pubDate>Mon, 18 May 2026 06:23:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=48176171</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48176171</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48176171</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>Is this a good faith reply? The particular abstraction we built, which is the one being discussed, is manifestly and obviously not a lowest common denominator.
Looks like you are deploying a second straw man, that of zero cost. In other comments here I acknowledge a cost to intrinsics.</p>
]]></description><pubDate>Sun, 17 May 2026 16:37:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=48170503</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48170503</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48170503</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>?? Where did you see mention of AI?</p>
]]></description><pubDate>Sun, 17 May 2026 16:13:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=48170232</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48170232</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48170232</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>Thanks for sharing. The first link seems non public indeed.
I can imagine there is some compile issue we could reasonably fix, with the help of someone who has Z13 access. Please encourage them to raise an issue. I will be back on May 26.
After that, it should at least be able to use the scalar fallback.
The issue with Z13 is that it lacks fp32 support. Would their usage be integer only?</p>
]]></description><pubDate>Sun, 17 May 2026 16:12:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=48170217</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48170217</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48170217</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>Fair point. If it helps, our security team has called Highway critical infrastructure and helped to harden the repo.
The flip side of standardization is that it would be much harder and slower to add ops as the need arises, which we do regularly.</p>
]]></description><pubDate>Sun, 17 May 2026 16:06:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=48170148</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48170148</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48170148</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>:) I figure there is always something left to improve. For some kernels which really want to keep 30+ live registers, the compiler might not do as good a job as careful manual tuning, so intrinsics can have a bit of a cost. But I also figure optimization time is limited, so it is better to get several kernels to 90% than one kernel to 99%.</p>
]]></description><pubDate>Sun, 17 May 2026 16:02:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=48170107</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48170107</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48170107</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>Yes, the EMU128 target is scalar only, with for loops. This is a fun way to see how well autovectorization works, with the same source code.
That works on any CPU. Curious which projects have such concerns, any link?</p>
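<p>As a rough illustration only: a minimal sketch of what a scalar-emulated 128-bit vector could look like in plain C++. The names <code>Vec128</code>, <code>Load</code>, <code>Add</code>, <code>Store</code> and <code>AddArrays</code> are hypothetical and this is not Highway's actual EMU128 code. The point is merely that every op is a short for loop with a fixed trip count, which the compiler may or may not autovectorize.</p>

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical emulated 128-bit vector: just an array of lanes.
template <typename T>
struct Vec128 {
  static constexpr std::size_t kLanes = 16 / sizeof(T);
  T lane[kLanes];
};

// Copies kLanes elements from memory into a vector.
template <typename T>
Vec128<T> Load(const T* from) {
  Vec128<T> v;
  for (std::size_t i = 0; i < Vec128<T>::kLanes; ++i) v.lane[i] = from[i];
  return v;
}

// Lane-wise addition, written as an ordinary scalar loop.
template <typename T>
Vec128<T> Add(const Vec128<T>& a, const Vec128<T>& b) {
  Vec128<T> sum;
  for (std::size_t i = 0; i < Vec128<T>::kLanes; ++i) {
    sum.lane[i] = static_cast<T>(a.lane[i] + b.lane[i]);
  }
  return sum;
}

// Copies kLanes elements from a vector back to memory.
template <typename T>
void Store(const Vec128<T>& v, T* to) {
  for (std::size_t i = 0; i < Vec128<T>::kLanes; ++i) to[i] = v.lane[i];
}

// Adds two float arrays; count is assumed to be a multiple of kLanes (4).
void AddArrays(const float* a, const float* b, float* out, std::size_t count) {
  constexpr std::size_t kN = Vec128<float>::kLanes;
  for (std::size_t i = 0; i < count; i += kN) {
    Store(Add(Load(a + i), Load(b + i)), out + i);
  }
}
```

<p>Compiling something like this with optimizations and inspecting the assembly is one way to see whether the loops were turned back into vector instructions.</p>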
]]></description><pubDate>Sun, 17 May 2026 10:48:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=48167746</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48167746</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48167746</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>In such discussions, whenever you mention abstractions are <i>universally</i> "pretty poor", to the extent anyone is listening, I think this hyperbole can do real damage. Maybe it prevents people from getting relevant performance gains, even if not 100% of the optimum, which is anyway unattainable. And what is the alternative? Not many projects can afford to hand write intrinsics for all platforms. And are you aware that Highway is basically a thin wrapper over intrinsics, which you can still drop down to where it helps?</p>
]]></description><pubDate>Sun, 17 May 2026 09:50:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=48167454</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48167454</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48167454</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>This works today :) Highway provides such an abstraction for arbitrary vector lengths and maps them to intrinsics. All on the library level, no need to wait years for compiler or language updates.</p>
]]></description><pubDate>Sun, 17 May 2026 09:39:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=48167411</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48167411</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48167411</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>:)
I agree a tutorial would be helpful. We are working on one with Fastcode.</p>
]]></description><pubDate>Sun, 17 May 2026 09:30:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=48167368</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48167368</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48167368</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>Have you considered our Highway library? Runtime dispatch need not be a PITA :) It's basically portable intrinsics, and a much more complete set (>300) than the ~50 in std.</p>
]]></description><pubDate>Sun, 17 May 2026 09:21:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=48167321</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=48167321</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48167321</guid></item><item><title><![CDATA[New comment by janwas in "C++26 Shipped a SIMD Library Nobody Asked For"]]></title><description><![CDATA[
<p>Highway TL here. I agree with the main points, with a few clarifications:<p>> tag-dispatched free functions like hn::Mul(d, a, b)<p>We only require tags for certain ops, mainly memory, casts and reduction; not arithmetic. Operator overloading is supported but until recently compilers didn't allow that for SVE vectors.<p>> It’s a Google project with Google-scale maintenance, but the bus factor is real — the core development is driven by a small team<p>We have 101 contributors, including 14 current or former Googlers in several teams.<p>> being length-agnostic means you can’t easily express fixed-width algorithms that depend on knowing the vector size at compile time, which is common in cryptography and codec work<p>We explicitly support fixed-length 128-bit vectors, acknowledging that these are common and important.</p>
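<p>To make the tag point concrete, here is a self-contained sketch that mimics the calling convention with hypothetical types (<code>Tag</code>, <code>Vec</code>, <code>Dot4</code>). It is not Highway's implementation, only an assumption-laden toy: the memory op and the reduction take a tag, while arithmetic deduces everything from its vector arguments and can also be spelled with an operator.</p>

```cpp
#include <cstddef>

// Hypothetical tag: encodes lane type and lane count, holds no data.
template <typename T, std::size_t N>
struct Tag {};

// Hypothetical vector type for this sketch.
template <typename T, std::size_t N>
struct Vec {
  T lane[N];
};

// Memory op: takes a tag, because a pointer alone does not say how many
// lanes to load.
template <typename T, std::size_t N>
Vec<T, N> Load(Tag<T, N>, const T* from) {
  Vec<T, N> v;
  for (std::size_t i = 0; i < N; ++i) v.lane[i] = from[i];
  return v;
}

// Arithmetic: no tag, the types are deduced from the vector arguments.
template <typename T, std::size_t N>
Vec<T, N> Mul(const Vec<T, N>& a, const Vec<T, N>& b) {
  Vec<T, N> product;
  for (std::size_t i = 0; i < N; ++i) product.lane[i] = a.lane[i] * b.lane[i];
  return product;
}

// Operator form of the same arithmetic op.
template <typename T, std::size_t N>
Vec<T, N> operator*(const Vec<T, N>& a, const Vec<T, N>& b) {
  return Mul(a, b);
}

// Reduction: takes a tag in this sketch, matching the list in the comment.
template <typename T, std::size_t N>
T ReduceSum(Tag<T, N>, const Vec<T, N>& v) {
  T sum = T(0);
  for (std::size_t i = 0; i < N; ++i) sum += v.lane[i];
  return sum;
}

// Dot product of two 4-element float arrays.
float Dot4(const float* a, const float* b) {
  const Tag<float, 4> d{};
  return ReduceSum(d, Load(d, a) * Load(d, b));
}
```

<p>In <code>Dot4</code> the tag appears only in <code>Load</code> and <code>ReduceSum</code>; the multiply needs nothing beyond its operands.</p>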
]]></description><pubDate>Mon, 23 Mar 2026 06:46:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=47486150</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=47486150</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47486150</guid></item><item><title><![CDATA[New comment by janwas in "RISC-V Is Sloooow"]]></title><description><![CDATA[
<p>Looks like the ratification plan for Zvzip is November. So maybe 3y until HW is actually usable?
That's a neat trick with wmacc, congrats. But still, half the speed for quite a fundamental operation that has been heavily used in other ISAs for 20+ years :(<p>Great that you did a gap analysis [1]. I'm curious if one of the inputs for that was the list of Highway ops [2]?<p>[1]: <a href="https://gist.github.com/camel-cdr/99a41367d6529f390d25e36ca3e4b626" rel="nofollow">https://gist.github.com/camel-cdr/99a41367d6529f390d25e36ca3...</a>
[2]: <a href="https://github.com/google/highway/blob/master/g3doc/quick_reference.md" rel="nofollow">https://github.com/google/highway/blob/master/g3doc/quick_re...</a></p>
]]></description><pubDate>Sat, 14 Mar 2026 08:34:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=47374562</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=47374562</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47374562</guid></item><item><title><![CDATA[New comment by janwas in "RISC-V Is Sloooow"]]></title><description><![CDATA[
<p>(Personal opinion)
I get the impression that RISC-V-related discussions often lack awareness of prior work/alternatives. A large amount of (x86) software actually uses our Highway library to run on whatever size vectors <i>and instructions</i> the CPU offers.<p>This works quite well in practice. As to leaving performance on the table, it seems RVV has some egregious performance differences/cliffs. For example, should we use vrgather (with what LMUL), or interesting workarounds such as widening+slide1, to implement a basic operation such as interleaving two vectors?</p>
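<p>For readers unfamiliar with the operation, a scalar sketch of what "interleaving two vectors" means, plus one arithmetic way to read the widening idea. The function names are hypothetical and this is not an RVV instruction sequence; it only shows the result that any implementation has to produce.</p>

```cpp
#include <cstddef>
#include <cstdint>

// Reference semantics: out = a0, b0, a1, b1, ...
// out must have room for 2 * count bytes.
void InterleaveScalar(const std::uint8_t* a, const std::uint8_t* b,
                      std::size_t count, std::uint8_t* out) {
  for (std::size_t i = 0; i < count; ++i) {
    out[2 * i + 0] = a[i];
    out[2 * i + 1] = b[i];
  }
}

// Widening formulation (an assumption about how such workarounds operate):
// each 16-bit lane holds a[i] in its low byte and b[i] in its high byte,
// i.e. a[i] + 256 * b[i].
void InterleaveWidened(const std::uint8_t* a, const std::uint8_t* b,
                       std::size_t count, std::uint16_t* out) {
  for (std::size_t i = 0; i < count; ++i) {
    out[i] = static_cast<std::uint16_t>(a[i] + 256u * b[i]);
  }
}
```

<p>On a little-endian machine, storing those 16-bit lanes to memory yields the same byte sequence as the scalar reference, which is why a widening add or multiply-accumulate can stand in for a dedicated interleave instruction.</p>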
]]></description><pubDate>Fri, 13 Mar 2026 19:54:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=47368968</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=47368968</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47368968</guid></item><item><title><![CDATA[New comment by janwas in "I have written gemma3 inference in pure C"]]></title><description><![CDATA[
<p>:D Your code was nicely written and it was a pleasure to port to SIMD because it was already very data-parallel.</p>
]]></description><pubDate>Wed, 28 Jan 2026 20:28:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=46801069</link><dc:creator>janwas</dc:creator><comments>https://news.ycombinator.com/item?id=46801069</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46801069</guid></item></channel></rss>