Hacker News: shihab

New comment by shihab in "The United States and Israel have launched a major attack on Iran"

shihab — Sat, 28 Feb 2026 17:27:50 +0000

I brought up Israeli-American donors because that’s what is relevant in the context of the story we’re discussing. We are talking about a war many right-wing Israelis have wanted for decades. If it were a general discussion about Citizens United and I had focused on lobbying from only this group, perhaps your argument would have held water.

Anyway, here’s Trump himself detailing the extraordinary access to the White House this lobbying bought the Adelsons:

https://www.reuters.com/world/us/trump-salutes-mega-donor-mi...

New comment by shihab in "The United States and Israel have launched a major attack on Iran"

shihab — Sat, 28 Feb 2026 10:34:57 +0000

Exactly what part of my statement was dog whistling? Can you stop throwing around this serious accusation of antisemitism without any attempt to substantiate your claim?

New comment by shihab in "The United States and Israel have launched a major attack on Iran"

shihab — Sat, 28 Feb 2026 08:50:38 +0000

Citizens United is an existential threat to the USA. You cannot have Israeli-American dual citizens pouring $200 million into elections, and that’s just her alone. This is simply not sustainable.

New comment by shihab in "The United States and Israel have launched a major attack on Iran"

shihab — Sat, 28 Feb 2026 08:35:49 +0000

Another Mideast war entirely on Israel’s behalf, another war Americans will pay taxes for and die for, just so Israel can keep grabbing a few parcels of land from Palestine.

New comment by shihab in "The Waymo World Model"

shihab — Fri, 06 Feb 2026 17:04:07 +0000

I think there are two steps here: converting video to sensor data input, and using that sensor data to drive. Only the second step will be handled by cars on the road; the first one is purely for training.

New comment by shihab in "Jeffrey Epstein's Money Mingled with Silicon Valley Startups"

shihab — Thu, 05 Feb 2026 17:20:14 +0000

The article strictly talks about people who were pals with him _after_ his pedophilia conviction. And please don't do this strawman of an "evil person eating babies"; nobody sane is claiming that.

New comment by shihab in "Rust’s Standard Library on the GPU"

shihab — Wed, 28 Jan 2026 06:11:08 +0000

I work with GPUs and I'm also trying to understand the motivations here.

Side note & a hot take: that sort of abstraction never really existed for GPUs, and it's going to be even harder now as Nvidia et al. race to put more & more specialized hardware bits inside GPUs.

New comment by shihab in "Rust’s Standard Library on the GPU"

shihab — Wed, 28 Jan 2026 05:55:36 +0000

To the author (or anyone from the VectorWare team), can you please give me, admittedly a skeptic, a motivating example of a "GPU-native" application?

That is, where does it truly make a difference to dispatch the non-parallel parts, syscalls, etc. from GPU to CPU instead of dispatching the parallel parts of the code from CPU to GPU?

From the "Announcing VectorWare" page:

> Even after opting in, the CPU is in control and orchestrates work on the GPU.

Isn't it better to let CPUs be in control and orchestrate things as GPUs have much smaller, dumber cores?

> Furthermore, if you look at the software kernels that run on the GPU they are simplistic with low cyclomatic complexity.

Again, there's an obvious reason why people don't put branch-y code on GPUs.

Genuinely curious what I'm missing.

New comment by shihab in "SIMD programming in pure Rust"

shihab — Wed, 21 Jan 2026 23:30:24 +0000

> For example, NEON ... can hold up to 32 128-bit vectors to perform your operations without having to touch the "slow" memory.

Something I recently learnt: the actual number of physical registers in modern x86 CPUs is significantly larger, even for 512-bit SIMD. Zen 5 CPUs actually have 384 vector registers, 384*512b = 24KB!
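
The arithmetic in that comment can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 384-entry, 512-bit figure is the one quoted above, the 32 x 128-bit figure is the architectural NEON register count from the quoted article, and `register_file_bytes` is a helper name made up for illustration.

```python
# Back-of-the-envelope size of a vector register file.
# Figures come from the comment above; the helper name is hypothetical.

def register_file_bytes(num_registers: int, bits_per_register: int) -> int:
    """Total storage, in bytes, of num_registers registers of the given width."""
    return num_registers * bits_per_register // 8

# Physical 512-bit vector registers quoted for Zen 5.
physical = register_file_bytes(384, 512)

# Architectural NEON registers quoted in the article: 32 x 128-bit.
architectural_neon = register_file_bytes(32, 128)

print(physical, "bytes =", physical // 1024, "KiB")
print(architectural_neon, "bytes")
```

So the quoted physical register file is 48 times larger than the 32 architectural NEON registers the article mentions.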

New comment by shihab in "Batmobile: 10-20x Faster CUDA Kernels for Equivariant Graph Neural Networks"

shihab — Wed, 21 Jan 2026 14:54:33 +0000

I'm not asking an academic program first published 8 years ago (e3nn) to beat the actively developed cuEquivariance library. An academic proposing new algorithms doesn't need to worry too much about performance. But any new work that focuses on performance, which includes this blog and a huge number of academic papers published every year, should absolutely use the latest vendor libraries as a baseline.

New comment by shihab in "Batmobile: 10-20x Faster CUDA Kernels for Equivariant Graph Neural Networks"

shihab — Wed, 21 Jan 2026 14:24:39 +0000

I should note PETSc is a big piece of software that does a lot of things. It also wraps many libraries, and those might ultimately dictate actual performance depending on what you plan on doing.

New comment by shihab in "Batmobile: 10-20x Faster CUDA Kernels for Equivariant Graph Neural Networks"

shihab — Wed, 21 Jan 2026 14:10:53 +0000

To be practically useful, we don't need to beat vendors; just getting close would be enough, by virtue of being open-source (and often portable). But I found, as an example, PETSc to be ~10x slower than MKL on CPU and CUDA on GPU; it still doesn't have native shared-memory parallelism support on CPU, etc.

New comment by shihab in "Batmobile: 10-20x Faster CUDA Kernels for Equivariant Graph Neural Networks"

shihab — Wed, 21 Jan 2026 13:00:50 +0000

Hi, I just wanted to note that e3nn is more of an academic software package that's a bit high-level by design. A better baseline for comparison would be Nvidia's cuEquivariance, which does pretty much the same thing as you did: take e3nn and optimize it for GPU.

As an HPC developer, it breaks my heart how much worse academic software performance is compared to vendor libraries (from Intel or Nvidia). We need to start aiming much higher.

New comment by shihab in "AVX-512: First Impressions on Performance and Programmability"

shihab — Mon, 19 Jan 2026 17:42:22 +0000

Hi, I actually mentioned ISPC several times there. And although I strenuously avoided crowning one approach "better" over the other, it is worth pointing out that 1) many of these benefits of ISPC can be had from explicit SIMD libraries like Google's Highway, and 2) ISPC (or any SIMT model) is a departure from how the underlying hardware works, and as the AI community is discovering with GPUs, this abstraction can sometimes be a lot more of a headache than it's worth.

New comment by shihab in "AVX-512: First Impressions on Performance and Programmability"

shihab — Mon, 19 Jan 2026 17:20:00 +0000

No. Assuming `k` is small enough, which in practice it often is, the arithmetic intensity of this kernel is 25-90 Flops/Byte, way above the roofline knee of any modern CPU.
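
For readers unfamiliar with the roofline argument, here is a minimal sketch of what "above the knee" means. The peak compute and memory bandwidth numbers below are invented placeholders, not measurements of any real CPU, and the function names are hypothetical; only the 25 Flops/Byte lower bound comes from the comment.

```python
# Minimal roofline model sketch. Machine numbers are made up for illustration.

def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (Flops/Byte) at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs

def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: limited by either memory traffic or peak compute."""
    return min(peak_gflops, intensity * bandwidth_gbs)

# Hypothetical machine (assumption): 1000 GFLOP/s peak, 100 GB/s DRAM bandwidth.
PEAK, BW = 1000.0, 100.0

knee = ridge_point(PEAK, BW)                    # 10 Flops/Byte for these numbers
low = attainable_gflops(2.0, PEAK, BW)          # memory-bound: capped by bandwidth
high = attainable_gflops(25.0, PEAK, BW)        # compute-bound: capped by peak

print(f"knee={knee} Flops/Byte, AI=2 -> {low} GFLOP/s, AI=25 -> {high} GFLOP/s")
```

Under these assumed numbers, a kernel at 25 Flops/Byte sits to the right of the knee, so its bound is the compute peak rather than memory bandwidth.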

New comment by shihab in "AVX-512: First Impressions on Performance and Programmability"

shihab — Mon, 19 Jan 2026 17:17:50 +0000

Hi, thanks for reading.

Re (b), I'm curious what that middle ground is. Is there any simple refactor to help GCC get rid of this `if`? (Note, ISPC did fine here.)

Regarding (a), one of the points I wanted to get across was that, in the end, it didn't feel as complicated to program as I had thought. Porting to AVX-512 felt mechanical (hence the success of LLMs in one-shotting the whole thing).

This is a subjective opinion that depends on the programmer's experience etc., so I won't dwell on it. I just wish more CPU programmers gave it a try.

New comment by shihab in "AVX-512: First Impressions on Performance and Programmability"

shihab — Wed, 14 Jan 2026 12:01:56 +0000

Yeah, N is big enough that the entire data doesn't fit in the cache, but the memory access pattern here is the next best thing: totally linear, predictable access. I remember seeing around 94%+ L1d cache hit rate.

AVX-512: First Impressions on Performance and Programmability

shihab — Wed, 14 Jan 2026 00:43:36 +0000

Article URL: https://shihab-shahriar.github.io//blog/2026/AVX-512-First-Impressions-on-Performance-and-Programmability/

Comments URL: https://news.ycombinator.com/item?id=46610800

Points: 125

# Comments: 53

New comment by shihab in "A Couple 3D AABB Tricks"

shihab — Mon, 12 Jan 2026 19:57:23 +0000

For SIMD at least, the {mins[3], maxs[3]} representation aligns more naturally with actual instructions on x86. To compute a new bounding box:

new_box.mins = _mm_min_ps(a.mins, b.mins);

new_box.maxs = _mm_max_ps(a.maxs, b.maxs);
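
To spell out why that representation is convenient, here is a scalar sketch of the same idea in Python. The `Aabb` class and `union` function are hypothetical names chosen for illustration; the point is only that merging two boxes is one component-wise min plus one component-wise max, which is the shape of work that `_mm_min_ps` and `_mm_max_ps` each do in a single instruction on packed floats.

```python
# Scalar sketch of merging two axis-aligned bounding boxes stored as (mins, maxs).
# Aabb and union are illustrative names, not from any particular library.
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]

@dataclass(frozen=True)
class Aabb:
    mins: Vec3  # lower corner (x, y, z)
    maxs: Vec3  # upper corner (x, y, z)

def union(a: Aabb, b: Aabb) -> Aabb:
    """Smallest box containing both a and b."""
    # Component-wise min of the lower corners (the role of _mm_min_ps).
    mins = tuple(min(x, y) for x, y in zip(a.mins, b.mins))
    # Component-wise max of the upper corners (the role of _mm_max_ps).
    maxs = tuple(max(x, y) for x, y in zip(a.maxs, b.maxs))
    return Aabb(mins, maxs)

a = Aabb((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
b = Aabb((-1.0, 0.5, 0.0), (0.5, 2.0, 1.0))
merged = union(a, b)
print(merged)
```

A center-plus-extent representation would need extra arithmetic to recover the corners before taking the min and max, which is why the corner form maps more directly onto the instructions.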

New comment by shihab in "The unreasonable effectiveness of the Fourier transform"

shihab — Fri, 09 Jan 2026 01:03:48 +0000

If you are from the ML/data science world, the analogy that finally unlocked FFT for me is feature size reduction using Principal Component Analysis. In both cases, you project data to a new "better" co-ordinate system ("time to frequency domain"), filter out the basis vectors that have low variance ("ignore high-frequency waves"), and project data back to real space from those truncated dimensions ("IFFT: inverse transform to time domain").

Of course some differences exist (e.g. basis vectors are fixed in FFT, unlike PCA).
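
The three steps of the analogy (project, truncate, project back) can be sketched in plain Python. This is an illustration under assumptions, not an efficient implementation: it uses a naive O(n^2) DFT instead of a real FFT so that it needs nothing beyond the standard library, and the signal, the cutoff `keep`, and the function names are all made up for the example.

```python
# Low-pass filtering written as "project, truncate, project back".
# Naive DFT for self-containment; a real FFT computes the same transform faster.
import cmath
import math

def dft(x):
    """Project the signal onto the fixed sinusoidal basis (time -> frequency)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Project back from the frequency domain to the time domain."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def low_pass(x, keep):
    """Zero every frequency bin above `keep`, then transform back."""
    n = len(x)
    X = dft(x)
    for k in range(n):
        freq = min(k, n - k)  # bins k and n-k carry the same frequency
        if freq > keep:
            X[k] = 0          # drop this basis vector, like a discarded PCA component
    return [v.real for v in idft(X)]

n = 64
slow = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]             # 2 cycles
noisy = [s + 0.3 * math.sin(2 * math.pi * 20 * t / n)                     # plus 20 cycles
         for t, s in enumerate(slow)]

smooth = low_pass(noisy, keep=4)
max_err = max(abs(a - b) for a, b in zip(smooth, slow))
print("max deviation from the slow component:", max_err)
```

Because the two test frequencies fall exactly on DFT bins, truncating above bin 4 recovers the slow component up to floating-point error. As the comment notes, the basis here is fixed in advance, whereas PCA would learn its basis from the data.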