Hacker News: ribit

New comment by ribit in "Trump to impose $100k fee for H-1B worker visas, White House says"

ribit — Sat, 20 Sep 2025 05:26:49 +0000

Yep. My wife just started as a professor (humanities) and she entered on an H1B visa last week, as a green card takes years to obtain. I have been offered a teaching job at the same institution as a partner hire and they have filed an H1B petition for me.

Unless they clarify that education is exempt from these rules, my wife will surely have to quit her new job. She is supposed to go on fieldwork later this year and she won’t be able to re-enter. Not to mention I can kiss my lecturer offer goodbye. This is an incredibly retarded situation.

New comment by ribit in "Don't "optimize" conditional moves in shaders with mix()+step()"

ribit — Tue, 11 Feb 2025 14:45:33 +0000

Execution with masking is pretty much how branching works on GPUs. What’s more relevant, however, is that conditional statements add overhead in terms of additional instructions and execution state management. Eliminating small branches using conditional moves or manual masking can be a performance win.
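
To make the masking point concrete, here is a toy sketch in Python. It is not real shader code: plain lists stand in for vector registers, and the function names and the example branch are made up for illustration. The first function walks both sides of a branch under an execution mask, the second computes both sides unconditionally and selects per lane.

```python
# Toy model of a SIMD unit; lists stand in for vector registers (illustration only).

def branch_with_masking(x):
    """if x < 0: y = -x, else: y = 2 * x, executed the way a masked SIMD unit would."""
    mask = [v < 0 for v in x]          # per-lane condition becomes the execution mask
    y = [0] * len(x)
    # "then" side: all lanes step through it, only masked-in lanes write a result
    for i, active in enumerate(mask):
        if active:
            y[i] = -x[i]
    # "else" side: issued again with the inverted mask
    for i, active in enumerate(mask):
        if not active:
            y[i] = 2 * x[i]
    return y


def branch_free_select(x):
    """Same result: compute both sides for every lane, then pick one per lane."""
    then_val = [-v for v in x]
    else_val = [2 * v for v in x]
    return [t if v < 0 else e for v, t, e in zip(x, then_val, else_val)]
```

Both versions do the arithmetic for both sides. What the select version drops is the mask bookkeeping around the branch, which is where the saving for small branches would come from.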

New comment by ribit in "What's Next for WebGPU"

ribit — Fri, 22 Nov 2024 14:56:51 +0000

Quick note: I looked at the bindless proposal linked from the blog post and their description of Metal is quite outdated. MTLArgumentEncoder has been deprecated for a while now; the layout is a transparent C struct that you populate at will with GPU addresses. There are still descriptors for textures and samplers, but these are hidden from the user (the API will maintain internal tables). It's a very convenient model and probably the simplest and most flexible of all current APIs. I'd love to see something similar for WebGPU.

New comment by ribit in "AAA Gaming on Asahi Linux"

ribit — Fri, 11 Oct 2024 13:02:55 +0000

No.

New comment by ribit in "AAA Gaming on Asahi Linux"

ribit — Fri, 11 Oct 2024 13:02:39 +0000

The M3 GPU uses a new instruction encoding, among other things. Also, it has a new memory partitioning scheme (aka Dynamic Caching), which probably requires a bunch of changes to both the driver interface and the shader compiler. I hope the Asahi team will get to publishing the details of M3 soon; I have been curious about this for a while.

New comment by ribit in "AAA Gaming on Asahi Linux"

ribit — Fri, 11 Oct 2024 07:21:43 +0000

Are you talking about Vulkan or about geometry shaders? The latter is simple: because geometry shaders are a badly designed feature that sucks on modern GPUs. Apple has designed Metal to only support things that are actually fast. Their solution for geometry generation is mesh shaders, which is a modern and scalable feature that actually works.

If you are talking about Vulkan, that is much more complicated. My guess is that they want to maintain their independence as a hardware and software innovator. Hard to do that if you are locked into a design-by-committee API. Apple has had some bad experience with these things in the past (e.g. they donated OpenCL to Khronos only to see it sabotaged by Nvidia). Also, Apple wanted a lean and easy-to-learn GPU API for their platform, and Vulkan is neither.

While their stance can be annoying to both developers and users, I think it can be understood at some level. My feelings about Vulkan are mixed at best. I don't think it is a very good API, and I think it makes too many unnecessary compromises. Compare, for example, VK_EXT_descriptor_buffer and Apple's argument buffers. Vulkan's approach is extremely convoluted: you are required to query descriptor sizes at runtime and perform manual offset computation. Apple's implementation is just 64-bit handles/pointers and memcpy, extremely lean and immediately understandable to anyone with basic C experience. I understand that Vulkan needs to support different types of hardware where these details can differ. However, I do not understand why they have to penalize developer experience in order to support some crazy hardware with 256-byte data descriptors.
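
A rough sketch of the difference, using Python's ctypes as a stand-in for C. Nothing here calls Metal or Vulkan: the struct layout, the addresses and the descriptor sizes are all made up for illustration, and alignment rules are ignored.

```python
import ctypes


class MaterialArgs(ctypes.Structure):
    # Hypothetical argument layout: three 64-bit GPU addresses/handles.
    _fields_ = [
        ("vertices", ctypes.c_uint64),
        ("albedo_texture", ctypes.c_uint64),
        ("sampler", ctypes.c_uint64),
    ]


# Pointer-style model: fill a plain struct and copy it into the argument buffer.
args = MaterialArgs(0x1000, 0x2000, 0x3000)           # made-up addresses
arg_buffer = (ctypes.c_ubyte * ctypes.sizeof(args))()
ctypes.memmove(arg_buffer, ctypes.byref(args), ctypes.sizeof(args))


# Descriptor-buffer-style model: sizes are only known at run time,
# so every offset has to be computed by hand before anything is written.
def descriptor_offsets(kinds, size_of):
    offsets, cursor = [], 0
    for kind in kinds:
        offsets.append(cursor)
        cursor += size_of[kind]
    return offsets, cursor


# Stand-ins for values a driver would report; not real hardware numbers.
queried_sizes = {"buffer": 16, "texture": 64, "sampler": 32}
offsets, total = descriptor_offsets(["buffer", "texture", "sampler"], queried_sizes)
```

In the first model the layout is known at compile time and can be shared with the shader as an ordinary struct. In the second it depends on what the device reports, so the layout logic has to live in application code.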

New comment by ribit in "AAA Gaming on Asahi Linux"

ribit — Fri, 11 Oct 2024 07:08:59 +0000

Apple not supporting Vulkan is a business decision. They wanted a lean and easy to learn API that they can quickly iterate upon, and they want you to optimize for their hardware. Vulkan does not cater to either of these goals.

Interestingly, Apple was on the list of the initial Vulkan backers — but they pulled out at some point before the first version was released. I suppose they saw the API moving in the direction they were not interested in. So far, their strategy has been a mixed bag. They failed to attract substantial developer interest, at the same time they delivered what I consider to be the best general-purpose GPU API around.

Regarding programmable tessellation, Apple's approach is mesh shaders. As far as I am aware, they are the only platform that offers standard mesh shader functionality across all devices.

New comment by ribit in "A new approach to error handling"

ribit — Wed, 31 Jul 2024 09:59:31 +0000

Have you looked at the Swift error model? I really like their design. They use a dedicated try keyword to mark call sites that can fail. Note that try is not the same as try...catch: Swift has an additional block construct (do...catch) for catching errors. This design makes sure that you always know where errors can occur when reading the program code, but avoids all the ergonomic issues you mention.

Your model seems to be very similar to the traditional implicit model used by languages such as C++, only that you allow switching between implicit and explicit error propagation. I am not sure how useful this is in practice, as it creates inconsistency.

New comment by ribit in "A new approach to error handling"

ribit — Tue, 30 Jul 2024 18:31:17 +0000

Is this really a new approach? On a cursory look this seems like implicit error propagation with checked exceptions. I am also curious about the author's presentation of the topic. To me, an important feature of error handling design is whether fallible contexts are marked (e.g., with a try statement) or not.

New comment by ribit in "Zen 5's 2-ahead branch predictor: how a 30 year old idea allows for new tricks"

ribit — Sat, 27 Jul 2024 17:47:59 +0000

While I understand the argument, it would also be good to see some empirical evidence. So far, all x86 designs need more power to reach the same performance level as ARM. Of course, Apple is still the outlier.

New comment by ribit in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"

ribit — Thu, 13 Jun 2024 13:48:46 +0000

> Yes, that my understanding, and that's why I claim it's different from "classical" SIMD

I understand, yes, it makes sense. Of course, other architectures can make other optimizations, like selecting warps that are more likely to have data ready etc., but Nvidia's implementation does sound like a very smart approach.

> And let say you have 2 warps with complementary masking, with the Nvidia's SIMT uarch it goes naturally to issue both warps simultaneously and they can be executed at the same cycle within different ALU/core

That is indeed a powerful technique.

> It's not obvious what would mean "superscalar" in an SIMT context. For me a superscalar core is a core that can extract instruction parallelism from a sequential code (associated to a single thread) and therefore dispatch/issue/execute more that 1 instruction per cycle per thread.

Yes, I meant executing multiple instructions from the same warp/thread concurrently, depending on the execution granularity of course. Executing instructions from different warps in the same block is slightly different, since warps don't need to be at the same execution state. Applying the CPU terminology, a warp is more like a "CPU thread". It does seem like Nvidia indeed moved quite far into the SIMT direction and their threads/lanes can have independent program state. So I think I can see the validity of your argument that Nvidia can remap SIMD ALUs on the fly to suitable threads in order to achieve high hardware utilization.

> In the Nvidia case a "register-file cache" is a cache placed between the register-file and the operand-collector. And it makes sense in their case since the register-file have variable latency (depending on collision) and because it will save SRAM read power.

Got it, thanks!

P.S. By the way, wanted to thank you for this very interesting conversation. I learned a lot.

New comment by ribit in "Demystifying NPUs: Questions and Answers"

ribit — Wed, 12 Jun 2024 12:33:29 +0000

Most NPUs are not directly end-user programmable. The vendor usually provides a custom SDK that allows you to run models created with popular frameworks on their NPUs. Apple is a good example since they have been doing it for a while. They provide a framework called CoreML and tools for converting ML models from frameworks such as PyTorch into a proprietary format that CoreML can work with.

The main reason for this lack of direct programmability is that NPUs are fast-evolving, optimized technology. Hiding the low-level interface allows the designer to change the hardware implementation without affecting end-user software. For example, some NPUs can only work with specific data formats or layer types. Early NPUs were very simple convolution engines based on DSPs; newer designs also have built-in support for common activation functions, normalization, and quantization.

Maybe one day, these things will mature enough to have a standard programming interface. I am skeptical about this becoming a reality any time soon. Some companies (like Tenstorrent) are specifically working on open architectures that will be directly programmable, I'm not sure whether their approach translates to the embedded NPUs, though. What would be nice is an open graph-based API and a model format for specifying and encoding ML models.

New comment by ribit in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"

ribit — Wed, 12 Jun 2024 11:44:42 +0000

> Not sure what you mean by lockstep here. When an operand-collector entry is ready it dispatch it to execute as soon as possible (write arbitration aside) even if other operand-collector entries from the same warp are not ready yet (so not really what a would call "threads lock-step"). But it's possible that Nvidia enforce that all threads from a warp should complete before sending the next warp instruction (I would call it something like "instruction lock-step"). This can simplify data dependency hazard check. But that an implementation detail, it's not required by the SIMT scheme.

Hm, the way I understood it is that a single instruction is executed on a 16-wide SIMD unit, thus processing 16 elements/threads/lanes simultaneously (subject to execution mask of course). This is what I mean by "in lockstep". In my understanding the role of the operand collector was to make sure that all register arguments are available before the instruction starts executing. If the operand collector needs multiple cycles to fetch the arguments from the register file, the instruction execution would stall.

So you are saying that my understanding is incorrect and that the instruction can be executed in multiple passes with different masks depending on which arguments are available? What is the benefit as opposed to stalling and executing the instruction only when all arguments are available? To me it seems like the end result is the same, and stalling is simpler and probably more energy efficient (if EUs are power-gated).

> But, yes (warp) instruction is already scheduled, but (ALU) operation are re-scheduled by the operand-collector and it's dispatch. In the Nvidia patent they mention the possibility to dispatch operation in an order that prevent write collision for example.

Ah, that is interesting, so the operand collector provides a limited reordering capability to maximize hardware utilization, right? I must have missed that bit in the patent, that is a very smart idea.

> But it's possible that Nvidia enforce that all threads from a warp should complete before sending the next warp instruction (I would call it something like "instruction lock-step"). This can simplify data dependency hazard check. But that an implementation detail, it's not required by the SIMT scheme.

Is any existing GPU actually doing superscalar execution from the same software thread (I mean the program thread, i.e., warp, not a SIMT thread)? Many GPUs claim dual-issue capability, but that either refers to interleaved execution from different programs (Nvidia, Apple) or a SIMD-within-SIMT or maybe even a form of long instruction word (AMD). If I remember correctly, Nvidia instructions contain some scheduling information that tells the scheduler when it is safe to issue the next instruction from the same wave after the previous one started execution. I don't know how others do it, probably via some static instruction timing information. Apple does have a very recent patent describing dependency detection in an in-order processor, no idea whether it is intended for the GPU or something else.

> you have multiple multiple operand-collector entry to minimize the probability that no entry is ready. I should have say "to minimize bubbles".

I think this is essentially what some architectures describe as the "register file cache". What is nice about Nvidia's approach is that it seems to be fully automatic and can really make the best use of a constrained register file.

New comment by ribit in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"

ribit — Wed, 12 Jun 2024 11:25:38 +0000

Modern GPUs are exposing the SIMD behind the SIMT model and heavily investing into SIMD features such as shuffles, votes, and reduces. This leads to an interesting programming model. One interesting challenge is that flow control is done very differently on different hardware. AMD has a separate scalar instruction pipeline which can set the SIMD mask. Apple uses an interesting per-lane stack counter approach where a value of zero means that the lane is active and a non-zero value indicates how many blocks need to be exited for the thread to become active again. Not really sure how Nvidia does it.
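
Here is a small Python simulation of the per-lane counter idea as described above. It is my reading of the scheme, not a documented hardware specification, and the class and method names are invented for the sketch.

```python
class LaneCounters:
    """Per-lane nesting counters: 0 = active, n > 0 = n blocks must close first."""

    def __init__(self, lanes):
        self.count = [0] * lanes

    def active(self):
        return [c == 0 for c in self.count]

    def if_(self, cond):
        # Active lanes that fail the condition park at depth 1;
        # lanes that were already parked go one level deeper.
        self.count = [
            (0 if cond[i] else 1) if c == 0 else c + 1
            for i, c in enumerate(self.count)
        ]

    def else_(self):
        # Swap lanes running the "then" side (0) with lanes parked by this "if" (1).
        self.count = [1 if c == 0 else 0 if c == 1 else c for c in self.count]

    def end_if(self):
        # Leaving the block brings every parked lane one level closer to active.
        self.count = [c - 1 if c > 0 else 0 for c in self.count]
```

The nice property is that nested control flow needs no explicit mask stack: a lane disabled by an outer block just keeps counting the inner blocks it skips, and becomes active again once its counter reaches zero.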

New comment by ribit in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"

ribit — Wed, 12 Jun 2024 05:29:59 +0000

In an operand-collector architecture the threads are still executed in lockstep. I don't think this makes the basic architecture less "SIMD-y". Operand collectors are a smart way to avoid multi-ported register files, which enables more compact implementation. Different vendors use different approaches to achieve a similar result. Nvidia uses operand collectors, Apple uses explicit cache control flags etc.

> This enable to read from the register-file in an asynchronous fashion (by "asynchronous" here I mean not all at the same cycle) without introducing any stall.

You can still get stalls if an EU is available in a given cycle but not all operands have been collected yet. The way I understand the published patents is that operand collectors are a data gateway to the SIMD units. The instructions are already scheduled at this point and the job of the collector is to signal whether the data is ready. Do modern Nvidia implementations actually reorder instructions based on feedback from operand collectors?

> That why (or 1 of the reason) you need to sync your threads in the SIMT programing model and not in an SIMD programming model.

It is my understanding that you need to synchronize threads when accessing shared memory. Not only can different threads execute on different SIMD units, but threads on the same SIMD unit can also access shared memory over multiple cycles on some architectures. I do not see how thread synchronization relates to operand collectors.

New comment by ribit in "SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)"

ribit — Tue, 11 Jun 2024 21:27:10 +0000

How would you envision that working at the hardware level? GPUs are massively parallel devices; they need to keep the scheduler and ALU logic as simple and compact as possible. SIMD is a natural way to implement this. In the real world, SIMT is just SIMD with some additional capabilities for control flow and a programming model that focuses on SIMD lanes as threads of execution.

What’s interesting is that modern SIMT is exposing quite a lot of its SIMD underpinnings, because that allows you to implement things much more efficiently. A hardware-accelerated SIMD sum is way faster than adding values in shared memory.
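
As an illustration of what such a SIMD sum looks like, here is a butterfly reduction built from an XOR shuffle, simulated in Python. Lists stand in for the lanes of one SIMD group and the function names are made up; real hardware exposes this through shuffle or reduce intrinsics rather than anything like this code.

```python
def shuffle_xor(values, lane_mask):
    """Each lane reads the register of lane (i XOR lane_mask), like a SIMD shuffle."""
    return [values[i ^ lane_mask] for i in range(len(values))]


def simd_sum(values):
    """Butterfly reduction: log2(width) shuffle+add steps, no shared memory involved."""
    width = len(values)
    assert width & (width - 1) == 0, "lane count must be a power of two"
    offset = width // 2
    while offset:
        partner = shuffle_xor(values, offset)
        values = [a + b for a, b in zip(values, partner)]
        offset //= 2
    return values  # every lane now holds the full sum
```

For 32 lanes that is five shuffle+add steps that stay in registers, versus a round trip through shared memory with synchronization between the steps.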

New comment by ribit in "Apple's On-Device and Server Foundation Models"

ribit — Tue, 11 Jun 2024 19:17:37 +0000

You need to consider this in the context of the relevant task. Nvidia GPUs have extremely high peak performance for GEMM, but when working with LLMs, bandwidth (and RAM capacity) becomes the limiting factor. There is a reason why real ML-focused datacenter Nvidia GPUs use much wider RAM interfaces and a much higher price point. The M2 Ultra might not have the raw compute, but it has a lot of RAM and large caches.
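
A back-of-the-envelope way to see the bandwidth limit, as a Python sketch. It assumes single-stream decoding where every generated token has to read all the weights once, and it ignores the KV cache, batching and compute entirely. The model size is an arbitrary example; 800 GB/s is the memory bandwidth Apple quotes for the M2 Ultra.

```python
def max_tokens_per_second(weight_bytes, bandwidth_bytes_per_s):
    """Upper bound on decoding speed when each token reads all weights once."""
    return bandwidth_bytes_per_s / weight_bytes


# Illustrative inputs: a 70e9-parameter model at 2 bytes per weight, 800 GB/s memory.
weights = 70e9 * 2
print(max_tokens_per_second(weights, 800e9))
```

Under these assumptions the ceiling is set by bandwidth divided by model size, no matter how much GEMM throughput is available, and the model has to fit in memory in the first place.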

New comment by ribit in "Instruction Sets Should Be Free: The Case for RISC-V [pdf] (2014)"

ribit — Sat, 08 Jun 2024 12:04:39 +0000

I remember last year (?) Qualcomm proposing an ISA extension that brings ARM-like addressing modes and paired stores to RISC-V, and the community reaction being very negative. Happy to hear that there are now initiatives to streamline these proposals and make RISC-V a better fit for high-performance CPUs. I am looking forward to future developments!

New comment by ribit in "Instruction Sets Should Be Free: The Case for RISC-V [pdf] (2014)"

ribit — Sat, 08 Jun 2024 07:32:49 +0000

I fully support the idea of open instruction sets. I am not as sold on the idea of cookie-cutter, one-size-fits-all instruction sets. RISC-V is very nice for teaching CPU basics, and it is a great fit for tiny cores or specialized microcontrollers. Unfortunately, since it has been designed for simplicity, it appears to make building high-performance cores harder. The RISC-V philosophy for high-performance OoO cores relies on instruction fusion, and thus would require the compiler to emit fusion-friendly sequences for best performance, and these sequences might differ from CPU to CPU. To me this seems to go against the very idea of a common open ISA. We already see quite a lot of fragmentation and I fear it will only get worse as time goes on. More complex instructions that combine multiple processing steps would help, but it seems that the core RISC-V community is opposed to that idea for purely ideological reasons.

New comment by ribit in "Vulkan1.3 on the M1 in one month"

ribit — Thu, 06 Jun 2024 07:51:16 +0000

Bugs notwithstanding (which I agree are a significant concern for Metal), I'd frankly much prefer to work with a well-designed, streamlined API like Metal instead of the needlessly verbose and complex Vulkan.