Hacker News: ColonelPhantom

Supporting ATI TeraScale GPUs from 2007-2009 in RPCS3

ColonelPhantom — Tue, 21 Jul 2026 20:55:21 +0000

Article URL: https://blog.rpcs3.net/2026/07/21/supporting-terascale-gpus/

Comments URL: https://news.ycombinator.com/item?id=48998170

Points: 5

# Comments: 1

New comment by ColonelPhantom in "How I use HTMX with Go"

ColonelPhantom — Wed, 15 Jul 2026 09:05:58 +0000

Or HOGS? HTMX-OS-Go-Sqlite. While having "OS" in there is kind of redundant, it does make for a nice and general acronym.

New comment by ColonelPhantom in "Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?"

ColonelPhantom — Mon, 15 Jun 2026 23:22:50 +0000

Which model class requires an 80 GB VRAM GPU? From my perspective, popular models seem to be either in the ~30B range (Qwen3.6, Gemma 4) or, for the larger models (MiniMax, MiMo, StepFun, Deepseek), in the multiple hundreds of billions of parameters, for which 80 GB is simply too small.

You can just about reach the lower end of the latter category with a 128GB machine like a DGX Spark, Framework Desktop, or M5 Max, though those are usually not super fast. For the former category, you can easily run them fast with something like a 3090 or 5090, hell, probably even a 5060 Ti.

New comment by ColonelPhantom in "Nvidia is proposing a beast of a CPU system for Windows PCs"

ColonelPhantom — Sat, 06 Jun 2026 21:42:49 +0000

> the most likely experts

Is that how MoEs work? I thought that an important constraint for MoEs is that experts need to be uniformly used to make sure they can be used effectively. If there is a 'common subset', that, if anything, sounds like a symptom of undertraining (i.e. the same trick will not work as well for Deepseek V4.1).

Also, even if your MoE hitrate is 90%, you still spend half your time waiting for the SSD, giving similar total speed to a 27B model!
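The "half your time" figure can be checked with a quick sketch. It assumes an SSD read is about ten times slower than a RAM read (an order-of-magnitude assumption, not a measurement) and that fetch time scales inversely with bandwidth; the function name is hypothetical.

```python
def ssd_time_fraction(hit_rate: float, ssd_slowdown: float) -> float:
    """Fraction of total weight-fetch time spent waiting on the SSD.

    hit_rate: share of expert reads served from RAM (0..1).
    ssd_slowdown: how many times slower an SSD read is than a RAM read (assumed).
    """
    ram_time = hit_rate * 1.0                   # a RAM read costs 1 time unit
    ssd_time = (1.0 - hit_rate) * ssd_slowdown  # an SSD read costs ssd_slowdown units
    return ssd_time / (ram_time + ssd_time)


frac = ssd_time_fraction(hit_rate=0.90, ssd_slowdown=10.0)
total_vs_all_ram = 0.90 * 1.0 + 0.10 * 10.0     # total time relative to a 100% hit rate

print(f"SSD share of fetch time: {frac:.0%}")
print(f"Slowdown vs. everything in RAM: {total_vs_all_ram:.1f}x")
```

Under these assumptions the SSD accounts for a bit over half of the fetch time, and the model runs roughly 1.9x slower than if all experts sat in RAM. For a model with 13B active parameters, that is comparable to the memory traffic of a dense model in the mid-20B range.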

Finally, it looks like Deepseek V4 is pretty much only runnable with antirez's ds4, and SSD streaming only works with Metal; but I would like to try what you say with llama.cpp which uses mmap to also potentially do SSD streaming. (I can maybe try the large Qwen3.5 MoEs?)

> as context length increases

What kind of context length do you consider reasonable, though? From what I know, all models (even frontier ones) start degrading once you pass a few hundred thousand tokens. So realistically, limiting context size might even improve quality, especially if you use token-efficient harnesses.

> Or you could fit the dense portion of a much more capable model and still take advantage of that hardware.

Your point about consumer hardware was that it would be "borderline unusable" when running Qwen 3.6 27B. However, you need much less hardware to run a 27B than DSv4 Flash. In addition, you can do the same 'trick' with low-end GPUs and small MoEs: my desktop with 32 GB DDR4-3200 and an RTX 2070 8GB can run the ~30B class MoEs at 20-30 tokens per second, which is similar to the speed of my laptop.

New comment by ColonelPhantom in "Nvidia is proposing a beast of a CPU system for Windows PCs"

ColonelPhantom — Sat, 06 Jun 2026 20:45:48 +0000

Deepseek V4 Flash still has 13B active params though? That is about half as many as Qwen3.6-27B (and much more than Qwen3.6-35B-A3B). Given that RAM (even on a base M4 or 'regular' Intel/AMD system) is like an order of magnitude faster than an SSD, even Qwen 27B running from RAM will be much faster than any Deepseek V4 model with SSD offloading. And the MoE will be much faster still.

Qwen 27B is also small enough to completely fit in a high-end consumer or mid-end pro GPU, like an RTX 5090 or Radeon PRO R9700. I found results claiming 30 tokens per second generation for 27B(-Q4_K_XL) on an R9700. I doubt you get more than 5 tokens per second doing SSD MoE streaming.

Even for relatively short contexts, I honestly already find the ~30B class MoE models to be only borderline acceptable in terms of speed on my laptop (Ryzen 7 7840U, 64 GB LPDDR5-6400), though I use Gemma 4 26B-A4B more than Qwen3.6 35B-A3B.

New comment by ColonelPhantom in "Show HN: Rust but Lisp"

ColonelPhantom — Sun, 10 May 2026 08:32:08 +0000

Carp is memory safe via linear types + references, similar to Rust, so I would not describe it as C-like but rather Rust-like.

New comment by ColonelPhantom in "The text mode lie: why modern TUIs are a nightmare for accessibility"

ColonelPhantom — Mon, 04 May 2026 05:40:08 +0000

But what _is_ a "Text User Interface"? Google Images just returns what is being discussed here: "GUIs" that run in some kind of text mode. And to me, that's also what a TUI is.

A more textually oriented environment (like a normal Unix shell) is, in my experience, usually referred to as a CLI: Command Line Interface.

I did find an interesting hybrid in the Pi coding agent: it seems to leverage the normal terminal scrollback, while still enhancing it with things like transient input fields and status lines, so that it can display those without cluttering scrollback.

New comment by ColonelPhantom in "Framework Laptop 13 Pro"

ColonelPhantom — Tue, 21 Apr 2026 19:34:09 +0000

You mentioned Strix Halo, which also has off-die memory. Strix Halo does have a real advantage from its wider memory bus (four channels for 256 bit instead of 128 bit), but Strix Point is equivalent-ish to Intel's platforms like Panther Lake or Arrow Lake in terms of memory setup.

In fact, Intel also had Lunar Lake, which had on-package memory. However, it was still limited to 128-bit dual-channel, so there weren't really many performance benefits; it did however help with power efficiency.

New comment by ColonelPhantom in "Framework Laptop 13 Pro"

ColonelPhantom — Tue, 21 Apr 2026 19:30:58 +0000

Hilariously, those AMD chips are way behind the Intels in terms of memory.

First off, I believe that Intel has its memory far more "unified". AMD typically has a stricter VRAM/RAM 'tradeoff' setting that does not exist on Intel in the same way to my knowledge. (See how on Strix Halo systems, there is a thing about "allocating" 96 GB to the GPU, which seems to be needed sometimes but prevents the CPU from accessing that memory.)

Secondly, the Panther Lake board has LPDDR5X LPCAMM2 memory at 7467 MT/s, while the AMD boards are stuck with DDR5 SODIMMs at a meagre 5600 MT/s. In other words, the Intel board gets a third more memory bandwidth!
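As a rough check of the "a third more" claim, peak bandwidth is the transfer rate times the bus width. The sketch below assumes both platforms use a 128-bit bus (an assumption, not something stated for these specific boards); the function name is hypothetical.

```python
def peak_bandwidth_gbs(mt_per_s: float, bus_width_bits: int = 128) -> float:
    """Theoretical peak bandwidth in GB/s: transfers per second * bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return mt_per_s * 1e6 * bytes_per_transfer / 1e9


intel = peak_bandwidth_gbs(7467)   # LPDDR5X LPCAMM2 at 7467 MT/s
amd = peak_bandwidth_gbs(5600)     # DDR5 SODIMM at 5600 MT/s
ratio = intel / amd

print(f"Intel: {intel:.1f} GB/s, AMD: {amd:.1f} GB/s, ratio: {ratio:.2f}x")
```

With equal bus widths the width cancels out, so the ratio is simply 7467 / 5600, or about 1.33x.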

New comment by ColonelPhantom in "Every GPU That Mattered"

ColonelPhantom — Wed, 08 Apr 2026 02:34:23 +0000

Nvidia Turing (RTX 20) definitely marked a major shift IMO.

- It was the first card to enable real-time ray-traced effects.
- Mesh shaders are a significant overhaul of the geometry pipeline that's only recently getting real traction.
- Its tensor cores enabled a new generation of AI-driven upscaling/antialiasing. DLSS 2, FSR 4 and XeSS are all some variation of "TAA + neural networks", and these all rely on specialized matrix hardware to get optimal performance.

Obviously all of these features are now supported across all vendors. Intel Arc Alchemist has all of these features as well, and AMD got RT and mesh shader support with RDNA2 along with slowly building up to tensor cores with RDNA3/4. But Turing clearly debuted these features, which have majorly changed the landscape of realtime 3D graphics.

New comment by ColonelPhantom in "Intel Announces Arc Pro B70 and Arc Pro B65 GPUs"

ColonelPhantom — Thu, 26 Mar 2026 20:44:05 +0000

838 seems to be the real INT8 TOPS number for the 5090; going from 800 to 3400 takes an x2 speedup for sparsity (so skipping ops) and another x2 speedup for FP4 over INT8.
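The two doublings can be written out explicitly. This is only the arithmetic from the comment above, with hypothetical variable names; the 838 figure is taken from the comment, not independently verified.

```python
int8_dense_tops = 838        # dense INT8 figure quoted above
sparsity_factor = 2          # structured sparsity: half of the ops are skipped
fp4_over_int8_factor = 2     # FP4 runs at twice the INT8 rate

headline_tops = int8_dense_tops * sparsity_factor * fp4_over_int8_factor
print(f"Headline sparse FP4 figure: {headline_tops} TOPS")
```

The result lands close to the ~3400 headline number, which supports reading that figure as sparse FP4 rather than dense INT8.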

So it's closer to half the speed than a tenth. Intel also seems to be positioning this card against the RTX PRO 4000 Blackwell, not the 5090, and that one gets more like 300 INT8 TOPS. It also has less memory but at a slightly higher bandwidth. The 5090 is much faster and IIRC priced similarly to the PRO 4000, but is also decidedly a consumer product which, especially for Nvidia, comes with limitations (e.g. no server-friendly form factor cards available, and there are or used to be driver license restrictions that prevented using a consumer card in a data center setup).

New comment by ColonelPhantom in "ARM AGI CPU: Specs and SKUs"

ColonelPhantom — Wed, 25 Mar 2026 19:01:05 +0000

Aren't Intel Xeon Rapids and Intel Xeon Forest just different target markets? Rapids has fewer but faster cores in general, and more special-purpose accelerators (e.g. AMX, QAT), while Forest is focused on maximum compute density (just pack in as many fast-enough cores as you can).

IIRC Granite Rapids is also not _that_ old, and either current or a single generation behind. (Has its successor landed yet? IIRC GNR is the same generation as Sierra Forest).

New comment by ColonelPhantom in "Building an FPGA 3dfx Voodoo with Modern RTL Tools"

ColonelPhantom — Mon, 23 Mar 2026 12:49:38 +0000

Very cool! I am wondering one thing: how fast is it? Much of the "secret sauce" of the Voodoo is its high speed: a first-gen Verite or (God forbid) any ViRGE takes many more cycles for common operations like, say, Z-buffered pixels.

I'm guessing this isn't fully cycle-accurate, but is it at least somewhat "IPC-accurate"? I'm guessing yes? But much of that was also derived from Voodoo's (for the time) crazy high memory bandwidth AFAIK.

New comment by ColonelPhantom in "Tinybox – A powerful computer for deep learning"

ColonelPhantom — Mon, 23 Mar 2026 00:14:40 +0000

GPT-OSS is tailored to be extremely memory efficient. Not only does it natively use the 4.25 bit per parameter MXFP4 format, but it also uses sliding window attention for half of its layers. It also doesn't have that many layers, only 36 for the 120B version and 24 for the 20B version. (The 120B is also much much sparser than the 20B.)

I found a Reddit comment claiming only 36 KiB per token. With that, half a million tokens fits in 18 GB, which is less than one GPU. And three GPUs fit the parameters with room to spare (64 out of 72 GB).
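A back-of-the-envelope version of this estimate follows. It takes the 36 KiB per token figure at face value (unverified), treats "120B" as exactly 120e9 parameters all stored at 4.25 bits (a simplification), and assumes 24 GB per GPU, inferred from 72 GB across three GPUs. All names are hypothetical.

```python
KIB = 1024
GB = 1e9

# KV cache: per-token cost times context length
kv_bytes_per_token = 36 * KIB          # figure from the Reddit comment, unverified
context_tokens = 500_000
kv_cache_gb = kv_bytes_per_token * context_tokens / GB

# Weights: MXFP4 stores 4-bit values plus one shared 8-bit scale per 32-element block,
# which works out to 4 + 8/32 = 4.25 bits per weight.
nominal_params = 120e9                 # "120B" taken at face value (assumption)
bits_per_weight = 4 + 8 / 32
weights_gb = nominal_params * bits_per_weight / 8 / GB

gpu_memory_gb = 24                     # assumed per-GPU memory
print(f"KV cache for {context_tokens:,} tokens: {kv_cache_gb:.1f} GB")
print(f"Weights: {weights_gb:.2f} GB of {3 * gpu_memory_gb} GB on three GPUs")
```

Both results line up with the numbers above: roughly 18 GB of KV cache and roughly 64 GB of weights.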

New comment by ColonelPhantom in "NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute"

ColonelPhantom — Fri, 20 Mar 2026 13:28:11 +0000

Interesting; I was not aware of those "universal synthetics" but they make sense: a stronger reasoning base would make modeling tasks easier. Thanks for the link!

Again, though, if those work I assume they will be used for the slowrun. Surely a few hundred LoC to generate data would not be considered cheating :)

New comment by ColonelPhantom in "NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute"

ColonelPhantom — Fri, 20 Mar 2026 11:49:29 +0000

If generating synthetic data is such a great way to improve performance, why would it not be applied to the slowrun? Especially for the unlimited compute track, you should have plenty of time to generate as much synthetic data as your heart desires.

Intuitively, I would expect the synthetic data to mostly just "regurgitate" the existing data, and not add much. But I could be wrong of course, and perhaps doing reinforcement learning somewhere could solve that issue as well (though I don't know if there is much hidden in FineWeb that you could RL on; at best you can do self-verification probably?)

New comment by ColonelPhantom in "Show HN: Axe – A 12MB binary that replaces your AI framework"

ColonelPhantom — Fri, 13 Mar 2026 09:26:07 +0000

I like the idea of LLM-calling as an automation-friendly CLI tool! However, putting all my agents in ~/.config feels antithetical to this. My Bash scripts do not live there either, but rather in a separate script collection, or preferably, at their place of use (e.g. in a repo).

For example, let's say I want to add commit message generation (which I don't think is a great use of LLMs, but it is a practical example) to a repo. I would add the appropriate hook to .git/hooks, but I would also want the agent with its instructions to live inside the repo (perhaps in an `axe` or `agents` directory).

Can Axe load agents from the current folder? Or can that be added?

New comment by ColonelPhantom in "Forcing Flash Attention onto a TPU and Learning the Hard Way"

ColonelPhantom — Fri, 13 Mar 2026 00:46:56 +0000

Interesting read! One remark though: I'm not too familiar with the architecture of a Google TPU, but comparing the TPU's VMEM with Nvidia's shared memory feels wrong to me.

Looking at the size, and its shared nature, it feels far more natural to compare with the L2 cache, which is also shared across the entire GPU and is in the same order of size (40MB on the listed A100).

New comment by ColonelPhantom in "MacBook Pro with M5 Pro and M5 Max"

ColonelPhantom — Wed, 04 Mar 2026 12:09:52 +0000

The reason for that is that most memory bandwidth bumps come with new memory generations. For example an early DDR4 platform (e.g. Intel Skylake/Core iX-6000) and a late one (e.g. AMD Zen3/Ryzen 5000) only differ by 1.5x as well, typically.

The same trend is visible in GPUs: for example, my RTX 2070 (GDDR6) has the same memory bandwidth as a 3070 and only a little bit less than a 4070 (GDDR6X). However, a 5070 does get significantly more bandwidth due to the jump to GDDR7. Lower-end cards like the 4060 even stuck to GDDR6, which gave them a bandwidth deficit compared to a 3060 due to the narrower memory buses on the 40 series.
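The GPU comparison follows from bus width times per-pin data rate. The bus widths and data rates in the sketch below are recalled from public spec sheets and are not given in the comment itself, so treat them as assumptions.

```python
def gpu_bandwidth_gbs(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    return bus_width_bits / 8 * gbps_per_pin


# (bus width in bits, data rate in Gbps per pin); assumed values, see lead-in
cards = {
    "RTX 2070 (GDDR6)":  (256, 14),
    "RTX 3070 (GDDR6)":  (256, 14),
    "RTX 4070 (GDDR6X)": (192, 21),
    "RTX 5070 (GDDR7)":  (192, 28),
    "RTX 3060 (GDDR6)":  (192, 15),
    "RTX 4060 (GDDR6)":  (128, 17),
}

bandwidth = {name: gpu_bandwidth_gbs(*spec) for name, spec in cards.items()}
for name, gbs in bandwidth.items():
    print(f"{name}: {gbs:.0f} GB/s")
```

With these inputs the 2070 and 3070 tie, the 4070 is slightly ahead, the 5070 jumps thanks to GDDR7, and the 4060 falls below the 3060 because of its narrower bus.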

New comment by ColonelPhantom in "Qwen 3.5 small models out"

ColonelPhantom — Wed, 25 Feb 2026 07:39:15 +0000

It's not just Qwen; we also recently had GLM-4.7-Flash in the same roughly 30B-A3B range. Seems to me like there's no shortage of competition for good old GPT-OSS 20B (not just Qwen3.5-35B and GLM-4.7-Flash, but also Qwen3(-Coder)-30B or Granite 4 Small).