Hacker News: lhl

New comment by lhl in "DeepSeek 4 Flash local inference engine for Metal"

lhl — Mon, 18 May 2026 07:47:38 +0000

It's still Python, but I removed torch dependencies (HIP/C++ for hot paths): http://github.com/shisa-ai/hipEngine/

There's a docs/ folder in there that is probably of interest as well.

New comment by lhl in "DeepSeek 4 Flash local inference engine for Metal"

lhl — Mon, 18 May 2026 07:44:58 +0000

Took a little longer to clean up than I expected. I'd recommend checking out the ROOFLINE and the LESSONS-LEARNED docs here: https://github.com/shisa-ai/hipEngine/tree/main/docs

New comment by lhl in "DeepSeek 4 Flash local inference engine for Metal"

lhl — Fri, 08 May 2026 09:16:53 +0000

When it's in a good state I'll open source it, I am keeping track of what optimizations make the most impact, stuff like this:

### Diagnosing parallelism pathologies (L1)

*Grid occupancy:* - `Grid_Size / Workgroup_Size >= CU count` (W7900 = 96, Strix Halo = 40)? - < 0.3 = massively undersubscribed. Fix grid FIRST. Micro-optimization will NOT help. - 0.3-1.0 = partially utilized; depends on VGPR/LDS pressure. - 1.0-4.0 = healthy; micro-optimization can help.

*Within-block distribution:* - Does the kernel do useful work across all threads, or is there an `if (threadIdx.x == 0)` gate around a serial top-k, reduction, or scan? For c=1 decode, many kernels can't grow the grid, but they can always parallelize inside the block. - `Scratch_Size > 0` from dynamically-indexed per-thread arrays is a strong secondary signal of the within-block pathology.

*Router top-k (within-block fix)*: - Kernel: `qwen35_router_select_kernel` @ c=1 decode - Before: grid=1 (can't help; num_tokens=1), blockDim=512, `if (threadIdx.x == 0)` gated 2048 serial compares. Scratch=144 B from spilled per-thread arrays. - Fix: warp-shuffle parallel argmax across the whole block + `__shared__` top_vals buffer eliminating the spill. - Result: 5.7× kernel speedup, +6.6% on 4K/D4K E2E.

New comment by lhl in "DeepSeek 4 Flash local inference engine for Metal"

lhl — Fri, 08 May 2026 00:10:15 +0000

I think especially with the ability for SOTA AI to optimize kernels more people should try their hand at making better inference for their specific hardware.

I have an older W7900 (RDNA3) which, besides 48GB of VRAM, has some pretty decent roofline specs - 123 FP16 TFLOPS/INT8 TOPS, 864 GB/s MBW, but has had notoriously bad support both from AMD (ROCm) as well as llama.cpp.

Recently I decided I'd like to turn the card into a dedicated agentic/coder endpoint and I started tuning a W8A8-INT8 model. Over the course of a few days of autolooping (about 800 iterations using a variety of frontier/SOTA models, Kimi K2.6 did surprisingly well), and I ended up with prefill +20% and decode +50% faster than the best llama.cpp numbers for Qwen3.6 MoE.

I'm currently grinding MTP and DFlash optimization on it, but I've been pretty pleased with the results, and will probably try Gemma 4 next.

New comment by lhl in "StarFighter 16-Inch"

lhl — Wed, 06 May 2026 09:16:29 +0000

Oh, is this actually out now? If so, great, but I took a quick look and didn't spot any third party review yet. For those interested in this laptop, personally I'd still wait for some reviews from some real world people.

Some history on this laptop:

- The StarFighter 16 was originally announced back in November 2022 with an original delivery timeline of 3-4 months: https://www.reddit.com/r/linuxhardware/comments/yjuahx/star_...

- Here's a 500-comment HN thread from Feb 2023 about it (3-4 months later) now with an additional 4-5 month lead time: https://news.ycombinator.com/item?id=34759507

- The latest production updates only go back to July 31 2025 - they mention a 3-5 month timeline from January 2025 (seeing a pattern?): https://starlabs.kb.help/starfighter-production-updates/

There's an "Unboxing" video from Star Labs on the StarFighter from January 22, 2026: https://www.youtube.com/watch?v=HjYJS5AJZpE

So, 3.5 years later, the chassis is still neat, and good on them for plugging away I guess, but for anyone that actually needs a new computer, there's no shortage of higher-end Linux-centric laptops with a better shipping track record (Framework, Tuxedo Computers, Slimbook, etc).

New comment by lhl in "DeepClaude – Claude Code agent loop with DeepSeek V4 Pro"

lhl — Mon, 04 May 2026 10:04:37 +0000

For those that don't want their data trained on, OpenRouter allows you to have account-wide or per-request routing with either provider.data_collection: "deny" or zdr: true (zero data retention).

Also, you can use HuggingFace Inference for DeepSeek V4 or Kimi K2.6, both of which work quite well and route through providers that you can enable/disable (like Together AI, DeepInfra, etc) - you'll have to check their policies but I think most of those commercial inference providers claim to not train on your data either.

New comment by lhl in "Why isn't AMD's MI300X competitive?"

lhl — Thu, 30 Apr 2026 15:23:06 +0000

RDNA is a whole different (and much poorer supported) animal than CDNA. As someone with extensive experience in both, if you're asking the question, then, no.

(If you're just looking to learn, use the free Kaggle/Google Cola T4s/TPUs to get started.)

New comment by lhl in "The RAM shortage could last years"

lhl — Sun, 19 Apr 2026 09:32:02 +0000

BTW, a number of corrections. The TurboQuant paper was submitted to Arxiv back in April 2025: https://arxiv.org/abs/2504.19874

Current "TurboQuant" implementations are about 3.8X-4.9X on compression (w/ the higher end taking some significant hits of GSM8K performance) and with about 80-100% baseline speed (no improvement, regression): https://github.com/vllm-project/vllm/pull/38479

For those not paying attention, it's probably worth sending this and ongoing discussion for vLLM https://github.com/vllm-project/vllm/issues/38171 and llama.cpp through your summarizer of choice - TurboQuant is fine, but not a magic bullet. Personally, I've been experimenting with DMS and I think it has a lot more promise and can be stacked with various quantization schemes.

The biggest savings in kvcache though is in improved model architecture. Gemma 4's SWA/global hybrid saves up to 10X kvcache, MLA/DSA (the latter that helps solve global attention compute) does as well, and using linear, SSM layers saves even more.

None of these reduce memory demand (Jevon's paradox, etc), though. Looking at my coding tools, I'm using about 10-15B cached tokens/mo currently (was 5-8B a couple months ago) and while I think I'm probably above average on the curve, I don't consider myself doing anything especially crazy and this year, between mainstream developers, and more and more agents, I don't think there's really any limit to the number of tokens that people will want to consume.

New comment by lhl in "Sam Altman may control our future – can he be trusted?"

lhl — Tue, 07 Apr 2026 06:10:38 +0000

As some other people mentioned, using both/multiple is the way to go if it's within your means.

I've been working on a wide range of relatively projects and I find that the latest GPT-5.2+ models seem to be generally better coders than Opus 4.6, however the latter tends to be better at big picture thinking, structuring, and communicating so I tend to iterate through Opus 4.6 max -> GPT-5.2 xhigh -> GPT-5.3-Codex xhigh -> GPT-5.4 xhigh. I've found GPT-5.3-Codex is the most detail oriented, but not necessarily the best coder. One interesting thing is for my high-stakes project, I have one coder lane but use all the models do independent review and they tend to catch different subsets of implementation bugs. I also notice huge behavioral changes based on changing AGENTS.md.

In terms of the apps, while Claude Code was ahead for a long while, I'd say Codex has largely caught up in terms of ergonomics, and in some things, like the way it let's you inline or append steering, I like it better now (or where it's far, far, ahead - the compaction is night and day better in Codex).

(These observations are based on about 10-20B/mo combined cached tokens, human-in-the-loop, so heavy usage and most code I no longer eyeball, but not dark factory/slop cannon levels. I haven't found (or built) a multi-agent control plane I really like yet.)

New comment by lhl in "So where are all the AI apps?"

lhl — Tue, 24 Mar 2026 17:02:26 +0000

Like others have mentioned, I think the premise of looking at the most popular few projects (pypi.org currently lists 771,120 projects) on pypi as any sort of proxy for AI coding is terribly misguided/unrepresentative and that almost no one is going to be packaging up their vibe-coded projects for distribution on pypi.

That being said, I've personally put 3 up recently (more than I've published in total). I'm sure they have close to zero downloads (why would they? they're brand new, solve my own problems, I'm not interested in marketing them or supporting them, they're just shared because they might be useful to others) so they wouldn't show up in their review. 2 of these are pretty meaty projects that would have taken weeks if not months of work but instead have been largely just built over a weekend or a few days. I'd say it's not just the speed, but that w/o the lowered effort, these projects just wouldn't ever have crossed the effort/need bar of ever being started.

I've probably coded 50-100X more AI-assisted code that will never go to pypi, even as someone that has released pypi packages before (which already puts me in a tiny minority of programmers, much less regular people that would even think about uploading a pypi project).

For those interested in the scope of the recent projects:

https://pypi.org/project/realitycheck/ - first pypi: Jan 21 - 57K SLoC - "weekend" project that kept growing. It's a framework that leverages agentic coding tools like Codex/Claude Code to do rigorous, systematic analysis of claims, sources, predictions, and argument chains.It has 400+ tests, and does basically everything I want it to do now. The repo has 20 stars and I'd estimate only a handful of people are using it.

https://pypi.org/project/tweetxvault/ - first pypi: Mar 16 - 29K SLoC - another weekend project (followup on a second weekend). This project is a tool for archiving your Twitter/X bookmarks, likes, and tweets into a local db, with support for importing from archives and letting you search through them. I actually found 3 or 4 other AI-coded projects that didn't do quite what I wanted so it I built my own. This repo has 4 stars, although a friend submitted a PR and mentioned it solved exactly their problem and saved them from having to build it themselves, so that was nice and justifies publishing for me.

https://pypi.org/project/batterylog/ - first pypi: Mar 22 - 857 SLoC - this project is actually something I wrote (and have been using daily) 3-4 years ago, but never bothered to properly package up - it tracks how much battery is drained by your laptop when asleep and it's basically the bare minimum script/installer to be useful. I never bothered to package it up b/c quite frankly, manual pypi releases are enough of a PITA to not bother, but LLMs now basically make it a matter of saying "cut a release," so when I wanted to add a new feature, I packaged it up as well, which I would never have done this otherwise. This repo has 42 stars and a few forks, although probably 0 downloads from pypi.

(I've spent the past couple years heavily using AI-assisted workflows, and only in the past few months (post Opus 4.6, GPT-5.2) would I have even considered AI tools reliable enough to consider trusting them to push new packages to pypi.)

New comment by lhl in "Wayland set the Linux Desktop back by 10 years?"

lhl — Fri, 20 Mar 2026 03:42:03 +0000

Funy that you mention multi-monitor since it's one of the reasons I eventually moved to Wayland. The only way to support different DPI monitors in X was to do janky scaling or even jankier multiple X servers.

I don't use KDE (or GNOME anymore) but while I had to deal with a lot of initial speedbumps a couple years ago, these days instead of a full DE, I'm using a Niri setup and it's worked out great for me.

For my laptop, I have my own monitor-detection/wl-mirror script for example that is faster and more reliable for plugging into projectors/meeting room HDMI than even my old Macs.

New comment by lhl in "Claude's Cycles [pdf]"

lhl — Sat, 07 Mar 2026 04:31:04 +0000

Yes, I read it and specifically pointed it out (that's why there are 3 hours of interactive logs). There are 4 other runs pushed now so you can see what actual clean room runs for 5.2 xhigh, 5.3-Codex xhigh, 5.4 xhigh, and Opus 4.6 ultrathink look like: https://github.com/lhl/claudecycles-revisited/blob/main/COMP... as well as the baseline.

New comment by lhl in "Claude's Cycles [pdf]"

lhl — Thu, 05 Mar 2026 15:24:50 +0000

I am not a theoretical CS or math expert by any means, but I have been wrangling coding agents for a while and reading the paper and the problems Stapper had with dealing w/ Claude (context management, instruction following, etc) decided to see if I could replicate with a slightly better harness. The results were pretty interesting: https://github.com/lhl/claudecycles-revisited

- My original setup left traces of the PDF paper and after GPT 5.3-Codex xhigh reached an impasse it went looking for it and found it!

- I went and did cleanroom (basically one-shot) passes for GPT 5.2 xhigh, GPT 5.3-Codex xhigh, and Claude Opus 4.6 ultrathink and 5.2/5.3 found alternate solutions for odd m >= 5 , Opus 4.6 did not find any proofs but tried more approaches to solving.

Full comparison/analysis here: https://github.com/lhl/claudecycles-revisited/blob/main/COMP...

I've also included the session traces and analysis in the repo branches. Also, the AGENTS.md was pretty simple, but that harness produced consistent process outcomes across all three models:

- All built verifiers first

- All maintained worklogs with exact commands

- All archived machine-readable artifacts

- All documented failed approaches

- All maintained restart-safe context capsules

New comment by lhl in "Claude's Cycles [pdf]"

lhl — Wed, 04 Mar 2026 14:59:20 +0000

I was a bit interested to do a replication and see if better harness could avoid some of the problems they ran w/ context management, poor instruction following, etc and it looks like yes, it's definitely possible.

Here's my repo: https://github.com/lhl/claudecycles-revisited

I used Codex w/ 5.2 xhigh and a relatively simple AGENTS.md - I have some session-analysis as well. The original replication was 47 minutes, then another 30 minutes of gap filling, and finally about 30 minutes of writing an extension to take the work a bit further, with Claude Code Opus 4.6 doing some documentation cleanup and verification.

New comment by lhl in "GPT-5.2"

lhl — Fri, 12 Dec 2025 06:43:32 +0000

Anecdotally, I will say that for my toughest jobs GPT-5+ High in `codex` has been the best tool I've used - CUDA->HIP porting, finding bugs in torch, websockets, etc, it's able to test, reason deeply and find bugs. It can't make UI code for it's life however.

Sonnet/Opus 4.5 is faster, generally feels like a better coder, and make much prettier TUI/FEs, but in my experience, for anything tough any time it tells you it understands now, it really doesn't...

Gemini 3 Pro is unusable - I've found the same thing, opinionated in the worst way, unreliable, doesn't respect my AGENTS.md and for my real world problems, I don't think it's actually solved anything that I can't get through w/ GPT (although I'll say that I wasn't impressed w/ Max, hopefully 5.2 xhigh improves things). I've heard it can do some magic from colleagues working on FE, but I'll just have to take their word for it.

New comment by lhl in "Kimi Linear: An Expressive, Efficient Attention Architecture"

lhl — Fri, 31 Oct 2025 11:37:07 +0000

We do live in an age of frontier LLMs... For fun, I'll just use Kimi K2 (on Kagi Assistant).

> Can you explain what this means and its significance? Assume that I'm a layperson with no familiarity with LLM jargon so explain all of the technical terms, references, names. https://github.com/MoonshotAI/Kimi-Linear

Imagine your brain could only “look at” a few words at a time when you read a long letter. Today’s big language models (the AI that powers chatbots) have the same problem: the longer the letter gets, the more scratch paper they need to keep track of it all. That scratch paper is called the “KV cache,” and for a 1 000 000-word letter it can fill a small library.

Kimi Linear is a new way for the AI to read and write that throws away most of that scratch paper yet still understands the letter. It does this by replacing the usual “look at every word every time” trick (full attention) with a clever shortcut called linear attention. The shortcut is packaged into something they call Kimi Delta Attention (KDA).

What the numbers mean in plain English

    51.0 on MMLU-Pro: on a 4 000-word school-test set, the shortcut scores about as well as the old, slow method.
    84.3 on RULER at 128 000 words: on a much longer test it keeps the quality high while running almost four times faster.
    6 × faster TPOT: when the AI is writing its reply, each new word appears up to six times sooner than with the previous best shortcut (MLA).
    75 % smaller KV cache: the scratch paper is only one-quarter the usual size, so you can fit longer conversations in the same memory.

Key pieces explained

    Full attention: the old, accurate but slow “look back at every word” method.
    KV cache: the scratch paper that stores which words were already seen.
    Linear attention: a faster but traditionally weaker way of summarising what was read.
    Gated DeltaNet: an improved linear attention trick that keeps the most useful bits of the summary.
    Kimi Delta Attention (KDA): Moonshot’s even better version of Gated DeltaNet.
    Hybrid 3:1 mix: three layers use the fast KDA shortcut, one layer still uses the old reliable full attention, giving speed without losing smarts.
    48 B total, 3 B active: the model has 48 billion total parameters but only 3 billion “turn on” for any given word, saving compute.
    Context length 1 M: it can keep track of about 1 000 000 words in one go—longer than most novels.

Bottom line Kimi Linear lets an AI read very long documents or hold very long conversations with far less memory and much less waiting time, while still giving answers as good as—or better than—the big, slow models we use today.

New comment by lhl in "Niri – A scrollable-tiling Wayland compositor"

lhl — Fri, 03 Oct 2025 15:31:18 +0000

Looks great, thanks for the suggestion!

New comment by lhl in "Niri – A scrollable-tiling Wayland compositor"

lhl — Fri, 03 Oct 2025 14:27:51 +0000

I'd previously been giving Hyprland a try, but after lots of customization work, there were still a bunch of things I wasn't happy with and ended back on GNOME as a "just let me get work done" thing (I use multiple workspaces, have always have dozens or hundreds of browser windows open, depend on a bunch tray extensions). That being said, GNOME just updated versions and broke all my extensions again so I've decided to recommit to work on fixing anything that isn't working for my workflow and ditching GNOME forever (I was previously much happier on Openbox, but well, Wayland).

With this latest go I gave River, QTile, and Niri a try. After a bit of swapping back and forth, I've settled on Niri and am slowly adding functionality I'm missing.

- I like multiple dynamic workspaces (grouped by function) and don't see much point beyond a split or two so Niri worked pretty well, and I was able to largely config all the keyboard shortcuts to something that made sense to me

- I'm using waybar and swaync for my other DE bits

I've also been using long running Claude Code/Codex in a workspace to build a number of custom scripts:

- niri-workspaces - dynamically generate a workspace display on my waybar showing windows, activity

- niri-workspace-names - integrate w/ fuzzel to let me rename workpaces

- niri-alttab - getting app cycling working in a way that makes sense to me, this is a larger project probably if I want live thumbnails and the like

- niri-terminal-below - I often want to have a new vertical terminal split and it's a bit hacky but works (have to punch out a new terminal, then bring it below, and move back if on the right side)

I haven't gone through all the docs, done much looking around, but one nice thing with these new coding agents is that they can just go and do a passable job to tweak as I want.

New comment by lhl in "Nvidia DGX Spark"

lhl — Thu, 28 Aug 2025 07:04:44 +0000

In Linux, you can set it as high as you want, although you should probably have a swap drive and still be prepared for you system to die if you set it to 128GiB. Here's how you'd set it to 120GiB:

    # This is deprecated, but can still be referenced
    options amdgpu gttsize=122800

    # This specifies GTT by # of 4KB pages:
    #   31457280 * 4KB / 1024 / 1024 = 120 GiB
    options ttm pages_limit=31457280

New comment by lhl in "Nvidia DGX Spark"

lhl — Thu, 28 Aug 2025 06:59:53 +0000

RDNA3 CUs do not have FP8 support and its INT8 runs at the same speed as FP16 so Strix Halo's max theoretical is basically 60 TFLOPS no matter how you slice it (well it has double INT4, but I'm unclear on how generally useful that is):

    512 ops/clock/CU * 40 CU * 2.9e9 clock / 1e12 = 59.392 FP16 TFLOPS

Note, even with all my latest manual compilation whistles and the latest TheRock ROCm builds the best I've gotten mamf-finder up to about 35 TFLOPS, which is still not amazing efficiency (most Nvidia cards are at 70-80%), although a huge improvement over the single-digit TFLOPS you might get ootb.

If you're not training, your inference speed will largely be limited by available memory bandwidth, so the Spark token generation will be about the same as the 395.

On general utility, I will say that the 16 Zen5 cores are impressive. It beats my 24C EPYC 9274F in single and multithreaded workloads by about 25%.