Hacker News: GreenGames

A 35B MoE on a 16 GB GPU, without the offload tax

GreenGames — Mon, 08 Jun 2026 15:29:58 +0000

Article URL: https://www.lucebox.com/blog/spark

Comments URL: https://news.ycombinator.com/item?id=48446667

Points: 2

# Comments: 0

PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090

GreenGames — Fri, 01 May 2026 14:34:06 +0000

Article URL: https://github.com/Luce-Org/lucebox-hub/tree/main/pflash

Comments URL: https://news.ycombinator.com/item?id=47975259

Points: 3

# Comments: 1

New comment by GreenGames in "We got 207 tok/s with Qwen3.5-27B on an RTX 3090"

GreenGames — Mon, 20 Apr 2026 21:53:36 +0000

This reads like you didn’t read the post.

z-lab runs BF16 on B200 (54+ GB). There is no z-lab path that fits on a 24 GB 3090. That is literally the entire point of our work, and it is stated in the second paragraph. If you had checked the HF model card you linked before posting, you would see the same thing. Before this repo, there was no path to run this... SGLang's GGUF path for this model is broken. llama.cpp doesn't have DFlash speculative decoding at all. If you wanted to run this hybrid model fast on a 24 GB consumer card, there was nothing...

That took weeks of real engineering.

Calling that "vibecoded" because we used a bit of AI to keep the README clean is the laziest possible critique. An LLM reading the DFlash paper does not catch verify_logits_buf being sized vocab*q_len when DDTree reads vocab*(budget+1). That is hours of debugging with nvidia-smi and memory sanitizers, not prompting.

The 207 and 129.5 numbers are both in the second sentence of the post and again in the TL;DR. 207.6 is peak tok/s in the linked demo video, 129.5 is the HumanEval 10-prompt mean at DDTree budget=22. We specify both just behind the title.

On the Q4 KV cache: the tradeoff is disclosed with actual numbers. AL 8.56 -> 8.33 at short context (3% drop), dramatically better at long context. It’s the only way 128K allocates on 24 GB. The binary is env-selectable, you can run BF16 KV if you don’t need 128K. Both are benchmarked.

New comment by GreenGames in "We got 207 tok/s with Qwen3.5-27B on an RTX 3090"

GreenGames — Mon, 20 Apr 2026 18:46:22 +0000

We built a standalone C++/ggml speculative decoder for Qwen3.5-27B Q4_K_M with a DFlash block-diffusion draft.

207.6 tok/s peak (5.46x over AR); HE 10-prompt bench averages 129.5 tok/s at DDTree budget=22, single RTX 3090, 24 GB. 3.43x over autoregressive and 2.8x over the best public SGLang AWQ number.

TL;DR

- Peak 207.6 tok/s DFlash vs 38.0 tok/s AR (5.46x). HE bench: 129.5 tok/s mean at DDTree budget=22.

- 3.43x over autoregressive Q4_K_M baseline (37.78 tok/s).

- 2.8x vs SGLang AWQ reference (46.6 tok/s) on the same RTX 3090.

- 128K context fits on 24 GB. Q4_0 KV + rolling 4096-slot target feature buffer. 134.78 tok/s at ctx=131072.

- Only ggml. Never link libllama. ~2000 LOC C++/CUDA in libdflash27b.a around ggml_gated_delta_net.
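The speedup figures in the TL;DR follow directly from the quoted throughputs. A minimal sketch that recomputes them (the constant and function names are ours, not from the repo):

```python
# Throughput figures quoted in the post, in tok/s.
PEAK_DFLASH = 207.6
PEAK_AR = 38.0
HE_MEAN_DFLASH = 129.5
AR_Q4_K_M_BASELINE = 37.78
SGLANG_AWQ = 46.6


def speedup(new_tok_s: float, base_tok_s: float) -> float:
    """Ratio of two throughputs measured in the same unit."""
    return new_tok_s / base_tok_s


ratios = {
    "peak DFlash vs AR": speedup(PEAK_DFLASH, PEAK_AR),
    "HE mean vs AR Q4_K_M baseline": speedup(HE_MEAN_DFLASH, AR_Q4_K_M_BASELINE),
    "HE mean vs SGLang AWQ": speedup(HE_MEAN_DFLASH, SGLANG_AWQ),
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Note that the 5.46x figure compares peak against peak, while 3.43x and 2.8x compare the 10-prompt mean against the two baselines.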

Why the experiment exists

Qwen3.5-27B is a hybrid model: every 4th layer is full softmax attention, the rest (48 of 64) are Gated DeltaNet, so it carries an SSM state cache alongside the KV cache. That combo doesn't have a good single-3090 decode path today:

- llama.cpp has the GGUF loader and ggml_gated_delta_net, but no DFlash speculative decoding.

- vLLM / SGLang ship z-lab's DFlash integration, but only on BF16 (54 GB, doesn't fit on 24 GB).

- AWQ target on SGLang runs plain AR at 46.6 tok/s but can't host a BF16 draft + DDTree state in 24 GB.

- z-lab's reference benchmarks run BF16 on B200, 54+ GB class.

We wanted the fastest single-3090 decode on a 24 GB card. The answer: port only the graph glue to ggml, keep the existing DeltaNet kernel, run a DFlash block-diffusion draft with a DDTree verifier, and compress KV to Q4_0 for long context.

From autoregressive to DDTree

Same 10-prompt HE bench, n_gen=256, Q4_K_M target, BF16 draft. AL = average accept length. The DDTree paper reports +35-42% over chain DFlash on pure-attention Qwen3 variants. On our hybrid Q4_K_M/RTX 3090 combo we see +15% over chain. The gap comes from Q4 quantization flattening the draft softmax, partially patched with a chain pre-seed in build_ddtree. We are draft-ceiling bound, not verify-memory bound: a bigger tree won't help, only a better draft will.
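For readers new to the metric, here is a minimal sketch of how an accept length can be computed for a chain draft under greedy verification. It is an illustration, not the repo's code: the function names are hypothetical, and whether the reported AL also counts the extra token the target emits after the accepted prefix is an assumption we leave open.

```python
from typing import Sequence


def accept_length(draft_tokens: Sequence[int], target_argmax: Sequence[int]) -> int:
    """Count leading draft tokens that match the target's greedy choice.

    target_argmax[i] is the token the target model would pick at the
    position where the draft proposed draft_tokens[i].
    """
    accepted = 0
    for drafted, wanted in zip(draft_tokens, target_argmax):
        if drafted != wanted:
            break  # first mismatch ends the accepted prefix
        accepted += 1
    return accepted


def average_accept_length(steps: Sequence[tuple]) -> float:
    """Mean accept length over (draft_tokens, target_argmax) pairs."""
    lengths = [accept_length(draft, target) for draft, target in steps]
    return sum(lengths) / len(lengths)


# Toy example with made-up token ids.
steps = [
    ([5, 7, 9, 4], [5, 7, 2, 4]),  # accepts 2, then mismatches
    ([1, 1, 1, 1], [1, 1, 1, 1]),  # accepts all 4
]
print(average_accept_length(steps))
```

A tree verifier differs in that a mismatch on one branch can still be rescued by a sibling branch, which is where the gain over a chain comes from.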

Key wins

- f16 intermediate cache: half the bandwidth, +5% at the same tree budget. Bit-identical to AR at 40 tokens.

- Persist-write kernel (ggml_gated_delta_net_tree_persist): skips a 9 ms ggml_cpy per step, +11%.

- target_feat compaction after sibling accept: unlocked real tree rescue on 9/10 prompts.

- extract_draft_topk reverse bug: sort_heap + cmp_greater already produces descending order; an extra std::reverse was sending the worst candidate to the tree root. One-line fix.

- verify_logits_buf overflow: sized vocab*q_len but DDTree reads vocab*(budget+1) past budget 15. Silent memory corruption. One-line size fix.
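The verify_logits_buf overflow is easy to see with plain arithmetic. The sketch below is illustrative only: VOCAB is a placeholder (the post does not state the real vocabulary size), and Q_LEN = 16 is our inference from "past budget 15" together with the DFlash block size of 16 mentioned under Prefill.

```python
# Hypothetical vocabulary size; the real value is not stated in the post.
VOCAB = 1000
# Assumption: chain verify length, inferred from "past budget 15".
Q_LEN = 16


def chain_logits_elems(vocab: int, q_len: int) -> int:
    """Elements the buffer was originally sized for."""
    return vocab * q_len


def tree_logits_elems(vocab: int, budget: int) -> int:
    """Elements DDTree verification reads: one row per node, budget + 1 rows."""
    return vocab * (budget + 1)


def overflows(vocab: int, q_len: int, budget: int) -> bool:
    """True when tree verification reads past the chain-sized buffer."""
    return tree_logits_elems(vocab, budget) > chain_logits_elems(vocab, q_len)


def safe_logits_elems(vocab: int, q_len: int, budget: int) -> int:
    """One way to size the buffer so both chain and tree verification fit."""
    return vocab * max(q_len, budget + 1)


for budget in (15, 16, 22):
    print(budget, overflows(VOCAB, Q_LEN, budget))
```

Under these assumptions budget=15 still fits exactly, while budget=22 (the benchmark setting) reads seven rows past the end.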

128K context on 24 GB

Flash-attention in ggml-cuda supports Q4_0 K+V natively, so KV compression is just ggml_cpy with the F32->Q4_0 quantizer on write. 8x over f16. Combined with a rolling 4096-slot target_feat ring, target_feat shrinks from 6.6 GB to 0.2 GB at 128K. Tradeoffs: Q4_0 KV costs ~3% quality on HE (AL 8.56 -> 8.33) at short context, and is dramatically better at long context. It is the only thing that lets 128K fit on 24 GB.
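The target_feat saving can be sanity-checked with a small sketch. It assumes the buffer scales linearly with the number of slots and that the ring is addressed by position modulo slot count; both are our assumptions about the implementation, and the names are hypothetical.

```python
RING_SLOTS = 4096          # rolling window size quoted in the post
CTX_128K = 131072          # context length quoted in the post
FULL_TARGET_FEAT_GB = 6.6  # unrolled size at 128K quoted in the post


def ring_slot(position: int, slots: int = RING_SLOTS) -> int:
    """Slot a token position maps to; old entries are overwritten on wrap."""
    return position % slots


def ring_size_gb(full_gb: float, ctx: int, slots: int = RING_SLOTS) -> float:
    """Ring size, assuming memory is proportional to the number of slots kept."""
    kept = min(slots, ctx)
    return full_gb * kept / ctx


print(ring_slot(4095), ring_slot(4096), ring_slot(131071))
print(f"{ring_size_gb(FULL_TARGET_FEAT_GB, CTX_128K):.2f} GB")
```

Keeping 4096 of 131072 positions is a 32x reduction, which matches the quoted drop from 6.6 GB to roughly 0.2 GB.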

Prefill

- Short prompts (<=2048 tok): PREFILL_UBATCH=16. Matches DFlash block size.

- Long prompts (>2048 tok): auto-bump to PREFILL_UBATCH=192.

- 13K prefill: 40.9 s -> 15.07 s (2.7x, ~913 tok/s).
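The micro-batch rule above amounts to a single threshold. A minimal sketch, assuming the switch is based only on prompt length (function names are ours; only the 2048 threshold and the 16/192 values come from the post):

```python
import math

SHORT_PROMPT_MAX = 2048  # threshold quoted in the post
UBATCH_SHORT = 16        # matches the DFlash block size
UBATCH_LONG = 192        # auto-bumped value for long prompts


def prefill_ubatch(n_prompt_tokens: int) -> int:
    """Pick the prefill micro-batch size from the prompt length."""
    if n_prompt_tokens <= SHORT_PROMPT_MAX:
        return UBATCH_SHORT
    return UBATCH_LONG


def n_ubatches(n_prompt_tokens: int) -> int:
    """Number of micro-batches needed to prefill the whole prompt."""
    return math.ceil(n_prompt_tokens / prefill_ubatch(n_prompt_tokens))


# Wall-clock speedup from the quoted 13K prefill timings.
prefill_speedup = 40.9 / 15.07

print(prefill_ubatch(2048), prefill_ubatch(2049))
print(n_ubatches(13000))
print(f"{prefill_speedup:.1f}x")
```

A 13,000-token prompt takes 68 micro-batches at 192 versus 813 at 16, which is where most of the saved time would come from.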

What comes next

- Daemon mode: keep the model resident, first-token latency 10 s -> ms.

- Temperature / top-k sampling in verify. Currently greedy-only.

- Q5_K_M / Q6_K: better quants should recover most of the ~30-point accept gap vs BF16.

- Full llama.cpp integration: qwen35 arch, llama-speculative-dflash.cpp wiring.

- Metal/Vulkan: not planned. CUDA only, anyone who wants Metal can fork.

As soon as Qwen3.6-27B comes out, we'll do the same for it. Repo in the first comment (open source, MIT).

We got 207 tok/s with Qwen3.5-27B on an RTX 3090

GreenGames — Mon, 20 Apr 2026 18:46:22 +0000

Article URL: https://github.com/Luce-Org/lucebox-hub

Comments URL: https://news.ycombinator.com/item?id=47838788

Points: 165

# Comments: 52

Show HN: OS Megakernel that match M5 Max Tok/w at 2x the Throughput on RTX 3090

GreenGames — Wed, 08 Apr 2026 15:00:51 +0000

Hey there, we fused all 24 layers of Qwen3.5-0.8B (a hybrid DeltaNet + Attention model) into a single CUDA kernel launch and made it open source for everyone to try.

On an RTX 3090 power-limited to 220W:

- 411 tok/s vs 229 tok/s on M5 Max (1.8x)

- 1.87 tok/J, beating M5 Max efficiency

- 1.55x faster decode than llama.cpp on the same GPU

- 3.4x faster prefill

The RTX 3090 launched in 2020. Everyone calls it power-hungry. It isn't; the software is.

The conventional wisdom

NVIDIA is fast but thirsty. Apple Silicon is slow but sips power. Pick a side.

With stock frameworks, the numbers back that up:

Setup | tok/s | Power | tok/J
RTX 3090 (llama.cpp) | 267 | 350W | 0.76
M5 Max (LM Studio) | 229 | ~130W | 1.76

Case closed. Except the 3090 has 936 GB/s of bandwidth and 142 TFLOPS of FP16 compute, and llama.cpp extracts 267 tok/s out of it. That ratio is absurd.

Traditional inference dispatches one kernel per operation. For 24 layers, that's roughly 100 launches per token. Every boundary means:

- Return control to the CPU

- Dispatch the next kernel

- Re-fetch weights from global memory

- Synchronize threads

Why had nobody done this yet?

Qwen3.5-0.8B isn't a vanilla transformer. It alternates:

- 18 DeltaNet layers: linear attention with a learned recurrence

- 6 Full Attention layers: standard MHA

This hybrid pattern is where frontier models are heading: Qwen3-Next, Kimi Linear, all of them. DeltaNet scales linearly with context length instead of quadratically.

It's new, and nobody has shipped a fused kernel for it. MLX doesn't have DeltaNet kernels at all. llama.cpp supports it generically. Everyone else is waiting. The 267 tok/s wasn't a hardware ceiling, it was the software ceiling for a brand-new architecture.

We wrote a single CUDA kernel that runs the entire forward pass in one dispatch. Data stays in registers and shared memory as it flows through the network. Zero CPU round-trips, zero redundant memory fetches.

- 82 blocks x 512 threads, all SMs occupied

- BF16 weights and activations, FP32 accumulation

- DeltaNet recurrence runs in warp-cooperative F32 registers

- Full attention fuses QKV, RoPE, causal softmax, and output projection

- Cooperative grid sync replaces kernel launches between layers

Then we turned the power down

Fewer wasted cycles means less heat, so we swept nvidia-smi -pl:

Power limit | Clock | Draw | tok/s | tok/J | Notes
420W (stock) | 1980 MHz | 314W | 433 | 1.38 | baseline
300W | 1935 MHz | 299W | 432 | 1.44 | -5% power, 99.8% speed
220W | 1635 MHz | 220W | 411 | 1.87 | -30% power, 95% speed
150W | 405 MHz | 150W | 194 | 1.29 | clock cliff, too aggressive

At 220W we hit the sweet spot: 95% of the throughput for 70% of the power. Tighter execution converts almost directly into saved watts. Measurement: NVML energy counters for NVIDIA, powermetrics for Apple Silicon, matching Hazy Research's Intelligence Per Watt methodology. Accelerator power only, not wall draw.
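The tok/J column is just throughput divided by measured draw, since a watt is a joule per second. A short sketch that recomputes the sweep and picks the most efficient setting (the data is copied from the table in this post; the variable names are ours):

```python
# (power limit in W, measured draw in W, throughput in tok/s), from the sweep.
SWEEP = [
    (420, 314, 433),
    (300, 299, 432),
    (220, 220, 411),
    (150, 150, 194),
]


def tok_per_joule(tok_s: float, draw_w: float) -> float:
    """Tokens per joule: (tok/s) / (J/s)."""
    return tok_s / draw_w


baseline_limit, baseline_draw, baseline_tok_s = SWEEP[0]

for limit, draw, tok_s in SWEEP:
    eff = tok_per_joule(tok_s, draw)
    print(
        f"{limit}W: {eff:.2f} tok/J, "
        f"{draw / baseline_draw:.0%} of stock power, "
        f"{tok_s / baseline_tok_s:.0%} of stock speed"
    )

best = max(SWEEP, key=lambda row: tok_per_joule(row[2], row[1]))
print("most efficient power limit:", best[0])
```

The percentages are relative to the measured 314W stock draw, not the 420W limit, which is why 220W works out to about 70% of the power.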

Without the megakernel the 3090 barely edges out a laptop chip. With it, a five-year-old GPU beats Apple's latest on throughput, matches it on efficiency, and costs a quarter as much. The NVIDIA vs Apple efficiency gap isn't silicon. It's software.

Try it

  git clone https://github.com/Luce-Org/luce-megakernel.git
  cd luce-megakernel
  pip install -e .
  python bench_pp_tg.py

Requires: NVIDIA Ampere+ (tested on 3090), CUDA 12+, PyTorch 2.0+, ~1.5GB VRAM.

Code is open source (MIT): https://github.com/Luce-Org/luce-megakernel

Let us know if you have any feedback

Comments URL: https://news.ycombinator.com/item?id=47691182

Points: 6

# Comments: 1

New comment by GreenGames in "Why AI code fails differently: What I learned talking to 200 engineering teams"

GreenGames — Wed, 12 Nov 2025 15:45:37 +0000

Super interesting take Paul. Curious btw, how are these teams actually encoding their “institutional knowledge” into constraints? Like is it some manual config or more like natural‑language rules that evolve with the codebase?

New comment by GreenGames in "App-Use, Control Individual Applications with CUA Agents"

GreenGames — Tue, 17 Jun 2025 16:17:25 +0000

Hi there, Alessandro and Francesco here. We just launched an experimental feature in C/ua called App-Use. It lets you create virtual desktops scoped to specific apps (e.g., "Safari and Notes only") to give your agents focused, lightweight control without full-screen access.

Use cases:

- Run multiple agents in parallel with isolated app views

- Automate your iPhone using the iPhone Mirroring app

- Improve agent task precision and reduce VLM distractions

Works only on macOS (Sequoia+) and requires experiments=["app-use"]. No extra processes, just clever compositing.

More details: https://www.trycua.com/blog/app-use

Feedback and experiments welcome!

App-Use, Control Individual Applications with CUA Agents

GreenGames — Tue, 17 Jun 2025 16:17:25 +0000

Article URL: https://www.trycua.com/blog/app-use

Comments URL: https://news.ycombinator.com/item?id=44300812

Points: 2

# Comments: 1

Show HN: Lumier – Run macOS VMs in a Docker

GreenGames — Wed, 14 May 2025 15:19:41 +0000

Hey HN, we're excited to share Lumier (https://github.com/trycua/cua/tree/main/libs/lumier), an open-source tool for running macOS and Linux virtual machines in Docker containers on Apple Silicon Macs.

When building virtualized environments for AI agents, we needed a reproducible way to package and distribute macOS VMs. Inspired by projects like dockur/windows (https://github.com/dockur/windows) that pioneered running Windows in Docker, we wanted to create something similar but optimized for Apple Silicon. The existing solutions either didn't support M-series chips or relied on KVM/Intel emulation, which was slow and cumbersome. We realized we could leverage Apple's Virtualization Framework to create a much better experience.

Lumier takes a different approach: it uses Docker as a delivery mechanism (not for isolation) and connects to a lightweight virtualization service (lume) running on your Mac. This creates true hardware-accelerated VMs using Apple's native virtualization capabilities.

With Lumier, you can:

- Launch a ready-to-use macOS VM in minutes with zero manual setup

- Access your VM through any web browser via VNC

- Share files between your host and VM effortlessly

- Use persistent storage or ephemeral mode for quick tests

- Automate VM startup with custom scripts

All of this works natively on Apple Silicon (M1/M2/M3/M4) - no emulation required.

To get started:

1. Install Docker for Apple Silicon: https://desktop.docker.com/mac/main/arm64/Docker.dmg

2. Install lume background service with our one-liner:

  /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"

3. Start a VM (ephemeral mode):

  docker run -it --rm \
    --name lumier-vm \
    -p 8006:8006 \
    -e VM_NAME=lumier-vm \
    -e VERSION=ghcr.io/trycua/macos-sequoia-cua:latest \
    -e CPU_CORES=4 \
    -e RAM_SIZE=8192 \
    trycua/lumier:latest

4. Open http://localhost:8006/vnc.html in your browser. The container will generate a unique password for each VM instance - you'll see it in the container logs.

For persistent storage (so your changes survive container restarts):

  mkdir -p storage
  docker run -it --rm \
    --name lumier-vm \
    -p 8006:8006 \
    -v $(pwd)/storage:/storage \
    -e VM_NAME=lumier-vm \
    -e HOST_STORAGE_PATH=$(pwd)/storage \
    trycua/lumier:latest

Want to share files with your VM? Just add another volume:

  mkdir -p shared
  docker run ... -v $(pwd)/shared:/shared -e HOST_SHARED_PATH=$(pwd)/shared ...

You can even automate VM startup by placing an on-logon.sh script in shared/lifecycle/.
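If you launch VMs from scripts, the storage and sharing options above compose naturally. The helper below is a hypothetical convenience wrapper, not part of Lumier; it only assembles the same flags and environment variables shown in the commands in this post.

```python
import os
import shlex
from typing import List, Optional

IMAGE = "trycua/lumier:latest"


def lumier_run_args(
    vm_name: str = "lumier-vm",
    storage_dir: Optional[str] = None,
    shared_dir: Optional[str] = None,
) -> List[str]:
    """Build a `docker run` argument list for Lumier.

    Hypothetical helper: it mirrors the flags from the post and adds nothing new.
    """
    args = [
        "docker", "run", "-it", "--rm",
        "--name", vm_name,
        "-p", "8006:8006",
        "-e", f"VM_NAME={vm_name}",
    ]
    if storage_dir is not None:
        # Persistent mode: mount the host directory and tell lume where it is.
        host = os.path.abspath(storage_dir)
        args += ["-v", f"{host}:/storage", "-e", f"HOST_STORAGE_PATH={host}"]
    if shared_dir is not None:
        # Shared folder: same pattern, different mount point and variable.
        host = os.path.abspath(shared_dir)
        args += ["-v", f"{host}:/shared", "-e", f"HOST_SHARED_PATH={host}"]
    args.append(IMAGE)
    return args


# Print the command instead of running it, so it can be inspected first.
print(shlex.join(lumier_run_args(storage_dir="storage", shared_dir="shared")))
```

Passing the resulting list to `subprocess.run` would start the container; printing it first makes it easy to compare against the manual commands.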

We're seeing people use Lumier for:

- Development and testing environments that need macOS

- CI/CD pipelines for Apple platform apps

- Disposable macOS instances for security research

- Automated UI testing across macOS versions

- Running AI agents in isolated environments

Lumier is 100% open-source under the MIT license. We're actively developing it as part of our work on C/ua (https://github.com/trycua/cua), and we'd love your feedback, bug reports, or feature ideas.

We'll be here to answer any technical questions and look forward to your comments!

Comments URL: https://news.ycombinator.com/item?id=43985624

Points: 159

# Comments: 52