Hacker News: RandyOrion

New comment by RandyOrion in "Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency"

RandyOrion — Sat, 06 Jun 2026 05:42:34 +0000

More rants about local inference, consider yourself warned.

Together with the deliberate bf16-related hardware handicaps on consumer-level nvidia gpus, i.e., gtx 10, rtx 20, 30, 40, 50 series, things get sour really quickly.

New comment by RandyOrion in "Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency"

RandyOrion — Sat, 06 Jun 2026 05:41:06 +0000

From the perspective of a local llm user, I think qat doesn't solve the major problem of the gemma models.

The gemma family (gen 1 to gen 4) consistently shows an extreme range of activations, i.e., up to 600000, essentially forcing people to use a bf16 kv cache and accept a short context window, e.g., 31b, iq4_xs quantization, 100k context window on 32gb memory. Or, people use a q8 kv cache, get a 200k context window, and accept a large performance penalty.

In contrast, for the qwen 3.5 family, the largest activation is below 2000, making a q8 or even lower-precision kv cache essentially free real estate. Together with linear attention, which doesn't require a kv cache, the full 262k context window can be easily reached.

Qat training with a w4a16 target, while improving inference performance with low-precision weights, doesn't solve the kv cache problem at all.

In the end, a qat is a qat, and there are unseen efforts behind qat checkpoints. Thank you gemma team for releasing qat checkpoints.
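
The activation range point above can be illustrated with a rough sketch. It assumes a Q8_0-style scheme (symmetric 8-bit quantization with one scale per block, step = largest magnitude / 127) and assumes values of that size actually reach the cache. The block contents are made up for illustration and are not measurements from any model; `q8_step`, `quantize_dequantize` and `fits_fp16` are hypothetical helper names.

```python
# Sketch only: Q8_0-style symmetric quantization with one scale per block.
# All numbers are illustrative, not measurements from any model.

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def q8_step(abs_max: float) -> float:
    """Quantization step when the block's largest magnitude is abs_max."""
    return abs_max / 127.0


def quantize_dequantize(block: list[float]) -> list[float]:
    """Round every value to the nearest multiple of the block's step."""
    abs_max = max(abs(x) for x in block)
    if abs_max == 0.0:
        return list(block)
    step = q8_step(abs_max)
    return [round(x / step) * step for x in block]


def fits_fp16(value: float) -> bool:
    """True if the value is representable as a finite fp16 number."""
    return abs(value) <= FP16_MAX


for peak in (2000.0, 600000.0):
    # Hypothetical block: one outlier plus a few ordinary-sized values.
    block = [peak, 40.0, -25.0, 10.0]
    restored = quantize_dequantize(block)
    print(f"peak={peak:.0f} step={q8_step(peak):.1f} "
          f"fits_fp16={fits_fp16(peak)} restored={restored}")
```

With the larger peak, the step grows to several thousand, so the ordinary-sized values in the same block collapse to zero, and the peak itself no longer fits in fp16. bf16 has a much larger exponent range, which is one plausible reason it ends up being the fallback.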

New comment by RandyOrion in "Gemma 4 12B: A unified, encoder-free multimodal model"

RandyOrion — Wed, 03 Jun 2026 18:05:23 +0000

A small dense multimodal model with audio support, interesting.

Wait, *Excluding Chinese language.

This is ... curious.

P.S. Where is gemma 4 124b?

New comment by RandyOrion in "Show HN: Hallucinopedia"

RandyOrion — Fri, 08 May 2026 09:17:13 +0000

This website brings me some good chuckles. Now I really know how powerful an on-demand bullsh*t generator is.

New comment by RandyOrion in "Google Chrome silently installs a 4 GB AI model on your device without consent"

RandyOrion — Tue, 05 May 2026 13:41:56 +0000

Like the recent copilot silent signing incident, the without consent part is a blatant foul move.

If you don't like being treated as anything but human, you should seriously consider replacing chrome with ungoogled chromium or other browsers.

New comment by RandyOrion in "VS Code inserting 'Co-Authored-by Copilot' into commits regardless of usage"

RandyOrion — Sun, 03 May 2026 03:11:30 +0000

Yeah, this is part of the reason why vscodium exists.

New comment by RandyOrion in "VS Code inserting 'Co-Authored-by Copilot' into commits regardless of usage"

RandyOrion — Sun, 03 May 2026 01:54:39 +0000

Wow. Just like using ungoogled-chromium instead of chrome, lineage os instead of oem android, using vscodium instead of vscode is again justified. These decisions really are the ones that I'll never regret.

In addition, using the word microslop instead of microsoft is again justified, too.

New comment by RandyOrion in "Shai-Hulud Themed Malware Found in the PyTorch Lightning AI Training Library"

RandyOrion — Fri, 01 May 2026 03:18:51 +0000

One thing that makes me wonder is that there were 4 security issues raised and all of them were automatically commented on and closed by some bot called `pl-ghost` [1][2][3][4]. In the end, only this one [4] was properly handled, and all bot comments were deleted. You can see the bot comments in another report [5], which is more informative than the OP one.

[1] https://github.com/Lightning-AI/pytorch-lightning/issues/216...

[2] https://github.com/Lightning-AI/pytorch-lightning/issues/216...

[3] https://github.com/Lightning-AI/pytorch-lightning/issues/216...

[4] https://github.com/Lightning-AI/pytorch-lightning/issues/216...

[5] https://socket.dev/blog/lightning-pypi-package-compromised

New comment by RandyOrion in "Granite 4.1: IBM's 8B Model Matching 32B MoE"

RandyOrion — Thu, 30 Apr 2026 17:06:41 +0000

Although the performance claim of 8b dense matching 32b moe is somewhat questionable, thank you granite team for releasing small dense LLMs.

New comment by RandyOrion in "Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model"

RandyOrion — Thu, 23 Apr 2026 03:13:14 +0000

Thank you Qwen team. Small DENSE LLMs shape the future of local LLM users.

When Qwen 3.5 27b was released, I didn't really understand why linear attention was used instead of full attention, given the performance degradation and the problems introduced by the extra (linear) operators. After doing some tests, I found that with llama.cpp and the IQ4_XS quant, the model and the BF16 cache of the whole 262k context just fit in 32GB vram, which is impossible with full attention. In contrast, with the gemma 4 31b IQ4_XS quant I have to use a Q8_0 cache to fit the 262k context in vram, which is a little annoying (no offense, thank you gemma team, too).

From benchmarks, the 3.5->3.6 upgrade is about agent things. I hope future upgrades fix some problems I found, e.g., output repetitiveness in long conversations and breadth of knowledge.
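
The memory argument above can be sketched with the usual back-of-the-envelope formula for a full-attention kv cache: 2 (K and V) x layers x kv heads x head dim x context length x bytes per value. The layer and head counts below are hypothetical placeholders, not the real configuration of any Qwen or Gemma model. Linear-attention layers are assumed to keep a fixed-size state and are simply left out of the count. The Q8_0 figure assumes blocks of 32 int8 values plus one 2-byte scale.

```python
# Sketch only: rough kv cache sizing for full-attention layers.
# Layer/head counts are hypothetical, not taken from any released model.

GIB = 2**30
BF16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 int8 values + one 2-byte scale per block


def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, context_len, bytes_per_value):
    """Bytes needed to cache K and V for every token in the context."""
    # K and V each hold n_kv_heads * head_dim values per token per layer.
    values = 2 * n_attn_layers * n_kv_heads * head_dim * context_len
    return values * bytes_per_value


CONTEXT = 262_144

# Hypothetical 64-layer model: every layer uses full attention,
# versus only 1 layer in 4 (the rest being linear attention).
scenarios = {
    "all 64 layers full attention, bf16": (64, BF16_BYTES),
    "all 64 layers full attention, q8_0": (64, Q8_0_BYTES),
    "16 of 64 layers full attention, bf16": (16, BF16_BYTES),
}

for name, (layers, width) in scenarios.items():
    size = kv_cache_bytes(layers, n_kv_heads=8, head_dim=128,
                          context_len=CONTEXT, bytes_per_value=width)
    print(f"{name}: {size / GIB:.1f} GiB")
```

The cache grows linearly with both the number of full-attention layers and the bytes per value, so cutting either one is what makes a long context fit in a fixed vram budget.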

New comment by RandyOrion in "MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU"

RandyOrion — Thu, 09 Apr 2026 03:01:13 +0000

Check out Fig. 6 in this paper. It shows the comparison between the proposed method and the pytorch native FSDP offload method.

New comment by RandyOrion in "Muse Spark: Scaling towards personal superintelligence"

RandyOrion — Thu, 09 Apr 2026 02:10:32 +0000

No open weights.

Besides, I'm old enough to recall that META has trained a version of LLAMA 4 specifically for LM arena elo benchmaxxing and PR things, and proceeded to release a different version of LLAMA 4.

New comment by RandyOrion in "Google releases Gemma 4 open models"

RandyOrion — Fri, 03 Apr 2026 07:53:28 +0000

Thank you Gemma team for releasing small dense VLM(s).

The elo ranking [1] is too good to be true. I don't know why gemma-4-26b-a4b performs better than gemma-4-31b.

Also waiting for more bugfixes in llama.cpp, sglang and vllm to do proper evaluations.

[1] https://arena.ai/leaderboard/text/expert?license=open-source

New comment by RandyOrion in "Android Developer Verification"

RandyOrion — Tue, 31 Mar 2026 11:08:55 +0000

Please no.

If you want to install APKs directly on Android phones sold in China, you'll face even more draconian restrictions imposed by both Chinese OEMs and the Chinese government, e.g., you cannot install telegram [1], cannot install VPNs [2], get called by the local police station after installing VPNs [3], and so on. And you do not have the freedom to even talk about these restrictions without getting sued or censored.

[1] https://xcancel.com/whyyoutouzhele/status/168915238841261670...

[2] https://xcancel.com/whyyoutouzhele/status/197843066556268971...

[3] https://xcancel.com/whyyoutouzhele/status/170299205759627676...

New comment by RandyOrion in "Android Developer Verification"

RandyOrion — Tue, 31 Mar 2026 10:47:28 +0000

Yeah, let's hold Google accountable. Is there a way to invoke anti-trust laws?

New comment by RandyOrion in "Android Developer Verification"

RandyOrion — Tue, 31 Mar 2026 10:35:19 +0000

Thank you for standing against the Android Developer Verification enforced by Google. Now in addition to no longer using Youtube and replacing chrome with ungoogled chromium, I'm moving to de-googled AOSP builds, e.g., lineageOS, instead of stock OEM ROMs.

New comment by RandyOrion in "Copilot edited an ad into my PR"

RandyOrion — Tue, 31 Mar 2026 02:41:19 +0000

Wow, just wow.

1.5M records of PRs affected. Did Microsoft copilot ask users for permission to add ads inside their PRs before actually doing the thing? Did users give their consent on this matter?

Now EVERYONE can see ads disguised as PRs on GitHub. Did Microsoft ask everyone for permission to show ads before actually doing the thing? Did users give their consent on this matter?

Good taste Microslop.

New comment by RandyOrion in "Flash-MoE: Running a 397B Parameter Model on a Laptop"

RandyOrion — Mon, 23 Mar 2026 02:26:52 +0000

This project shows an interesting automated search for solutions to engineering problems, which I'd like to see more of.

The experience of utilizing tiered storage (gpu vram, ram, and ssd) is generally poor for a lot of LLM inference engines out there, e.g., llama.cpp, sglang, vllm, etc.

My own experience shows that both weight and KV cache offload to ram on sglang and vllm are unavailable or unusable. Copying extra parameters from the documentation and adding them to already working commands results in errors. Llama.cpp does support weight offload, but the experience is not pleasant: low pcie (gpu <-> ram) utilization, low gpu utilization, and really low tokens per second.

New comment by RandyOrion in "Something is afoot in the land of Qwen"

RandyOrion — Thu, 05 Mar 2026 04:12:47 +0000

First, thank you Junyang and Qwen team for your incredible work. You deserve better.

This is sad for the local LLM community. First we lost wizardLM, Yi and others, then we lost Llama and others, now we lost Qwen...

New comment by RandyOrion in "A CPU that runs entirely on GPU"

RandyOrion — Wed, 04 Mar 2026 19:19:09 +0000

Well, I don't have enough knowledge of the boot process of the RPi. However, I do expect that most modern hardware, e.g. x86, does not work like the RPi, so your words do not hold in most realistic scenarios, at least for now. Besides, do current GPUs (not only GPUs on the RPi) have the ability to self instruct in order to achieve what you said?