Hacker News: sleepyeldrazi

New comment by sleepyeldrazi in "Qwen 3.7 Preview"

sleepyeldrazi — Mon, 18 May 2026 18:10:23 +0000

Finetuning takes few resources; training the base model is the slow and expensive part. Architecturally, the 3.5 models are identical to their 3.6 counterparts, which is why the consensus is that the 3.6 models are probably finetunes and not re-trained from scratch, much like the finetunes you will see many people publish on huggingface.

New comment by sleepyeldrazi in "Qwen 3.7 Preview"

sleepyeldrazi — Mon, 18 May 2026 18:07:12 +0000

The best thing I have come up with is to just make a bunch of prompts / tasks that I personally care about and need a model to know how to do. As an example, when qwen3.6 27B dropped, I ran it, kimi, claude and glm 5/5.1 on a bunch of LLM-architecture-specific tasks (stuff like 'implement an incremental KV-cache for autoregressive transformer inference' or 'implement Flash Attention backward pass with D-optimization') and analyzed the results: who wrote tests, whether the tests are valid, whether their implementation actually works or they only claim it does, that sort of thing.

It is a day/weekend's worth of work, but I think this is the best way to determine if the model fits your needs specifically. This is what led me to finding out that qwen 27b outperformed even kimi on those tasks, and that opus tries gaslighting me when I give it a spec of something that has been proven possible, but for which no published solution exists online. All other models gave their best shot at solving it; opus just said it's not possible (even when I gave it the finished product that obviously works).

Especially for small models (but also big ones), I think this is the only way to know if a model will improve your workflow: personal benchmarks, expanded over time, run in private.
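
A minimal sketch of what such a private benchmark loop could look like, in Python. Everything here is hypothetical: `Task`, `run_benchmark`, `pass_rate` and the two stub models are names made up for illustration, and the `check` function is a trivial placeholder for the manual review described above (are the tests valid, does the implementation actually run). In practice each stub would be replaced by a call to whatever local server or API you use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the answer is acceptable


def run_benchmark(
    models: Dict[str, Callable[[str], str]],
    tasks: List[Task],
) -> Dict[str, Dict[str, bool]]:
    """Run every task against every model and record pass/fail per task."""
    results: Dict[str, Dict[str, bool]] = {}
    for model_name, ask in models.items():
        results[model_name] = {}
        for task in tasks:
            try:
                answer = ask(task.prompt)
                results[model_name][task.name] = bool(task.check(answer))
            except Exception:
                # A crash while asking or checking counts as a failure.
                results[model_name][task.name] = False
    return results


def pass_rate(results: Dict[str, Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of tasks passed per model (assumes at least one task)."""
    return {name: sum(r.values()) / len(r) for name, r in results.items()}


# Stand-in "models" so the sketch runs offline (hypothetical stubs).
def stub_good(prompt: str) -> str:
    return "def kv_cache_step(k, v, cache): ..."


def stub_refusing(prompt: str) -> str:
    return "That is not possible."


tasks = [
    Task(
        name="kv_cache",
        prompt="implement an incremental KV-cache for autoregressive "
               "transformer inference",
        # Placeholder check: a real one would run the code and its tests.
        check=lambda answer: "def " in answer,
    ),
]

results = run_benchmark(
    {"stub_good": stub_good, "stub_refusing": stub_refusing}, tasks
)
print(pass_rate(results))
```

The point of keeping `check` as a plain callable is that it can start as a manual yes/no and later grow into actually executing the model's code, without changing the loop.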

New comment by sleepyeldrazi in "Qwen 3.7 Preview"

sleepyeldrazi — Mon, 18 May 2026 17:32:20 +0000

I don't think I can handle another small model release by qwen, I'm still trying to find the limits of 3.6 27B and they are already threatening us with a new one?

But jokes aside, I love the fast iteration, these are most probably again finetunes on the 3.5 architecture that appear better in internal testing, which is still very nice to see. Putting more and more pressure on the bigger labs to perform better is always a good thing.

New comment by sleepyeldrazi in "Apple Silicon costs more than OpenRouter"

sleepyeldrazi — Sun, 17 May 2026 18:18:59 +0000

I feel like if I had the infrastructure and saw that there is huge interest in the model, I'd just undercut alibaba's prices a little harder to grab all the consumers. I am sure that the providers have done the math and found that there is a reason not to do this (compute-bound if too many users?), but the delta is very stark, especially for output. Last I checked, the cheapest 27b on openrouter was $2 out vs $0.38 for the 31b.

But I do agree that the openrouter prices aren't a strong signal, and I probably should have worded it a little better. It's just a really stark and 'in your eyes' gap.

New comment by sleepyeldrazi in "Apple Silicon costs more than OpenRouter"

sleepyeldrazi — Sun, 17 May 2026 14:14:12 +0000

If you want a good dense model, use qwen3.6 27B instead. Speed will be up, and if you don't take my word for it being smarter, let openrouter's prices for it against the bigger, slower and less memory-efficient gemma do the talking.

If you want a faster model, go for qwen3.6 35B (or gemma 4 26B if gemma models perform better for your tasks). There is a reason why people (myself included) haven't shut up about those two (especially the 27B). It's small enough to run at a decent speed (especially with the built-in MTP that finally has official llama.cpp support), and for many workloads (every benchmark I have ever thrown at it) it is matching or surpassing models it has no right to.

A couple of days ago I woke up with my internet being down, started 27B in pi, told it to diagnose what's wrong by giving it my router's password, went to grab a coffee, and by the time I got back I had a full report with suggestions on how to proceed. I love openrouter and I use it for many things, but it is not cheaper.

Naturally, all of this is subjective and based on my personal experience with those models. I assume the 31B gemma has cases in which it edges ahead; I've just failed to find any, and I have been running all 4 models mentioned nonstop for different tasks since hours after each of them dropped. Hell, for my hermes, I started getting better results once I switched from gemma 4 26B to qwen3.5 9B, not even the massively improved 3.6 series. It just feels outdated/cherrypicked to not use what is by many accounts the current consumer-hardware SOTA when doing such an analysis.

New comment by sleepyeldrazi in "Orthrus-Qwen3: up to 7.8×tokens/forward on Qwen3, identical output distribution"

sleepyeldrazi — Sat, 16 May 2026 20:04:16 +0000

It is actually very exciting that they are also working on 3.5, I will keep this toy project up in the meantime, trying it out and testing things around it helps me learn a bunch.

As for the idea of treating them as a block, that was my initial plan, but the GatedDeltaNet is doing most of the work in 3.5. Trying to bundle them together would hurt acceptance rates drastically, potentially leaving the speed benefit not much bigger than, or even smaller than, that of the native MTP.
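
To make the acceptance-rate tradeoff concrete, here is a rough sketch using the standard speculative-decoding estimate of tokens per verification pass. It assumes each drafted token is accepted independently with the same probability, which is a simplification and not a description of how Orthrus or the native MTP actually behave; the function name and the numbers in the usage note are made up for illustration.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Simplifying assumptions (not taken from the Orthrus paper):
    - each drafted token is accepted independently with prob `accept_rate`
    - acceptance stops at the first rejected token
    - the verifying pass always yields one token itself
    Under those assumptions the expectation is the geometric sum
    1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Hypothetical comparison: short draft with high acceptance
# versus long draft with low acceptance.
short_high = expected_tokens_per_pass(accept_rate=0.8, draft_len=4)
long_low = expected_tokens_per_pass(accept_rate=0.5, draft_len=8)
print(f"4 drafted tokens at 80% acceptance: {short_high:.2f} tokens/pass")
print(f"8 drafted tokens at 50% acceptance: {long_low:.2f} tokens/pass")
```

With these made-up numbers the shorter draft comes out ahead (about 3.36 versus about 2.00 tokens per pass), which is the effect described above: drafting more tokens does not help if acceptance drops enough.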

New comment by sleepyeldrazi in "Orthrus-Qwen3: up to 7.8×tokens/forward on Qwen3, identical output distribution"

sleepyeldrazi — Sat, 16 May 2026 18:51:17 +0000

Think of this as another way of achieving that. This theoretically has a higher ceiling on how much it can predict at a time. More importantly, it is a lot more memory efficient during actual inference.

New comment by sleepyeldrazi in "Orthrus-Qwen3: up to 7.8×tokens/forward on Qwen3, identical output distribution"

sleepyeldrazi — Sat, 16 May 2026 18:02:11 +0000

If anyone is interested in watching my 0.8B experiments: https://orthrus.kokoham.com/ . The current code is here: https://git.kokoham.com/sleepy/qwen_orthrus .

The hard part was that the original Orthrus works with transformers, but 3.5 (and 3.6) is hybrid: 75% GatedDeltaNet + 25% GatedAttention. I am testing a trick that might make it work with the GatedDeltaNet, and dry runs are promising, but only a full train will reveal if it works. More information is in the repo and on the site under the "What is this all about?" button.

Note: I may restart it or try different configs at different points. If the site is down, there is probably some sort of result/conclusion in the repo.

New comment by sleepyeldrazi in "Orthrus-Qwen3: up to 7.8×tokens/forward on Qwen3, identical output distribution"

sleepyeldrazi — Sat, 16 May 2026 09:23:06 +0000

My plan is to first check whether it works at all using qwen3.5 0.8B on my 3090 (it has the same architecture as qwen3.6 27b, just scaled down a bit). If it does, I'll make a git repo about the process in case anyone wants to use my approach, while I try to convince my uni to lend me h100s for a day.

New comment by sleepyeldrazi in "Orthrus-Qwen3: up to 7.8×tokens/forward on Qwen3, identical output distribution"

sleepyeldrazi — Sat, 16 May 2026 07:39:07 +0000

Scratch that, I don't have that kind of money, and 3.5's architecture is a little more divergent from 3's, so it will be a bit less trivial. It does look possible, just not on a student's paycheck.

New comment by sleepyeldrazi in "Orthrus-Qwen3: up to 7.8×tokens/forward on Qwen3, identical output distribution"

sleepyeldrazi — Sat, 16 May 2026 07:18:28 +0000

From a quick and shallow read of the paper, adapting it to qwen3.6 27B looks very feasible (with a little tinkering). The process looks somewhat similar to training a LoRA, or in a way distilling your own model so that a mini model learns how to imitate it, and then you glue them together. I might bite the bullet and rent a gpu to do it for 3.6 27b, as this will solve a lot of my problems.

New comment by sleepyeldrazi in "Show HN: Find the best local LLM for your hardware, ranked by benchmarks"

sleepyeldrazi — Fri, 15 May 2026 11:03:08 +0000

I love this community. I started building a simple website for exactly this a couple of hours ago and you have already made an even more advanced version. Hats off to you sir.

If I ever decide to actually publish the site, is it alright if I mention you somewhere, as in "If you want a more accurate estimation, check out this project:"? I think there is still value in a simple website that estimates this information for you and gives you instructions/common flags for starting the model yourself (plus a prompt crafted for you to optionally give to an llm to set it up for you). But I'm going off a simple "choose an os, gpu/vram, here's a list of options" flow and not actually scanning the hardware, which is a lot more accurate.

New comment by sleepyeldrazi in "Running local LLMs offline on a ten-hour flight"

sleepyeldrazi — Tue, 28 Apr 2026 07:29:58 +0000

I specifically tested on tasks I designed because I know every modern model, not only the local ones, is benchmaxxed. The common benchmarks most labs use are (very likely) in their datasets to a degree (I'm assuming unintentionally, but it is still highly probable), and there was a recent report by people at UC Berkeley on how easy it is to actually cheat them: https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

That is precisely why my testing has been daily driving the model for everything + 8 tasks in a domain I care about. Could there be something very similar in their datasets? Of course, at least for most of the tasks, but if that led to the good performance and results I'm getting, I am personally ok with that. I don't care how high the numbers are on the common benchmarks, only whether it works well enough for me.

And if this model doesn't work for you, that's perfectly ok. Everyone has different needs from models. I was just impressed that it did for me, as it was a first from a local model.

New comment by sleepyeldrazi in "Running local LLMs offline on a ten-hour flight"

sleepyeldrazi — Mon, 27 Apr 2026 21:16:54 +0000

I haven't honestly dug around to figure out if there's a hardware reason for it, but prompt processing has always been a lot slower for me on macs in general. I mostly use MLX on my 24GB M4 Pro though, so I will pull llama.cpp on it as well to see what the prefill is like.

I've gotten around 16 t/s generation with 4bit and mxfp4 on that model. The 3090 I mentioned has a little over 900 GB/s of memory bandwidth, while those macs I think are around 270 GB/s. If my understanding is correct, macs do utilize the bandwidth better in this case, but it still doesn't make up the difference (on the 3090 it's around 30-35 t/s depending on size of ctx).
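
A back-of-the-envelope check of that bandwidth argument, as a sketch. It assumes generation is purely memory-bandwidth-bound (every token streams all weights once, ignoring KV-cache reads and compute), that the model in question is the 27B dense one, and that a 4-bit quant averages about 4.5 bits per weight. Those three assumptions are mine; the bandwidth and t/s figures are the rough ones quoted above, with 32.5 t/s taken as the midpoint of 30-35.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each token must read all weights once."""
    return bandwidth_gb_s / weights_gb


# ASSUMPTIONS (not measured): 27B parameters, ~4.5 bits per weight.
PARAMS = 27e9
BITS_PER_WEIGHT = 4.5
WEIGHTS_GB = PARAMS * BITS_PER_WEIGHT / 8 / 1e9  # about 15.2 GB

# name -> (approx. memory bandwidth in GB/s, observed generation t/s)
setups = {
    "3090": (900.0, 32.5),
    "M4 Pro": (270.0, 16.0),
}

for name, (bandwidth, measured) in setups.items():
    ceiling = decode_ceiling_tps(bandwidth, WEIGHTS_GB)
    print(
        f"{name}: ceiling {ceiling:.1f} t/s, "
        f"measured {measured} t/s, "
        f"{measured / ceiling:.0%} of ceiling"
    )
```

Under these assumptions the mac lands at roughly 90% of its theoretical ceiling and the 3090 at roughly 55%, which lines up with the impression that the mac uses its bandwidth better but still loses in absolute speed.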

Also, do run a quick experiment removing the cache quants if you want to tinker with it a bit more, iirc KV quant does add a small overhead during prefill.

I would be very interested to know your prefill and generation numbers.

New comment by sleepyeldrazi in "Running local LLMs offline on a ten-hour flight"

sleepyeldrazi — Mon, 27 Apr 2026 17:11:18 +0000

I have been testing and using Qwen3.6 27B (running from my 3090) since it dropped, and I genuinely think this is the first consumer hardware-grade model that can actually replace frontier models for a lot of workloads.

I ran 8 tests on a variety of open-weights models and on opus 4.7 (1mil ctx version), and the little dense model was right behind opus: https://github.com/sleepyeldrazi/llm_programming_tests/tree/... Of note is that opus was the only model to push back against the spec on the hardest challenge, saying 'that's not possible', when there are links in the spec to examples of it being done.

There may be problems with the mlx versions, as I haven't had any looping in all the testing I've done, which is all my agentic and coding work over the last couple of days (since it dropped). I have had tool_call misses 4 or 5 times so far, which isn't ideal, but no looping. First I used it in pi-mono, and later, when I realized it's a serious model, I switched to opencode.

My setup is llama.cpp running on a 3090 in WSL, unsloth IQ4_NL, with these flags:

  --ctx-size 128000 \
  --jinja \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --repeat-penalty 1.0 \
  --presence-penalty 0.0 \
  --threads 12 \
  --gpu-layers 99 \
  --no-warmup \
  --no-mmap \
  -fa on

New comment by sleepyeldrazi in "Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model"

sleepyeldrazi — Thu, 23 Apr 2026 09:33:01 +0000

I have been getting good results with IQ4_NL and TurboQuant at 4bits on 24gb (3090). It easily fits 256k with that setup, but it starts slowing down quite a bit after 80-100k. Quality in my testing is also still good:

- Coding task test: https://github.com/sleepyeldrazi/llm_programming_tests/

- Design task test: https://github.com/sleepyeldrazi/llm-design-showcase

Coding was against minimax-m2.7 and glm-5, and the design against other small models

New comment by sleepyeldrazi in "Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model"

sleepyeldrazi — Thu, 23 Apr 2026 09:27:06 +0000

I ran 3 prompts (short versions, full version in the repo):

- Implement a numerically stable backward pass for layer normalization from scratch in NumPy.

- Design and implement a high-performance fused softmax + top-k kernel in CUDA (or CUDA-like pseudocode).

- Implement an efficient KV-cache system for autoregressive transformer inference from scratch.

and tested Qwen3.6-27B (IQ4_NL on a 3090) against MiniMax-M2.7 and GLM-5 with kimi k2.6 as the judge (imperfect, I know, it was 2AM). Qwen surpassed minimax and won 2/3 of the implementations against GLM-5 according to kimi k2.6, which still sounds insane to me. The env was a pi-mono with basic tools + a websearch tool pointing to my searxng (I don't think any of the models used it), with a slightly customized shorter system prompt. TurboQuant was at 4bit during all qwen tests. Full results: https://github.com/sleepyeldrazi/llm_programming_tests
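
For a sense of what the first prompt is asking for, here is a minimal NumPy sketch of a layer norm backward pass with a finite-difference check. This is an illustrative sketch only: it is not any model's output and not the reference from the repo, and all function names are made up. The variance is computed from centered values and the backward pass uses the fused closed form, which is one common way to keep it numerically well behaved.

```python
import numpy as np


def layernorm_forward(x, gamma, beta, eps=1e-5):
    """Layer norm over the last axis; returns output and a backward cache."""
    mu = x.mean(axis=-1, keepdims=True)
    xc = x - mu
    # Center first, then square: avoids E[x^2] - E[x]^2 cancellation.
    var = (xc * xc).mean(axis=-1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = xc * inv_std
    return x_hat * gamma + beta, (x_hat, inv_std, gamma)


def layernorm_backward(dy, cache):
    """Gradients w.r.t. x, gamma and beta given upstream gradient dy."""
    x_hat, inv_std, gamma = cache
    d = x_hat.shape[-1]
    dgamma = (dy * x_hat).reshape(-1, d).sum(axis=0)
    dbeta = dy.reshape(-1, d).sum(axis=0)
    dxhat = dy * gamma
    # Fused form: no separate dvar/dmu terms to accumulate error in.
    dx = inv_std * (
        dxhat
        - dxhat.mean(axis=-1, keepdims=True)
        - x_hat * (dxhat * x_hat).mean(axis=-1, keepdims=True)
    )
    return dx, dgamma, dbeta


def numerical_grad(f, arr, h=1e-5):
    """Central-difference gradient of scalar f() w.r.t. arr (edited in place)."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        old = arr[idx]
        arr[idx] = old + h
        f_plus = f()
        arr[idx] = old - h
        f_minus = f()
        arr[idx] = old
        grad[idx] = (f_plus - f_minus) / (2.0 * h)
    return grad


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gamma = rng.normal(size=8)
beta = rng.normal(size=8)
w = rng.normal(size=(4, 8))  # random upstream gradient dL/dy


def loss():
    y, _ = layernorm_forward(x, gamma, beta)
    return float((y * w).sum())


_, cache = layernorm_forward(x, gamma, beta)
dx, dgamma, dbeta = layernorm_backward(w, cache)
max_err = np.abs(dx - numerical_grad(loss, x)).max()
print(f"max |analytic - numerical| for dx: {max_err:.2e}")
```

A gradient check like this is also the kind of test worth looking for in a model's answer: an implementation that ships without one, or with one that never compares against finite differences, is only claiming to work.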

I am also periodically testing small models in a https://www.whichai.dev style task to see their designs, and qwen3.6 27B also obliterated (imo) the other ones I tested https://github.com/sleepyeldrazi/llm-design-showcase .

Needless to say, those tests are non-exhaustive and have flaws, but the trend from the official benchmarks looks like it is being confirmed in my testing. If only it were a little faster on my 3090; we'll see how it performs once a DFlash for it drops.

The only non-LLM-generated file in my repo

sleepyeldrazi — Fri, 10 Apr 2026 15:37:58 +0000

Article URL: https://github.com/sleepyeldrazi/little_helper_tui/blob/main/letter.md

Comments URL: https://news.ycombinator.com/item?id=47719744

Points: 2

# Comments: 1

New comment by sleepyeldrazi in "Xbox co-pilot mode changed disabled sister’s life"

sleepyeldrazi — Thu, 03 Jun 2021 13:09:27 +0000

If he can comfortably move his head, might I suggest something VR related? Most of the first VR games that came out for mobile (i.e. using your phone as a VR screen) couldn't utilise a controller, so they used head movement for controlling things. Usually you had a small crosshair in the middle with which you could select things by simply holding a position. I sadly can't remember the names of the games I tried, but you might find them (or newer ones for newer headsets) through google. I know it's not a complete substitute, but if they miss mountain walks, there are VR experiences (often videos) which simulate that. Might be interesting to try. I'm not aware of any, but there might even be multiplayer ones, so that his wife can join him.