<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: josefcub</title><link>https://news.ycombinator.com/user?id=josefcub</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 27 Apr 2026 16:48:16 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=josefcub" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by josefcub in "How good is Mac Studio M3 Ultra for Trillion param models like DeepSeekv4?"]]></title><description><![CDATA[
<p>I've got 256GB of RAM on a Mac Studio M3 Ultra.  Other posters are right: the M3 Ultra's prefill is super slow with really large models, taking 3-5 minutes to digest new additions to the context before it continues generating.  On my high-RAM machine I _can_ run 400b-500b models at Q2, and up to roughly 750b models at Q1, but the wait isn't the worst part.<p>Quants that low degrade the output, making the model less capable overall and prone to forgetting things.<p>Here's what I'd do with 96GB of RAM: run Qwen 3.6 35b-a3b at Q8 for coding/agentic tasks.  You'll get around 70 tokens/sec of generation, the prefill is lightning fast in comparison, and you'll get a lot of work done.  Qwen 3.6 27b is out now too; I'm getting 17 tok/sec generation from it, with a slower prefill.<p>The upshot is that you'll still have 20-40GB of RAM left for your workstation and development loads.  Running Qwen 3.6 35b or 27b at Q8 with 128k context, the model uses about 40GB of RAM; my OS and application load takes 20-30GB most of the time, for a total of 60-70GB.  That leaves plenty of room in memory for you to work _and_ run inference.<p>You _may_ get DeepSeek 4 Flash running, but it'll have to be at a lower quantization like Q2 or Q3, making it kind of dumb in comparison, and you may not have enough memory left over for any appreciable amount of context.  Today's reasoning models need context to generate good answers; doubly so for agentic/coding tasks.</p>
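<p>If you want to sanity-check these RAM figures, here's a rough back-of-the-envelope estimator.  It's my own sketch, not a benchmark: the bits-per-weight values are approximate averages for llama.cpp-style quants, and the flat overhead is a stand-in for KV cache and runtime costs, which vary by model and context size.<p><pre><code># Approximate RAM needed to load a quantized model: weights plus a
# flat fudge factor for KV cache and runtime overhead.
BITS_PER_WEIGHT = {"Q1": 1.6, "Q2": 2.6, "Q3": 3.4, "Q4_K_M": 4.8, "Q8": 8.5}

def est_ram_gb(params_b, quant, overhead_gb=5.0):
    """params_b is the parameter count in billions."""
    weights_gb = params_b * BITS_PER_WEIGHT[quant] / 8  # GB ~ params * bits / 8
    return weights_gb + overhead_gb

for params_b, quant in [(35, "Q8"), (500, "Q2"), (750, "Q1")]:
    print(f"{params_b}b @ {quant}: ~{est_ram_gb(params_b, quant):.0f} GB")
</code></pre><p>That pencils out to roughly 42GB for the 35b at Q8 and ~170GB for a 500b at Q2, which is why the former fits comfortably on a 96GB box and the latter needs something like my 256GB one.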
]]></description><pubDate>Sat, 25 Apr 2026 02:34:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=47898106</link><dc:creator>josefcub</dc:creator><comments>https://news.ycombinator.com/item?id=47898106</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47898106</guid></item><item><title><![CDATA[New comment by josefcub in "Best Open Source Offline AI Agent"]]></title><description><![CDATA[
<p>My google-fu is failing me at the moment, so I can't cite sources, but here's an example ~/.config/crush/crush.json (based on my own) showing the options that disable telemetry and provider auto-updates, plus the connection info for a localhost model behind an OpenAI-compatible endpoint:<p><pre><code>{
  "$schema": "<a href="https://charm.land/crush.json" rel="nofollow">https://charm.land/crush.json</a>",
  "options": {
    "disable_provider_auto_update": true,
    "disable_metrics": true
  },
  "providers": {
    "ollama": {
      "name": "Local Models",
      "base_url": "http://localhost:11434/v1",
      "api_key": "nunya",
      "type": "openai-compat",
      "models": [
        {
          "name": "Qwen 3.5 Local",
          "id": "qwen-3.5-35b-planning",
          "cost_per_1m_in": 0.01,
          "cost_per_1m_out": 0.01,
          "context_window": 131072,
          "think": true,
          "default_max_tokens": 5120,
          "supports_attachments": true
        }
      ]
    }
  }
}</code></pre>
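<p>Once that's in place, it's worth sanity-checking the endpoint before pointing crush at it.  A minimal sketch using only the standard library, assuming ollama is serving its OpenAI-compatible API on the default port and that the model id matches the one above:<p><pre><code>import json
import urllib.request

# Same endpoint, api key, and model id as the crush.json above.  ollama
# ignores the key, but the OpenAI-compatible route accepts the header.
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps({
        "model": "qwen-3.5-35b-planning",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
    }).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer nunya"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

print(reply["choices"][0]["message"]["content"])
</code></pre>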
]]></description><pubDate>Thu, 09 Apr 2026 17:47:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=47706891</link><dc:creator>josefcub</dc:creator><comments>https://news.ycombinator.com/item?id=47706891</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47706891</guid></item><item><title><![CDATA[New comment by josefcub in "Best Open Source Offline AI Agent"]]></title><description><![CDATA[
<p>Try charmbracelet's crush, found here:<p><a href="https://github.com/charmbracelet/crush" rel="nofollow">https://github.com/charmbracelet/crush</a><p>Crush is pretty new but getting better all the time.  It's written in Go, so there are no node hijinks to get it working.  It works fine with my ollama and llama-server localhost endpoints, and I've used it to build a couple of internal projects without any issues.<p>It does have internal telemetry (and phones home to update its list of external models), but both can be turned off in the crush.json configuration file.<p>If you're on a Mac, you can install it via Homebrew or take the more traditional route via GitHub.</p>
]]></description><pubDate>Thu, 09 Apr 2026 14:58:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=47704616</link><dc:creator>josefcub</dc:creator><comments>https://news.ycombinator.com/item?id=47704616</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47704616</guid></item><item><title><![CDATA[New comment by josefcub in "Ask HN: Anyone Using a Mac Studio for Local AI/LLM?"]]></title><description><![CDATA[
<p>I am!  I moved from a shoebox Linux workstation with 32GB of RAM and a 12GB RTX 3060 to a 256GB M3 Ultra, mainly for the unified memory.<p>I've only had it a couple of months, but so far it's proving its worth in the quality of LLM output, even quantized.<p>I generally run Qwen3-vl 235b at Q4_K_M quantization so that it fits, which leaves me plenty of RAM for workstation tasks while delivering around 30 tok/s.<p>I use the smaller Qwen3 models (like qwen3-coder) in tandem; they run much faster, and I tend to run them at higher quants, up to Q8, for quality.<p>The biggest boon of all that RAM, I've found, is being able to run models with full context allocated, which lets me hand them larger and more complicated work than I could before.  That alone makes the money I spent worth it, IMO.<p>I did manage to get glm-4.7 (a 358b model) running at Q3; its output is adequate quality-wise, though it only delivers 15 tok/s, and I had to cut the context down to 128k to leave enough room for the desktop.<p>Something this big is a powerhouse, but not nearly as much of one as a dedicated Nvidia GPU rig.  The point is to run models _adequately_, not at production speeds, and get your work done.  I found the price/performance/energy trade-off compelling at this level, and I'm very satisfied.</p>
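<p>For anyone wondering why full context eats so much headroom: the KV cache grows linearly with context length.  A quick sketch below; the layer/head/dim numbers are illustrative placeholders for a big model, not glm-4.7's real hyperparameters:<p><pre><code># KV cache bytes ~= 2 (K and V) * layers * kv_heads * head_dim
#                   * bytes_per_element * context_length
def kv_cache_gb(layers, kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx_len / 1e9

# Illustrative hyperparameters only, with an fp16 (2-byte) cache.
for ctx_len in (32_768, 131_072, 262_144):
    print(f"{ctx_len:>7} tokens: ~{kv_cache_gb(60, 8, 128, ctx_len):.0f} GB")
</code></pre><p>With those made-up but plausible numbers, 128k context alone costs ~32GB, which is why halving or quartering context frees up so much room for the desktop.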
]]></description><pubDate>Fri, 06 Feb 2026 17:09:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=46915397</link><dc:creator>josefcub</dc:creator><comments>https://news.ycombinator.com/item?id=46915397</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46915397</guid></item><item><title><![CDATA[New comment by josefcub in "Show HN: AI or Human"]]></title><description><![CDATA[
<p>I think this is fun!  I ended up with 92/102; a few were non-obvious.  If I had one critique, it's that the AI entries were repeated on multiple occasions, making the choice between them trivial.<p>If you can find some public domain literature from the 20th century, it would be a much harder game.</p>
]]></description><pubDate>Tue, 07 Oct 2025 02:28:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=45498800</link><dc:creator>josefcub</dc:creator><comments>https://news.ycombinator.com/item?id=45498800</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45498800</guid></item><item><title><![CDATA[New comment by josefcub in "Ask HN: What are you working on this weekend?"]]></title><description><![CDATA[
<p>I'd listen to that on the radio! LOL<p>2.3L 4-cylinder.  Good enough for the light-truck tasks I usually end up with.  It's from the last year of the first-generation Rangers.  I had a 1988 model with a 4-cylinder when I was young, and -that- one trucked along through all the youthful abuse I threw at it.  I'm really looking forward to enjoying this one too.</p>
]]></description><pubDate>Tue, 20 Jun 2023 02:34:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=36399069</link><dc:creator>josefcub</dc:creator><comments>https://news.ycombinator.com/item?id=36399069</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36399069</guid></item><item><title><![CDATA[New comment by josefcub in "Ask HN: What are you working on this weekend?"]]></title><description><![CDATA[
<p>I'm going to buck trends and get my hands dirty -- the oil needs changing and the O2 sensor needs replacing on an old (1992) Ford Ranger pickup truck I recently acquired.  I bought it for a song, and it's definitely not a looker, but it's still going strong (especially after the small age-related repairs I'm doing).</p>
]]></description><pubDate>Fri, 16 Jun 2023 15:36:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=36358421</link><dc:creator>josefcub</dc:creator><comments>https://news.ycombinator.com/item?id=36358421</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36358421</guid></item></channel></rss>