Hacker News: Patrick_Devine

New comment by Patrick_Devine in "Gemma 4 12B: A unified, encoder-free multimodal model"

Patrick_Devine — Wed, 03 Jun 2026 22:22:57 +0000

Given the model was just republished by Google 15 minutes ago and we're going to have to redo everything (and everyone will have to redownload for all platforms -- not just Ollama), I'll just say that sometimes things don't work out exactly the way you want them to. :-D

That said, I think the gemma4:12b-nvfp4 model is pretty solid. It's been tuned with Nvidia's model optimizer. I've been waiting on the results for MMLU-Pro, but I'll have to retrigger that after reconverting.

New comment by Patrick_Devine in "Gemma 4 12B: A unified, encoder-free multimodal model"

Patrick_Devine — Wed, 03 Jun 2026 20:30:15 +0000

I realize this is a little confusing; we're working w/ the MLX team to bring MLX to other platforms, but we're not quite there yet. The `gemma4:12b-nvfp4` model is specifically for the MLX engine.

For the GGUF 4-bit variant (i.e. non-Macs) you'll need `gemma4:12b-it-q4_K_M`, which I just pushed. You'll also need to upgrade to version 0.30.4, which we're just about to release (it's in prerelease and we're running through our last regression tests).

New comment by Patrick_Devine in "Gemma 4 12B: A unified, encoder-free multimodal model"

Patrick_Devine — Wed, 03 Jun 2026 20:19:33 +0000

I haven't yet pushed the MTP-enabled gemma4 12b model for Ollama because in my testing I wasn't getting a performance bump. The other gemma4 MTP models should work OK right now, but there are some fixes we're just about to push. This is specifically for the MLX backend.

New comment by Patrick_Devine in "Accelerating Gemma 4: faster inference with multi-token prediction drafters"

Patrick_Devine — Tue, 05 May 2026 18:17:46 +0000

In my testing the Gemma 4 31b model had the biggest speed boost in Ollama w/ the MLX runner for coding tasks (at about 2x). Unfortunately you'll need a pretty beefy Mac to run it because quantization really hurts the acceptance rate. The three other smaller models didn't perform as well because the validation time of the draft model ate up most of the performance gains. I'm still trying to tune things to see if I can get better performance.

You can try it out with Ollama 0.23.1 by running `ollama run gemma4:31b-coding-mtp-bf16`.
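To make the trade-off above concrete, here is a rough back-of-the-envelope sketch of the usual speculative-decoding speedup estimate. It is not how Ollama or the MLX runner measures anything. It assumes each drafted token is accepted independently with probability `alpha`, and the names `alpha`, `k`, `draft_cost`, and `verify_cost` are illustrative parameters, not settings from any real tool.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that the target model always contributes one
    token of its own (the correction or the bonus token).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float,
                      verify_cost: float = 1.0) -> float:
    """Estimated speedup over plain one-token-at-a-time decoding.

    draft_cost:  cost of drafting one token, relative to one ordinary
                 decode pass of the target model (assumed, e.g. 0.05).
    verify_cost: cost of one verification pass over k drafted tokens,
                 relative to one ordinary decode pass (assumed >= 1.0).
    """
    cost_per_step = k * draft_cost + verify_cost
    return expected_tokens_per_step(alpha, k) / cost_per_step


if __name__ == "__main__":
    # High acceptance and a cheap drafter: a clear win.
    print(estimated_speedup(alpha=0.8, k=4, draft_cost=0.05))
    # Lower acceptance (e.g. after quantization) and a relatively
    # expensive drafter: most of the gain disappears.
    print(estimated_speedup(alpha=0.5, k=4, draft_cost=0.2))
```

Under this simplified model, a drop in acceptance rate or a rise in relative draft and verification cost both pull the estimate toward 1x, which is consistent with the smaller models benefiting less.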

New comment by Patrick_Devine in "SFO Quiet Airport (2025)"

Patrick_Devine — Fri, 24 Apr 2026 18:54:25 +0000

I wish they would do this when you're boarding the plane. I get that there is essential information that everyone needs to know, but if you're a frequent flier you've probably heard the "put your larger carry-on in the overhead bin and your smaller bag underneath the seat in front of you" announcement hundreds, if not thousands, of times.

New comment by Patrick_Devine in "All 12 moonwalkers had "lunar hay fever" from dust smelling like gunpowder (2018)"

Patrick_Devine — Fri, 17 Apr 2026 19:59:58 +0000

Isn't this why NASA is developing the Electrodynamic Dust Shield [1] system?

[1] https://www.nasa.gov/image-article/nasas-dust-shield-success...

New comment by Patrick_Devine in "Qwen3.6-35B-A3B: Agentic coding power, now open to all"

Patrick_Devine — Thu, 16 Apr 2026 17:47:25 +0000

If you're on a Mac, use the MLX backend versions which are considerably faster than the GGML based versions (including llama.cpp) and you don't need to fiddle with the context size. The models are `qwen3.6:35b-a3b-nvfp4`, `qwen3.6:35b-a3b-mxfp8`, and `qwen3.6:35b-a3b-mlx-bf16`.

New comment by Patrick_Devine in "Ollama is now powered by MLX on Apple Silicon in preview"

Patrick_Devine — Tue, 31 Mar 2026 22:41:35 +0000

They are nvidia-fp4 weights, but CUDA support isn't _quite_ ready yet. We've got that cooking.

New comment by Patrick_Devine in "Ollama is now powered by MLX on Apple Silicon in preview"

Patrick_Devine — Tue, 31 Mar 2026 22:39:22 +0000

The 35b-a3b-coding-nvfp4 model has the recommended hyperparameters set for coding, not chatting. If you want to use it to chat, you can pull the `35b-a3b-nvfp4` model, which has the presence penalty turned on to stop it from thinking so much (it doesn't need to re-download the weights, so it will pull quickly). You can also try `/set nothink` in the CLI, which turns off thinking entirely.

New comment by Patrick_Devine in "Ollama is now powered by MLX on Apple Silicon in preview"

Patrick_Devine — Tue, 31 Mar 2026 22:31:41 +0000

Try it with mxfp8 or bf16. It's a decent model for doing tool calling, but I wouldn't recommend using it with 4 bit quantization.

New comment by Patrick_Devine in "Mac mini will be made at a new facility in Houston"

Patrick_Devine — Tue, 24 Feb 2026 23:29:31 +0000

I noticed the same thing. I'm assuming they forgot to photoshop out the Chinese characters.

New comment by Patrick_Devine in "I converted 2D conventional flight tracking into 3D"

Patrick_Devine — Wed, 18 Feb 2026 02:23:41 +0000

The Departing / Arrival airports plus a full track would be absolutely amazing.

New comment by Patrick_Devine in "IBM CEO says there is 'no way' spending on AI data centers will pay off"

Patrick_Devine — Wed, 03 Dec 2025 00:38:47 +0000

5 years is a normal-ish depreciation time frame. I know they are gaming GPUs, but the RTX 3090 came out ~4.5 years before the RTX 5090. The 5090 has double the performance and 1/3 more memory. The 3090 is still a useful card even after 5 years.
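A quick sanity check of that arithmetic, as a sketch. The launch dates (24 Sep 2020 for the RTX 3090, 30 Jan 2025 for the RTX 5090) and the memory sizes (24 GB and 32 GB) are figures I'm supplying here, not ones stated in the comment, so treat them as assumptions.

```python
from datetime import date

# Assumed retail launch dates and memory sizes (not from the comment).
RTX_3090_LAUNCH = date(2020, 9, 24)
RTX_5090_LAUNCH = date(2025, 1, 30)
RTX_3090_MEMORY_GB = 24
RTX_5090_MEMORY_GB = 32

# Gap between the two launches, in years.
years_between = (RTX_5090_LAUNCH - RTX_3090_LAUNCH).days / 365.25

# Memory of the newer card relative to the older one.
memory_ratio = RTX_5090_MEMORY_GB / RTX_3090_MEMORY_GB

if __name__ == "__main__":
    print(f"years between launches: {years_between:.2f}")
    print(f"memory ratio: {memory_ratio:.2f}x")
```

With these inputs the gap comes out a little under 4.5 years and the memory ratio at one third more, so the rough numbers in the comment hold up.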

New comment by Patrick_Devine in "Mistral 3 family of models released"

Patrick_Devine — Tue, 02 Dec 2025 21:59:13 +0000

The instruct models are available on Ollama (e.g. `ollama run ministral-3:8b`), but the reasoning models are still a WIP. I was trying to get them to work last night; they work for single turn, but are still very flaky w/ multi-turn.

New comment by Patrick_Devine in "Claude Is Down"

Patrick_Devine — Sat, 08 Nov 2025 17:50:31 +0000

The default ones on Ollama are MXFP4 for the feed-forward network and use BF16 for the attention weights. The default weights for llama.cpp quantize those attention tensors as q8_0, which is why llama.cpp can eke out a little bit more performance at the cost of worse output. If you are using this for coding, you definitely want better output.

You can use the command `ollama show -v gpt-oss:120b` to see the datatype of each tensor.
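If you'd rather check this from a script than the CLI, something like the sketch below might work against a local Ollama server. It assumes the default address `http://localhost:11434`, the `/api/show` endpoint with `verbose` set, and a response that carries a `tensors` list whose entries have a `type` field. That response shape is an assumption and may differ by version, so verify it against your install. The helper names are made up for this example.

```python
import json
import urllib.request
from collections import Counter


def build_show_request(model: str,
                       host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build (but do not send) a verbose /api/show request for a model."""
    body = json.dumps({"model": model, "verbose": True}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/api/show",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def count_tensor_types(show_response: dict) -> Counter:
    """Count tensors per datatype.

    Assumes the response has a "tensors" list of objects with a "type"
    key; returns an empty Counter if that list is missing.
    """
    tensors = show_response.get("tensors", [])
    return Counter(t.get("type", "unknown") for t in tensors)


def fetch_tensor_types(model: str) -> Counter:
    """Send the request to a running server and summarize the datatypes."""
    with urllib.request.urlopen(build_show_request(model)) as resp:
        return count_tensor_types(json.load(resp))


if __name__ == "__main__":
    # Requires a running Ollama server with the model already pulled.
    for dtype, count in fetch_tensor_types("gpt-oss:120b").most_common():
        print(dtype, count)
```

The request building and the counting are kept separate from the network call so the summary logic can be checked without a server.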

New comment by Patrick_Devine in "Gemma 3 270M: Compact model for hyper-efficient AI"

Patrick_Devine — Fri, 15 Aug 2025 17:05:39 +0000

We uploaded gemma3:270m-it-q8_0 and gemma3:270m-it-fp16 late last night which have better results. The q4_0 is the QAT model, but we're still looking at it as there are some issues.

New comment by Patrick_Devine in "Ollama Turbo"

Patrick_Devine — Tue, 05 Aug 2025 19:54:50 +0000

Ollama only uses llama.cpp for running legacy models. gpt-oss runs entirely in the Ollama engine.

You don't need to use Turbo mode; it's just there for people who don't have capable enough GPUs.

New comment by Patrick_Devine in "Ollama's new engine for multimodal models"

Patrick_Devine — Fri, 16 May 2025 06:40:03 +0000

I worked on the text portion of gemma3 (as well as gemma2) for the Ollama engine, and worked directly with the Gemma team at Google on the implementation. I didn't base the implementation on the llama.cpp implementation, which was done in parallel. We did our implementation in Go, and llama.cpp did theirs in C++. There was no "copy-and-pasting" as you are implying, although I do think collaborating on these new models would help us get them out the door faster. I am really appreciative of Georgi catching a few things we got wrong in our implementation.

New comment by Patrick_Devine in "Ollama's new engine for multimodal models"

Patrick_Devine — Fri, 16 May 2025 05:21:37 +0000

Wait, what hosted APIs is Ollama wrapping?

New comment by Patrick_Devine in "Gemma 3 QAT Models: Bringing AI to Consumer GPUs"

Patrick_Devine — Mon, 21 Apr 2025 17:57:03 +0000

The vision tower is 7GB, so I was wondering if you were loading it without vision?