Hacker News: coder543

New comment by coder543 in "DeepSeek V4 Flash 0731 Intelligence, Performance and Price Analysis"

coder543 — Fri, 31 Jul 2026 12:47:09 +0000

The weights were just released a few minutes ago: https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-0731

New comment by coder543 in "Kimi K3 is not cheap"

coder543 — Sun, 26 Jul 2026 20:44:05 +0000

But if you compare to Anthropic's models? The cost difference is huge. Anthropic is clearly concerned that people are realizing they are expensive, since the Opus 5 blog post dedicated a lot of time to talking about how cheap the model was compared to the competition... but this doesn't hold water when I haven't seen any independent benchmarks claiming Opus 5 is cheaper than GPT-5.6-Sol, even if it is supposedly closer.

GPT-5.6-Sol is pretty competitively priced, but not all American frontier models are, and even 10% to 30% is still significant for any commodity that's as fungible as frontier models often are.

> as you say that could go down to 20-30% cheaper

I never said anything about 20% to 30%. We don't know how much it actually costs to host this model yet, and that will determine the final price. It could be just a little less, or it could be a lot less.

> once you account for quantisation

There will be no need to account for quantization. Kimi models have been 4-bit only since at least K2.5. They don't release or serve models in higher precision than that. This isn't one of those situations where LLM inference providers are debating between serving 16-bit, 8-bit, or 4-bit, and I have never seen a publicly hosted, paid model that was hosted in less than 4-bit, even if hobbyists will use sub-4-bit quantizations sometimes locally.

New comment by coder543 in "Kimi K3 is not cheap"

coder543 — Sun, 26 Jul 2026 19:55:46 +0000

This article seems premature to post. Right now, the price is arbitrarily set by a single provider. Why wouldn't Moonshot collect extra revenue during this exclusivity period when they knew there would be hype?

The model weights are supposed to release tomorrow.

Over the next several weeks, I would expect competition among open weight providers to drive down the cost, as I've seen happen with other open weight model releases.

New comment by coder543 in "Apple's new SpeechAnalyzer API, benchmarked against Whisper and its predecessor"

coder543 — Mon, 13 Jul 2026 16:42:22 +0000

At this point, I would not recommend ignoring Parakeet TDT 0.6b v2/v3 (english-only versus multilingual). Those models have been out for a year, give or take, and they are both accurate and fast. I would choose Parakeet over Whisper in almost all situations these days. Parakeet works great even on my several year old iPhone 15 Pro Max, so if an app is going to ship a dedicated model, I strongly recommend investigating Parakeet.

On the more cutting edge front, Granite Speech 4.1 has proven to be a reliable workhorse for me, but it is larger than Parakeet. Cohere Transcribe is interesting, but how strong it is seems to vary more from task to task.

Parakeet Unified 0.6B came out a few months ago, combining both online streaming and offline transcription into one model, and that is one that I need to test more, but it seems promising.

As others have mentioned macOS 27/iOS 27 is supposed to have a new model, particularly on devices with 12GB of RAM or more. I have not actually seen the option to enable that new model yet, though, despite being on the beta on a device that meets the requirements. Maybe a benchmark would reveal that it is already active?

New comment by coder543 in "Qwen 3.6 27B is the sweet spot for local development"

coder543 — Wed, 01 Jul 2026 13:17:22 +0000

It’s not FUD. It is my actual, lived experience. FUD is false, which this is not.

I use both vLLM and llama-server. vLLM is very painful, even with the Spark community docker image. It is slow to start, it does not support 3-bit dynamic quants well, and it takes a lot of tweaking to get it to run well for each model I want to try out, which is made worse by the slow starts.

I’m glad you’ve had a better experience? I can only speak to the experiences that I have had repeatedly. For at least a month, people on the official Spark forum were claiming you just couldn’t run MiMo-V2.5 on a single Spark, because they refused to use anything other than vLLM, while I was doing it just fine on llama-server with 200k+ of context.

And llama-server is “worse” in what specific ways? I was specific with my comment. The usual complaint was the lack of MTP/Eagle3 support in llama-server, but that is solved now. Now the main difference is a minor hit to prompt processing speed, at most, if you’re using a single Spark.

Too many people on the Spark forum are closed minded to the idea that vLLM is not the solution to every problem.

llama-server also comes with a truly excellent built-in web chat interface these days, which includes the ability to connect to MCPs so the models can be used agentically through a conversational interface even from my phone. What does vLLM offer? Yeah… nothing. And options like Open WebUI seem really bloated.

For a cluster of multiple Sparks, the pain of vLLM is still worthwhile, as I already said before. Or if you’re running some kind of major production workload, I guess? Instead of a single user, few agent setup like most people.

New comment by coder543 in "Qwen 3.6 27B is the sweet spot for local development"

coder543 — Mon, 29 Jun 2026 23:57:48 +0000

Unsloth Studio is also very low effort, and a lot better than LM Studio in my opinion. (Performance, compatibility with Gemma 4, actually open source, etc.)

New comment by coder543 in "Qwen 3.6 27B is the sweet spot for local development"

coder543 — Mon, 29 Jun 2026 23:17:19 +0000

Compared to a dynamic quant like Unsloth's UD-Q4_K_XL, which keeps some important parameters in higher precision, a basic NVFP4 quant seems to do a lot more damage to the model unless it is carefully calibrated.

I would recommend using llama-server if you're just on a single Spark. You get access to dynamic quants like that more easily, the performance is not that different from vLLM most of the time these days, and it is much faster and easier to switch between models.

As far as intelligence goes, Qwen3.6-27B is much smarter than the 35B-A3B model, but that's also not the sort of thing to argue with an AI model about in the first place. Just open a new chat and try again.

Gemma-4-31B is not as good at agentic use cases as Qwen3.6-27B, but it is a fairly balanced model overall, and worth trying out too. Its MTP can nearly triple the performance of the model, where the benefits of MTP or Eagle seem more limited for Qwen3.6-27B in my testing, maybe doubling the speed.

New comment by coder543 in "Qwen 3.6 27B is the sweet spot for local development"

coder543 — Mon, 29 Jun 2026 23:01:28 +0000

> The systems but old but I’m seeing 11tks 27b, 15tks 35b MoE

If that's accurate, then you must be doing something wrong/weird. On a single RTX 3090, I'm seeing substantially higher performance. Dual GPU won't necessarily give a ton of performance improvement, but it shouldn't hurt performance.

With llama-bench, I just measured Qwen3.6-27B at 41 tok/s and Qwen3.6-35B-A3B at 153 tok/s on one RTX 3090. (Those results are without MTP. With MTP, I'm seeing about 65 to 70 tok/s for Qwen3.7-27B.)

I'm using the unsloth UD-Q4_K_XL quant. If you're using bf16 for some reason, that could explain the low performance and inability to have enough context despite having 48GB of VRAM, I guess, but... don't do that.

New comment by coder543 in "Qwen 3.6 27B is the sweet spot for local development"

coder543 — Mon, 29 Jun 2026 22:57:54 +0000

> For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4

Are you running with MTP enabled? I have seen some people on M5 hardware report 20+ t/s on Qwen3.6-27B using MTP... and I think that was a regular M5, not even M5 Pro.

New comment by coder543 in "Previewing GPT‑5.6 Sol: a next-generation model"

coder543 — Fri, 26 Jun 2026 19:31:12 +0000

5.5 Pro is $30 in / $180 out: https://developers.openai.com/api/docs/pricing

I think you meant 5.5.

I agree it is probably the same size model. It's probably exactly built on top of 5.5, just with more training, or else they would have bumped the version number to 6.

New comment by coder543 in "OpenAI unveils its first custom chip, built by Broadcom"

coder543 — Thu, 25 Jun 2026 15:12:52 +0000

EDIT: It's just not even worth arguing this point, so deleting my original, much longer comment. Abstract taxonomies can claim that Taalas is CIM, but this entirely and utterly misses the point, and misses what makes Taalas' approach special. If you told a room full of chip architects to go build "CIM for AI", they would not build a Taalas-like totally specialized chip, therefore it is not sufficient, and just muddies the conversation from my point of view. People have been doing "CIM" for decades and yet I've never seen anyone build a totally specialized chip at the scale of Taalas. And yes, you can (in theory) build an analog version of any computer, so of course you can build analog CIM, but "analog compute" is not inherently CIM, so conflating the two is just confusing.

New comment by coder543 in "OpenAI unveils its first custom chip, built by Broadcom"

coder543 — Thu, 25 Jun 2026 11:15:21 +0000

CIM does not bake the weights into silicon. The level of optimization that you can do down to the last transistor when the weights are fixed is on an entirely different level than CIM where you still need general purpose ALUs all over the place.

New comment by coder543 in "OpenAI unveils its first custom chip, built by Broadcom"

coder543 — Thu, 25 Jun 2026 02:51:08 +0000

Yes, I’m focused on the topic at hand that the person I replied to was also talking about.

The person I replied to was acting as if Taalas was ancient history. I was pointing out it has only been a few months.

New comment by coder543 in "OpenAI unveils its first custom chip, built by Broadcom"

coder543 — Thu, 25 Jun 2026 02:19:21 +0000

> It's odd to me that I haven't heard anything about this approach since.

It has only been four months since they unveiled their first prototype. I don't understand your confusion. Chip development does not happen overnight...?

Their initial blog post laid out a roadmap, so theoretically they should have another thing to demonstrate this summer.

New comment by coder543 in "OpenAI unveils its first custom chip, built by Broadcom"

coder543 — Thu, 25 Jun 2026 02:16:25 +0000

Taalas' first chip is for a Llama 3.1 8B quant, not a 3.1B parameter model, to clarify.

New comment by coder543 in "Elevated error rate across multiple models"

coder543 — Tue, 23 Jun 2026 18:34:59 +0000

Yeah... one of the relevant issues: https://github.com/openai/codex/issues/11940#issuecomment-45...

You would think they would support their own GPT-OSS model, but, not really anymore. I wish they would release a GPT-OSS 2, but this doesn't fill me with confidence.

New comment by coder543 in "Elevated error rate across multiple models"

coder543 — Tue, 23 Jun 2026 17:52:32 +0000

Well, the reason is simple: over the past several months, it has become very difficult to use Codex with non-OpenAI models. They removed the old edit tool that didn't require OpenAI's free form tool calling (that no other LLM host supports), they are adding tools to every request of a type that break most LLM hosts unless you use a proxy to filter them out, they add a "developer" role to some messages which breaks some chat templates, etc.

If someone wanted to fork Codex and make a community-maintained version that supports third party models, that would be great, because I liked Codex better than OpenCode for the most part.

Maybe you've found workarounds. Maybe you're using an old version of Codex. Maybe you have your own soft fork. I don't know. But I used to be able to use Codex with self-hosted models, and I gave up on that about a month ago as they kept breaking that.

New comment by coder543 in "Will It Mythos?"

coder543 — Tue, 23 Jun 2026 17:39:40 +0000

> I have come to consider Gemma 4 31b the best model I can self-host

I'm confused. Your own results show that Gemma 4 26B A4B and Qwen3.6-27B did better in these tests?

I really like Gemma 4 31B, especially with how exceptionally good its MTP drafter is, but it is absurdly weak at tool calling and instruction following in my testing, and its smaller siblings are even worse at this. If the system prompt says to do something, Gemma 4 31B will very often ignore that entirely. It will also make fewer tool calls than were needed to solve a problem, so then it fails. The Qwen3.6 series is much, much more reliable for carrying out instructions and doing agentic tasks in my testing, although they can get stuck in loops.

There is a lot of potential in the Gemma 4 series, but I think Google needs to release a Gemma 4.1 update to polish the rough edges. Unfortunately, if Gemma 3's lifecycle is any indication, Google won't release a true revision of the Gemma 4 models, even if they release a bunch of specialized research models based on Gemma 4 over the next year.

New comment by coder543 in "Apertus – Open Foundation Model for Sovereign AI"

coder543 — Sun, 21 Jun 2026 22:32:49 +0000

Is a recipe useful if no one likes it?

There are equally open, much more useful models out there: https://artificialanalysis.ai/?models=nvidia-nemotron-3-ultr...

New comment by coder543 in "GLM-5.2 is the new leading open weights model on Artificial Analysis"

coder543 — Wed, 17 Jun 2026 18:15:31 +0000

Claude Sonnet 4.6 identified itself as DeepSeek repeatedly: https://www.reddit.com/r/DeepSeek/comments/1rd5jw7/claude_so...

I tested this myself a few months ago, and confirmed that it was really happening.

LLMs don't know who they are unless the system prompt tells them, and as all of them are trained on model responses that exist on the web that end up being scraped, the weights may predict a certain incorrect response. LLMs have no ability to introspect, and do not know anything about themselves, so they will hallucinate in response to that question unless they are carefully trained on that exact, pointless question.