Hacker News: lambda

New comment by lambda in "OpenAI and Hugging Face address security incident during model evaluation"

lambda — Tue, 21 Jul 2026 21:13:54 +0000

They do maintain the Transformers library which is pretty much the core library for how you interact with LLM models in the open source world. So while they weren't using a model they've trained, they were a part of making just about all of the open models (maybe excluding OpenAI and Google's, I wouldn't be surprised if they have their own frameworks that predate the Transformers library).

New comment by lambda in "Gemini 3.6 Flash, 3.5 Flash-Lite, and 3.5 Flash Cyber"

lambda — Tue, 21 Jul 2026 16:06:10 +0000

3.6 Flash scores exactly the same as 3.5 Flash on the Artificial Analysis index. Better on some tasks, worse on others. Mostly within what I'd consider the noise window. Looks pretty much indistinguishable from 3.5 Flash, at least on these benchmarks: https://artificialanalysis.ai/models/gemini-3-6-flash

New comment by lambda in "China’s open-weights AI strategy is winning"

lambda — Mon, 20 Jul 2026 18:59:07 +0000

> I think that in 10-15 years, we are going to have consumer PCs (and phones!) running models doing pretty much anything that frontier models can do right now

10-15 years? The current rate is closer to 10-15 months.

15 months ago, the top model on the Artificial Analysis index was GPT-o3. It scores 30 on the Artificial Analysis index.

Today, you can easily run Qwen 3.6 27B on a variety of consumer hardware. It scores 37 on that index.

Here are a number of open weights models that you can run locally compared with the frontier class models from 7 to 15 months ago: https://artificialanalysis.ai/?models=o3%2Co3-pro%2Cclaude-4...

I've run all of these models on my laptop (Strix Halo, 128 GiB of unified RAM); the bigger ones, like MiniMax M2.7 and DeepSeek V4 Flash, need to be done at fairly aggressive quants that will certainly lose some performance and not quite hit the performance of the unquantized models. But still, it's definitely the case that you can run models that are competitive with the frontier models of 10-15 months ago on consumer laptops.

Heck, MiniCPM5-2B was just announced (though the weights haven't yet been released for independent confirmation): a 2 billion parameter model, small enough to run on your phone, that according to their benchmarks has performance competitive with GPT-4o, a frontier class model from 2024.

https://nitter.net/i/status/2079088670804767114

So that's around 1 year for frontier to consumer device class, 2 years from frontier to phone.

Now, this kind of rate won't necessarily keep up; it's possible that local models will hit a performance ceiling before frontier models do. There's only so much information you can cram into a certain number of bytes, and the AI boom is causing hardware prices to skyrocket, which is keeping consumer hardware from advancing quite as fast as it had been.

New comment by lambda in "Fable 5 vs. GPT-5.6 Sol on an NP-Hard Problem: Does /goal help?"

lambda — Sat, 18 Jul 2026 16:53:58 +0000

The article describes it.

Both Codex and Claude Code have it, but they work slightly differently.

Claude Code uses Haiku to read through the transcript and decide if the goal has been completed. If not, Haiku injects a prompt back to the main model to indicate what still needs to be done.

In Codex, instead it's a tool available to the main model, plus some part of the surrounding harness that will re-prompt it if the tool calls haven't yet indicated that the goal is complete.

The issue that they are trying to solve is that sometimes models will stop before they have actually fully completed whatever task they were given; attention isn't perfect, and sometimes they'll complete part of it but not the whole task. Rather than making the user come back and re-prompt to keep going, they add a way to automatically do a bit more nudging to try to get the model to finish the task.
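To make the nudging idea concrete, here's a rough sketch of that kind of loop. This is not how either Claude Code or Codex actually implements it; `run_with_goal`, `main_model`, `judge`, and `max_nudges` are all names I made up, and the two `fake_*` functions just stand in for real model calls. It's closest in spirit to the "separate judge reads the transcript" approach.

```python
def run_with_goal(goal, main_model, judge, max_nudges=3):
    """Call main_model, re-prompting until judge reports nothing is left.

    main_model(transcript) -> str: the model's next reply (hypothetical).
    judge(goal, transcript) -> str: what remains to do, "" when complete.
    """
    transcript = ["user: " + goal]
    for attempt in range(max_nudges + 1):
        transcript.append("assistant: " + main_model(transcript))
        remaining = judge(goal, transcript)
        if not remaining or attempt == max_nudges:
            break  # goal met, or we've run out of nudges
        # Inject the nudge the user would otherwise have to type themselves.
        transcript.append("user: Still to do: " + remaining)
    return transcript


def fake_model(transcript):
    # Stand-in model that only ever finishes one step per turn.
    done = sum(1 for line in transcript if line.startswith("assistant:"))
    return "finished step %d" % (done + 1)


def fake_judge(goal, transcript):
    # Stand-in judge: the goal counts as complete after two assistant turns.
    done = sum(1 for line in transcript if line.startswith("assistant:"))
    return "" if done >= 2 else "step %d" % (done + 1)


if __name__ == "__main__":
    for line in run_with_goal("do two steps", fake_model, fake_judge):
        print(line)
```

The `max_nudges` cap is there so a judge that is never satisfied can't keep the loop running forever.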

New comment by lambda in "Kimi K3, and what we can still learn from the pelican benchmark"

lambda — Fri, 17 Jul 2026 18:57:39 +0000

I've tried doing a loop of rendering the SVG and then tweaking based on that, with local models (so, not nearly as strong). It wasn't very successful; it would mostly report that the image looked great and didn't need any tweaks. Maybe I should try it again, there have been some newer models since I first tried it. And yeah, maybe worth trying with bigger models. But I have found that models aren't necessarily the best at visual reasoning and review, even with a vision loop. Their lack of visual reasoning is part of why they still have trouble with things like ARC-AGI-3.

New comment by lambda in "Kimi K3: Open Frontier Intelligence"

lambda — Thu, 16 Jul 2026 19:17:07 +0000

You can always ask them to draw something else, as a way to avoid any possible pelican related data contamination; given how popular the pelican test is, I'm sure there's some pelican SVG drawing in the training sets of at least some of these models by now. For instance, you could ask for an SVG drawing of a cyborg bear riding a rocket powered unicycle.

It's a silly fun little benchmark, and because Simon's been doing it for so long, you have a lot of examples over the years to compare. But you can always come up with and run your own test with other drawings.

New comment by lambda in "How to stop Claude from saying load-bearing"

lambda — Tue, 14 Jul 2026 15:30:17 +0000

It's really frustrating, because now when I want to write something like a "not X but Y" or "you're absolutely right," I have to stop and decide if I want to self-censor to avoid sounding like a bot.

Sometimes those constructs are actually useful, but man has their overuse really killed them!

New comment by lambda in "How to stop Claude from saying load-bearing"

lambda — Tue, 14 Jul 2026 15:28:43 +0000

It drives us crazy because everyone is using the same 2-3 different machines. So rather than each person having their own unique speaking style, the whole world (or, everyone that publishes direct LLM output) is now speaking in the same couple of styles.

And these machines all tend to converge on very similar styles; they have huge amounts of overlap in training data (much of it being already obnoxious internet marketing), they frequently train on each other's outputs, and the RLHF process has a tendency to emphasize certain kinds of "cheap win" styles of speech.

New comment by lambda in "Demis Hassabis has a plan to harness AI safely"

lambda — Tue, 14 Jul 2026 15:19:12 +0000

He is saying that weaker models, as measured by a benchmark to distinguish "frontier" models, would be exempted. So an academic lab or startup that isn't yet producing frontier models would be exempted, but once it crossed some benchmark based threshold it would be subject to this kind of oversight.

Of course, right now you've got benchmaxxing going on; some companies are specifically targeting benchmarks to appear stronger than they are on a wider range of tasks. Now you might see bench sandbagging, specifically looking weaker on certain benchmarks to avoid regulatory oversight.

For instance, one way I could see this going for open models is to release them undercooked; stop the RLVR process a bit early, leaving them a bit weaker on tool calls and agentic performance, but also release the RLVR environment so people can finish the process themselves.

In fact, this is fairly close to what Nvidia is already doing: the Nemotron 3 models are somewhat undercooked, but they are releasing their full training pipeline to encourage people to use these models as a base for further training, which will generally be done on Nvidia hardware.

New comment by lambda in "QuadRF can spot drones and see WiFi through my wall"

lambda — Fri, 10 Jul 2026 17:44:55 +0000

Yeah, Kraken SDR removed some functionality due to these concerns, if I remember correctly.

Odd, because export controls don't generally apply to published material (like open source software), but maybe they were worried that because they were also selling the hardware they could have issues due to the combo being export controlled.

Ah, found discussion of what exactly it was they pulled, it was the passive radar code: https://www.reddit.com/r/RTLSDR/comments/yu9rei/krakenrf_pul...

And indeed, they confirmed that they believe the open source software should be OK, but they had concerns because they also sell the compatible hardware: https://nitter.net/rtlsdrblog/status/1591657740229046274

New comment by lambda in "QuadRF can spot drones and see WiFi through my wall"

lambda — Fri, 10 Jul 2026 17:35:04 +0000

But there are already benchtop or handheld signal analyzers for that purpose.

This seems more like a tool for checking across entire large assemblies like an entire building, car, aircraft, etc, for unknown sources. If you have an individual discrete device that you're already testing, just using traditional instrumentation seems reasonable, but on a large, complex assembly, I can see it being useful. Also useful for things like detecting if a particular antenna is working without actually going up there to measure near it; if you have a MIMO setup with multiple antennas, this might make it easier to check if all of them are working correctly when mounted in inconvenient areas.

New comment by lambda in "QuadRF can spot drones and see WiFi through my wall"

lambda — Fri, 10 Jul 2026 17:32:24 +0000

I think that for a single device, this probably wouldn't help much over just having a more traditional signal analyzer, either benchtop or handheld. If you know what you're testing, just using a signal analyzer around it will give you a good first pass picture of emissions, and probably be much more informative and precise than this.

This seems more useful for finding unknown or hidden RF sources, for instance looking through an entire building, or maybe a whole complex assembly like a car or aircraft.

New comment by lambda in "Kani: A Model Checker for Rust"

lambda — Mon, 06 Jul 2026 19:56:27 +0000

This is really weird. Someone creating 4 new accounts just to call this project fraudulent because it can't statically analyze every property you'd like? Does this person have a personal grudge against the author, or something?

New comment by lambda in "Qwen 3.6 27B is the sweet spot for local development"

lambda — Thu, 02 Jul 2026 15:29:53 +0000

Tried it out. I'm comparing against Qwen 3.5 122B-A10B, so a much larger model. It gets some correct, but Qwen 3.5 122B-A10B has done much better. Gemma 4 12B even hallucinated some species in trying to identify a plant, and the other guesses it made weren't all that close, while Qwen 3.5 122B-A10B got it right on the first try.

12B did get one right that 31B got wrong. I'd have to do a much more thorough eval to really compare, just a few anecdotal observations and it's kind of hard to really distinguish, but from the samples I've seen, Qwen 3.5 122B-A10B is doing much better at this task.

The 12B architecture definitely is interesting, and it may punch above its weight due to this (though again, would really need to do proper evals to compare). But of the models I've tried, Qwen 3.5 122B-A10B really seems like the best for this kind of task.

New comment by lambda in "Qwen 3.6 27B is the sweet spot for local development"

lambda — Tue, 30 Jun 2026 14:54:44 +0000

I haven't run a proper eval, but I've been having better luck with Qwen models than Gemma on plant and animal identification using vision.

I do like Gemma for translation, however.

New comment by lambda in "Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?"

lambda — Sat, 20 Jun 2026 02:04:36 +0000

How could the harness fix this? It's the jinja template used by the inference engine to render the API requests into the raw text that gets tokenized and completed by the model. Unless you're using something like the raw completions API instead of the `/v1/chat/completions` API, and effectively applying the template yourself. In which case, you could also just modify the jinja template on your server.

Anyhow, I've heard mixed results on any method of supplying reasoning traces beyond the current turn to models not trained on them. For some models, I've heard that it works fine this way, for others I've heard it degrades performance. But I don't know of anyone who has any kind of reliable benchmark for how well this works.

New comment by lambda in "Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?"

lambda — Wed, 17 Jun 2026 14:57:53 +0000

Much more complex than that. Even if it does give you a speedup at certain tasks, is it worth the cost and risks? You go faster, but now you have more code that you don't understand and so won't be as good at maintaining. There's the energy use, the water use, the scrapers destroying the internet, the massive piles of slop, the hallucinations and bullshit, etc.

New comment by lambda in "Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?"

lambda — Wed, 17 Jun 2026 12:55:34 +0000

It means that even if it works for certain tasks, I think that the problems caused by use of LLMs outweigh their benefits. I think it's a bad idea to generate large piles of code that you don't understand, but due to competitive pressures, it's too tempting for people to pass up, leading to a world in which software is getting worse by the day, while pumping CO2 into the atmosphere and boiling scarce water supplies to do so, DDOSing websites to scrape the data, and polluting the internet with mountains of slop.

This isn't about using rice cookers or not, that's a personal choice for how you cook your food, and choosing to do so or not really only affects the person cooking and cleaning. A rice cooker probably uses a similar amount of energy as cooking it by hand, possibly even less.

But when people using LLMs are causing active harm, and are making it more difficult to collaborate on a team, it's a lot harder to accept that it's just a personal preference.

If you wanted to use the rice cooker analogy, imagine if rice cookers let you cook rice in just one minute. Faster, don't have to wait for the rice to be done, great! But in order to do so, you have to cook 50 pounds of rice, but throw out the majority of it, and use a thousand kilowatt hours of energy to do so. You'd better believe I'm going to be skeptical of everyone deciding that they suddenly have to use these 1-minute rice cookers that burn so much energy and generate so much waste.

New comment by lambda in "Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?"

lambda — Tue, 16 Jun 2026 23:51:05 +0000

Huh? There is a Claude 4 Opus. It was released about a year ago. It is retired by now; in fact, it was just retired yesterday: https://platform.claude.com/docs/en/about-claude/model-depre...

But it is still available on Google Vertex according to OpenRouter (though it's possible that info is just out of date, it's currently quoting 3tps which is unusably slow): https://openrouter.ai/anthropic/claude-opus-4

New comment by lambda in "Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?"

lambda — Tue, 16 Jun 2026 16:55:56 +0000

Not a harness issue. The harness (pi in my case) passes back the CoT for all previous turns.

The jinja template is what renders the openai-format request sent by the harness, into the actual string of text that will be tokenized and fed to the model. For models without preserve thinking support, the jinja template drops the reasoning from all but the current turn.

Here is the default jinja for Gemma 4: https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_...

    {#- Render reasoning/reasoning_content as thinking channel -#}
    {%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
    {%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
        {{- '<|channel>thought\n' + thinking_text + '\n' -}}
    {%- endif -%}

You see that it only preserves the thinking for indexes that are later than the last user message; thinking is only preserved for a single turn (which can include a lot of interleaved thinking and tool calls), once it goes back to the user and the user replies, it will replay the tool calls but not the thinking between them.
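If it helps, here's the same rule as a plain Python sketch, so you can see which messages keep their thinking. This is my own restatement of the condition in the excerpt, not code from the template; `thinking_kept` is a made-up name, and I'm assuming `last_user_idx` is just the index of the most recent user message.

```python
def thinking_kept(messages):
    """For each message, report whether its reasoning would be rendered.

    Mirrors the excerpt's condition: the message has reasoning, it comes
    after the last user message, and it carries tool calls.
    """
    # Assumption: last_user_idx is the index of the most recent user message.
    last_user_idx = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"),
        default=-1,
    )
    kept = []
    for i, m in enumerate(messages):
        thinking = m.get("reasoning") or m.get("reasoning_content")
        kept.append(bool(thinking and i > last_user_idx and m.get("tool_calls")))
    return kept


conversation = [
    {"role": "user", "content": "fix the bug"},
    {"role": "assistant", "reasoning": "look at the file", "tool_calls": ["read"]},
    {"role": "tool", "content": "file contents"},
    {"role": "assistant", "reasoning": "all done", "content": "fixed"},
    {"role": "user", "content": "now add a test"},
    {"role": "assistant", "reasoning": "need a test file", "tool_calls": ["write"]},
]

if __name__ == "__main__":
    # While the first turn is still in progress (no second user message yet):
    print(thinking_kept(conversation[:4]))
    # After the user replies, the first turn's thinking is no longer rendered:
    print(thinking_kept(conversation))
```

In the first call the tool-calling assistant message at index 1 keeps its thinking; in the second, only the new message at index 5 does.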

Here's Qwen 3.6 by comparison: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/chat_t...

        {%- if (preserve_thinking is defined and preserve_thinking is true) or (loop.index0 > ns.last_query_index) %}
            {{- '<|im_start|>' + message.role + '\n\n' + reasoning_content + '\n\n\n' + content }}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}

It additionally has a preserve_thinking flag that you can set. If that's set, it will include the thinking from all turns in the text passed to the model. But you do have to set that; it's not the default.
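Here's that branch restated as a small Python function, just to show the effect of the flag. It's a sketch, not the real template: `render_assistant` is a made-up name, the string literals are copied as they appear in the excerpt above, and `last_query_index` is passed in rather than computed.

```python
def render_assistant(message, index, last_query_index, preserve_thinking=False):
    """Render one assistant message the way the excerpt's if/else does."""
    reasoning = message.get("reasoning_content", "")
    content = message.get("content", "")
    if preserve_thinking or index > last_query_index:
        # Reasoning is included ahead of the visible content.
        return ("<|im_start|>" + message["role"] + "\n\n"
                + reasoning + "\n\n\n" + content)
    # Otherwise the reasoning is dropped and only the content is rendered.
    return "<|im_start|>" + message["role"] + "\n" + content


old_turn = {
    "role": "assistant",
    "reasoning_content": "earlier thinking",
    "content": "earlier answer",
}

if __name__ == "__main__":
    # An assistant message at index 1, with the latest user query at index 2.
    print(repr(render_assistant(old_turn, 1, 2)))
    print(repr(render_assistant(old_turn, 1, 2, preserve_thinking=True)))
```

With the default, the older turn renders without its reasoning; with `preserve_thinking=True` the reasoning comes back even though the message is before the last query.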

It's possible to modify the jinja file that you're using with a model. Some people do that with models that haven't been specifically trained for it, and report good results; but some report that because it wasn't trained for that, they get worse results if they include thinking from previous turns.

So for models like Gemma, you would have to modify the default jinja to enable this. For Qwen, you can just set the preserve_thinking flag to get this behavior; and apparently they have trained in this mode, so you get better results than with models that have not been trained this way.