<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: mnicky</title><link>https://news.ycombinator.com/user?id=mnicky</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 12 Apr 2026 09:56:29 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=mnicky" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by mnicky in "Small models also found the vulnerabilities that Mythos found"]]></title><description><![CDATA[
<p>Also, what costs $20,000 today can cost $2,000 next year. Or $20...<p>See e.g. <a href="https://epoch.ai/data-insights/llm-inference-price-trends/" rel="nofollow">https://epoch.ai/data-insights/llm-inference-price-trends/</a></p>
]]></description><pubDate>Sat, 11 Apr 2026 20:20:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=47733674</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=47733674</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47733674</guid></item><item><title><![CDATA[New comment by mnicky in "Gold overtakes U.S. Treasuries as the largest foreign reserve asset"]]></title><description><![CDATA[
<p>This might be unconstitutional?</p>
]]></description><pubDate>Sat, 04 Apr 2026 23:02:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=47644449</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=47644449</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47644449</guid></item><item><title><![CDATA[New comment by mnicky in "Gold overtakes U.S. Treasuries as the largest foreign reserve asset"]]></title><description><![CDATA[
<p>Averages tell you nothing about the average citizen.<p>Also, there are other measures, like inequality, healthcare costs, social security...</p>
]]></description><pubDate>Sat, 04 Apr 2026 22:46:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=47644327</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=47644327</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47644327</guid></item><item><title><![CDATA[New comment by mnicky in "GPT-5.4"]]></title><description><![CDATA[
<p>This observation makes sense, because all current models probably use some kind of sparse attention architecture.<p>So the closer two related pieces of information are to each other in the input context, the greater the chance that their relationship will be preserved.</p>
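The locality effect can be sketched with a generic sliding-window attention mask (a toy illustration of one common sparse scheme, not any particular model's actual architecture):

```python
# Sketch of a causal sliding-window (sparse) attention mask: each query
# token may only attend to keys within `window` positions behind it.
# Tokens farther apart than the window never interact directly in a layer,
# so relationships between nearby tokens are more reliably preserved.
def sliding_window_mask(seq_len, window):
    return [
        [abs(q - k) <= window and k <= q for k in range(seq_len)]
        for q in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=8, window=2)
print(mask[5][3])  # True: token 5 can attend to token 3 (distance 2)
print(mask[5][0])  # False: token 0 is outside the window
```

Real sparse schemes usually mix local windows with global or strided heads, but the locality bias is the same: pairs inside the window get a direct attention path, distant pairs don't.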
]]></description><pubDate>Fri, 06 Mar 2026 08:40:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=47272526</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=47272526</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47272526</guid></item><item><title><![CDATA[New comment by mnicky in "Cancel ChatGPT AI boycott surges after OpenAI pentagon military deal"]]></title><description><![CDATA[
<p>He's trying to make it sound that way, but in the legal domain, the devil lies in the details.<p>It seems the government wanted to use Claude for mass analysis of commercially obtained data on Americans, and Anthropic wouldn't let them (source: <a href="https://www.theatlantic.com/technology/2026/03/inside-anthropics-killer-robot-dispute-with-the-pentagon/686200/?gift=2iIN4YrefPjuvZ5d2Kh30zpPxOtZj8TuGGLnTN11Z-s" rel="nofollow">https://www.theatlantic.com/technology/2026/03/inside-anthro...</a> ).<p>The DoD kept asking for contract changes that would make the legalese at least somewhat more permissive, but Anthropic stood their ground.<p>Sam Altman will probably let them do it, while using language like "we have technical means of oversight and the same red lines as Anthropic". But in reality they will allow the DoD to do what Anthropic didn't.<p>See this for more information: <a href="https://www.lesswrong.com/posts/PBrggrw4mhgbksoYY/a-tale-of-three-contracts" rel="nofollow">https://www.lesswrong.com/posts/PBrggrw4mhgbksoYY/a-tale-of-...</a></p>
]]></description><pubDate>Wed, 04 Mar 2026 07:09:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=47244154</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=47244154</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47244154</guid></item><item><title><![CDATA[New comment by mnicky in "How I use Claude Code: Separation of planning and execution"]]></title><description><![CDATA[
<p>> Very often, after a correction, it will focus a lot on the correction itself making for weird-sounding/confusing statements in commit messages and comments.<p>I've experienced that too. Usually when I request a correction, I add something like "Include only production-level comments (not change notes)". Recently I also added a special instruction for this to CLAUDE.md.</p>
]]></description><pubDate>Sun, 22 Feb 2026 15:58:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=47112030</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=47112030</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47112030</guid></item><item><title><![CDATA[New comment by mnicky in "How I use Claude Code: Separation of planning and execution"]]></title><description><![CDATA[
<p>For a while now, Claude Code's plan mode has also written the plan to a file that you can presumably edit, etc. For me it's located in ~/.claude/plans/ and there's actually a whole history of plans there.<p>I sometimes reference some of them to build context, e.g. after a few unsuccessful attempts to implement something, so that Claude doesn't try the same thing again.</p>
]]></description><pubDate>Sun, 22 Feb 2026 15:53:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47111983</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=47111983</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47111983</guid></item><item><title><![CDATA[New comment by mnicky in "GPT‑5.3‑Codex‑Spark"]]></title><description><![CDATA[
<p>Can you compare it to Opus 4.6 with thinking disabled? It seems to have very impressive benchmark scores. Could also be pretty fast.</p>
]]></description><pubDate>Thu, 12 Feb 2026 20:29:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=46994670</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=46994670</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46994670</guid></item><item><title><![CDATA[New comment by mnicky in "GPT‑5.3‑Codex‑Spark"]]></title><description><![CDATA[
<p>> What am I missing?<p>Largest production capacity maybe?<p>Also, market demand will be so high that every player's chips will be sold out.</p>
]]></description><pubDate>Thu, 12 Feb 2026 20:26:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=46994621</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=46994621</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46994621</guid></item><item><title><![CDATA[New comment by mnicky in "Gemini 3 Deep Think"]]></title><description><![CDATA[
<p>Well, a fair comparison would be with GPT-5.x Pro, which is the same class of model as Gemini Deep Think.</p>
]]></description><pubDate>Thu, 12 Feb 2026 17:53:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=46992340</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=46992340</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46992340</guid></item><item><title><![CDATA[New comment by mnicky in "Gemini 3 Deep Think"]]></title><description><![CDATA[
<p>> can a sufficiently large non thinking model perform the same as a smaller thinking?<p>Models from Anthropic have always been excellent at this. See e.g. <a href="https://imgur.com/a/EwW9H6q" rel="nofollow">https://imgur.com/a/EwW9H6q</a> (top-left Opus 4.6 is without thinking).</p>
]]></description><pubDate>Thu, 12 Feb 2026 17:50:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=46992275</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=46992275</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46992275</guid></item><item><title><![CDATA[New comment by mnicky in "GLM-5: Targeting complex systems engineering and long-horizon agentic tasks"]]></title><description><![CDATA[
<p>You can think of tokens as a rough proxy for thinking space; at least reasoning tokens work like this.<p>Dollar and watt figures are not public, and time has confounders like hardware.</p>
]]></description><pubDate>Wed, 11 Feb 2026 19:50:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=46979935</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=46979935</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46979935</guid></item><item><title><![CDATA[New comment by mnicky in "Claude Code is being dumbed down?"]]></title><description><![CDATA[
<p>At least now we also have a tracker: <a href="https://marginlab.ai/trackers/claude-code/" rel="nofollow">https://marginlab.ai/trackers/claude-code/</a></p>
]]></description><pubDate>Wed, 11 Feb 2026 18:47:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=46979044</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=46979044</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46979044</guid></item><item><title><![CDATA[New comment by mnicky in "GLM-5: Targeting complex systems engineering and long-horizon agentic tasks"]]></title><description><![CDATA[
<p>What I haven't seen discussed anywhere so far is how big a lead Anthropic seems to have in intelligence per output token, e.g. if you look at [1].<p>We already know that intelligence scales with the log of tokens used for reasoning, but Anthropic seems to have much more powerful non-reasoning models than its competitors.<p>I read somewhere that they have a policy of not advancing capabilities too much, so could it be that they are sandbagging and releasing models with artificially capped reasoning to be at a similar level to their competitors?<p>How do you read this?<p>[1] <a href="https://imgur.com/a/EwW9H6q" rel="nofollow">https://imgur.com/a/EwW9H6q</a></p>
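For what it's worth, the log-scaling claim above can be made concrete with a toy curve (the constants are made up for illustration, not fitted to any real benchmark):

```python
import math

# Toy model of "intelligence scales with the log of reasoning tokens":
# score = a + b * log2(tokens), with illustrative constants.
def score(tokens, a=40.0, b=5.0):
    return a + b * math.log2(tokens)

# Every doubling of tokens buys the same +b points, so gains get
# exponentially more expensive: the first doubling below costs 1k extra
# tokens, the second costs 16k, for the identical score improvement.
gain_1k_to_2k = score(2048) - score(1024)
gain_16k_to_32k = score(32768) - score(16384)
print(gain_1k_to_2k, gain_16k_to_32k)  # 5.0 5.0
```

Under such a curve, a model with a higher non-reasoning baseline (a larger `a`) matches a competitor's reasoning-heavy output at a fraction of the token spend, which is one way to read the per-token lead in [1].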
]]></description><pubDate>Wed, 11 Feb 2026 18:44:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=46978980</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=46978980</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46978980</guid></item><item><title><![CDATA[New comment by mnicky in "GLM-5: Targeting complex systems engineering and long-horizon agentic tasks"]]></title><description><![CDATA[
<p>> I think GPT-5.3-Codex was a disappointment<p>Care to elaborate more?</p>
]]></description><pubDate>Wed, 11 Feb 2026 17:33:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=46977998</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=46977998</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46977998</guid></item><item><title><![CDATA[New comment by mnicky in "Frontier AI agents violate ethical constraints 30–50% of time, pressured by KPIs"]]></title><description><![CDATA[
<p>Evaluation then depends on your specific cost-benefit tradeoff between accuracy and hallucinations.<p>For some tasks, where detecting hallucinations is easy, I can see it being beneficial.<p>In the general case, not so much...</p>
]]></description><pubDate>Wed, 11 Feb 2026 09:03:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=46972608</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=46972608</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46972608</guid></item><item><title><![CDATA[New comment by mnicky in "Frontier AI agents violate ethical constraints 30–50% of time, pressured by KPIs"]]></title><description><![CDATA[
<p>If you recall the context/situation at the time it was released, that might be close to the truth. Google desperately needed to show competency in improving Gemini's capabilities, and other considerations could have been assigned lower priority.<p>So they could have paid a price in “model welfare” and released an LLM very eager to deliver.<p>It also shows in the AA-Omniscience Hallucination Rate benchmark, where Gemini scores 88%, the worst among frontier models.</p>
]]></description><pubDate>Wed, 11 Feb 2026 08:49:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=46972516</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=46972516</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46972516</guid></item><item><title><![CDATA[New comment by mnicky in "Coding agents have replaced every framework I used"]]></title><description><![CDATA[
<p>Critically, they will also enable faster future migration to a framework in case it proves useful.</p>
]]></description><pubDate>Sat, 07 Feb 2026 16:19:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=46924998</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=46924998</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46924998</guid></item><item><title><![CDATA[New comment by mnicky in "Claude Opus 4.6"]]></title><description><![CDATA[
<p>On my tasks (mostly data science), Opus has a significantly lower probability of making stupid mistakes than Sonnet.<p>I'd still appreciate more intelligence than Opus 4.5 offers, so I'm looking forward to trying 4.6.</p>
]]></description><pubDate>Thu, 05 Feb 2026 22:49:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=46906533</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=46906533</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46906533</guid></item><item><title><![CDATA[New comment by mnicky in "AISLE’s autonomous analyzer found all CVEs in the January OpenSSL release"]]></title><description><![CDATA[
<p>To your second point - why would you need this? There are _plenty_ of previously found CVEs to train on.<p>Also, I don't think the three letter agencies would share one of the most prized assets they have...</p>
]]></description><pubDate>Wed, 28 Jan 2026 08:44:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=46792663</link><dc:creator>mnicky</dc:creator><comments>https://news.ycombinator.com/item?id=46792663</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46792663</guid></item></channel></rss>