<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: versteegen</title><link>https://news.ycombinator.com/user?id=versteegen</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 15 Apr 2026 02:38:23 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=versteegen" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by versteegen in "Day 1 of ARC-AGI-3"]]></title><description><![CDATA[
<p>Where do you see that? I only skimmed the prompts, but I don't see any aspects of any of the games explained in there. There are a few hints which are legitimate prior knowledge about games in general, though some look too inflexible to me. Prior knowledge ("Core priors") is a critical requirement of the ARC series; read the reports.</p>
]]></description><pubDate>Fri, 27 Mar 2026 12:26:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=47541876</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47541876</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47541876</guid></item><item><title><![CDATA[New comment by versteegen in "Day 1 of ARC-AGI-3"]]></title><description><![CDATA[
<p>The dataset miscomparison is a big problem. The prompt is super specific to ARC-AGI-3, which is perfectly fine to do, but skimming it I saw nothing that appears specific to the 25 games in the dataset. Especially considering they've only had one day for overfitting. The leakage could be quite subtle, though.</p>
]]></description><pubDate>Fri, 27 Mar 2026 12:24:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=47541852</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47541852</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47541852</guid></item><item><title><![CDATA[New comment by versteegen in "Day 1 of ARC-AGI-3"]]></title><description><![CDATA[
<p>...Their agent is called "Agentica ARC-AGI-3 agent for Opus 4.6 (120k) High".<p>Yes, it's unfair to compare results for the 25 (easier) public games against scores for the 55 semi-private games (scores for which are taken from <a href="https://arcprize.org/leaderboard">https://arcprize.org/leaderboard</a>).<p>But you're wrong to say that a custom harness invalidates the result. Yes, the official "ARC verified" scoreboard <i>for frontier LLMs</i> requires (<a href="https://arcprize.org/policy">https://arcprize.org/policy</a>):<p>>  using extremely generic and minimal LLM testing prompts, no client-side "harnesses", no hand-crafted tools, and no tailored model configuration<p>but these are limitations placed in order to compare LLMs from frontier labs on equal footing, <i>not</i> limitations that apply to submissions in general. It's not as if a solution to ARC-AGI-3 must involve training a custom LLM! This Agentica harness is a completely legitimate approach to ARC-AGI-3, similar to J. Berman's for ARC-AGI-1/2, for example.</p>
]]></description><pubDate>Fri, 27 Mar 2026 12:03:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=47541681</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47541681</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47541681</guid></item><item><title><![CDATA[New comment by versteegen in "ARC-AGI-3"]]></title><description><![CDATA[
<p>> An AI that can only perform at the average human level is useless unless it can be trained for the job like humans can.<p>Yes, if you want skilled labour. But that's not at all what ARC-AGI attempts to test for: it's testing for general intelligence as possessed by anyone without a mental incapacity.</p>
]]></description><pubDate>Thu, 26 Mar 2026 00:58:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=47525470</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47525470</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47525470</guid></item><item><title><![CDATA[New comment by versteegen in "No, it doesn't cost Anthropic $5k per Claude Code user"]]></title><description><![CDATA[
<p>> Aren't they losing money on the retail API pricing, too?<p>No, they aren't, and probably neither is anyone else offering API pricing. And Anthropic's API margins may be higher than anyone else's.<p>For example, DeepSeek released numbers showing that R1 was served at approximately "a cost profit margin of 545%" (meaning roughly 84% of revenue is profit); see my comment <a href="https://news.ycombinator.com/item?id=46663852">https://news.ycombinator.com/item?id=46663852</a></p>
]]></description><pubDate>Tue, 10 Mar 2026 06:07:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=47319564</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47319564</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47319564</guid></item><item><title><![CDATA[New comment by versteegen in "No, it doesn't cost Anthropic $5k per Claude Code user"]]></title><description><![CDATA[
<p>Ed Zitron made that claim (in particular here: [1]). In the same article he admits he's not a programmer and had to ask someone else to try out Claude Code and ccusage for him. He doesn't have any understanding of how LLMs or caching work. But he's prominent because he's received leaked financial details for Anthropic and OpenAI, e.g. [2]<p>[1] <a href="https://www.wheresyoured.at/anthropic-is-bleeding-out/" rel="nofollow">https://www.wheresyoured.at/anthropic-is-bleeding-out/</a>
[2] <a href="https://www.wheresyoured.at/costs/" rel="nofollow">https://www.wheresyoured.at/costs/</a></p>
]]></description><pubDate>Tue, 10 Mar 2026 06:04:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=47319551</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47319551</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47319551</guid></item><item><title><![CDATA[New comment by versteegen in "10% of Firefox crashes are caused by bitflips"]]></title><description><![CDATA[
<p>I'm surprised "faulty PSU" is not on GP's list of common problems. Almost every unstable computer I've ever encountered has been due to either a dying PSU (not an under-specced one) or dying power-conversion capacitors on the motherboard.</p>
]]></description><pubDate>Fri, 06 Mar 2026 09:56:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47273002</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47273002</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47273002</guid></item><item><title><![CDATA[New comment by versteegen in "Claude's Cycles [pdf]"]]></title><description><![CDATA[
<p>AFAICT, Claude was not asked to prove its algorithm works for all odd n, but was instead told to move on to even n.</p>
]]></description><pubDate>Wed, 04 Mar 2026 13:02:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=47246828</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47246828</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47246828</guid></item><item><title><![CDATA[New comment by versteegen in "GPT‑5.3 Instant"]]></title><description><![CDATA[
<p>> Gemini 2.5 Pro's reasoning traces (before they nerfed them) were a good example. The deep technical analysis, and then the human-friendly version in the final output. But I found their reasoning more readable than the final output!<p>They were also sometimes more useful: you could see whether it reasoned its way to an answer, or used faulty reasoning, or if it was just contextual recall. Huge shame they replaced them with garbage (though a bit better now).<p>> the language is surprisingly offputting. I don't know if it got worse<p>I'm pretty sure it did.</p>
]]></description><pubDate>Wed, 04 Mar 2026 10:45:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=47245688</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47245688</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47245688</guid></item><item><title><![CDATA[New comment by versteegen in "Show HN: I built a site where you hire yourself instead of applying for jobs"]]></title><description><![CDATA[
<p>Yes, .wow also.</p>
]]></description><pubDate>Sat, 28 Feb 2026 00:28:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=47188213</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47188213</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47188213</guid></item><item><title><![CDATA[New comment by versteegen in "Statement from Dario Amodei on our discussions with the Department of War"]]></title><description><![CDATA[
<p>That misses my point: the evidence is the extensive argumentation provided for why it reduces risk. To quote Karnofsky:<p>> I wish people simply evaluated whether the changes seem good on the merits, without starting from a strong presumption that the mere fact of changes is either a bad thing or a fine thing. It should be hard to change good policies for bad reasons, not hard to change all policies for any reason.</p>
]]></description><pubDate>Fri, 27 Feb 2026 23:51:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=47187750</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47187750</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47187750</guid></item><item><title><![CDATA[New comment by versteegen in "Julia: Performance Tips"]]></title><description><![CDATA[
<p>Yes. And I did port my GUI layer to CImGui.jl. The rest is pretty intertwined with Makie, so I haven't ported it yet. The Makie version does look better than ImGui, though.</p>
]]></description><pubDate>Fri, 27 Feb 2026 23:47:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=47187716</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47187716</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47187716</guid></item><item><title><![CDATA[New comment by versteegen in "Julia: Performance Tips"]]></title><description><![CDATA[
<p>I recently used Makie to create an interactive tool for inspecting nodes of a search graph (dragging, hiding, expanding edges, custom graph layout), with floating windows of data and buttons. Yes, it's great for interactive plots (you can keep using the REPL to manipulate the plot, no freezing); yes, Observables and GridLayout are great; and I was very impressed with Makie's plotting abilities, from making the basics easy to the extremely advanced. But no, it was the wrong tool. Makie doesn't really do floating windows (subplots), so I had to jump through hoops to build my own floating-window system, which uses GridLayout for the GUI widgets inside each window. I did get it all working nearly flawlessly in the end, but I should probably have used a Julia Dear ImGui wrapper instead: near-instant start time!</p>
]]></description><pubDate>Fri, 27 Feb 2026 11:38:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=47179360</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47179360</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47179360</guid></item><item><title><![CDATA[New comment by versteegen in "Statement from Dario Amodei on our discussions with the Department of War"]]></title><description><![CDATA[
<p>> They pragmatically changed their views of safety just recently, so those values for which they would burn at the stake are very fluid.<p>Yes, it was a pragmatic change; no, it was not a change in their values. The commentary here on HN about Anthropic's RSP change was completely off the mark. They "think these changes are the right thing for reducing AI risk, both from Anthropic and from other companies if they make similar changes", as stated in this detailed discussion by Holden Karnofsky, who takes "significant responsibility for this change":<p><a href="https://www.lesswrong.com/posts/HzKuzrKfaDJvQqmjh/responsible-scaling-policy-v3" rel="nofollow">https://www.lesswrong.com/posts/HzKuzrKfaDJvQqmjh/responsibl...</a><p>> I strongly think today’s environment does not fit the “prisoner’s dilemma” model. In today’s environment, I think there are companies not terribly far behind the frontier that would see any unilateral pause or slowdown as an opportunity rather than a warning.<p>> What I didn’t expect was that RSPs (at least in Anthropic’s case) would come to be seen as hard unilateral commitments (“escape clauses” notwithstanding) that would be very difficult to iterate on.</p>
]]></description><pubDate>Fri, 27 Feb 2026 08:09:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=47177936</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47177936</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47177936</guid></item><item><title><![CDATA[New comment by versteegen in "Claude Sonnet 4.6"]]></title><description><![CDATA[
<p>Sure, depending on the particular product, having control and direct local access to the data could be desirable or a deal breaker. But for this Clickup integration that's not so important to us (we can duplicate data where necessary), and still using Clickup lets us keep all the other features we get via the web app.<p>(The emacs mode includes an MCP server)</p>
]]></description><pubDate>Thu, 19 Feb 2026 14:54:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=47074429</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47074429</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47074429</guid></item><item><title><![CDATA[New comment by versteegen in "Claude Sonnet 4.6"]]></title><description><![CDATA[
<p>I'd agree that this effect is probably mainly due to architectural parameters such as the number and dimensions of attention heads and the hidden dimension, but not so much to model size (number of parameters) or to less training.<p>I saw something about Sonnet 4.6 having had a greatly increased amount of RL training over 4.5.</p>
]]></description><pubDate>Wed, 18 Feb 2026 15:30:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=47062087</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47062087</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47062087</guid></item><item><title><![CDATA[New comment by versteegen in "Claude Sonnet 4.6"]]></title><description><![CDATA[
<p>Agreed, and here's a real example from a tiny startup: Clickup's web app is too damn slow and bloated with features and UI, so we created emacs modes to access and edit Clickup workspaces (lists, kanban boards, docs, etc) via the API. Just some limited parts we care about. I was initially skeptical that it would work well or at all, but wow, it really has significantly improved the usefulness of Clickup by removing barriers.</p>
]]></description><pubDate>Wed, 18 Feb 2026 15:13:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=47061839</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=47061839</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47061839</guid></item><item><title><![CDATA[New comment by versteegen in "GPT-5.3-Codex"]]></title><description><![CDATA[
<p>> The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly<p>Not very accurate. For each of ARC-AGI-1 and ARC-AGI-2 there is a training set and three eval sets: public, semi-private, and private. The ARC foundation runs frontier LLMs on the semi-private set, and the labs give them pre-release API access so they can report release-day evals. They mostly don't allow anyone else to access the semi-private set (except for live Kaggle leaderboards, which use it), so you see independent researchers report on the public eval set instead, often very dubiously. The private set is for Kaggle competitions only; no frontier LLM evals are possible on it.<p>(ARC-AGI-1 results are now largely useless because most of its eval tasks became the ARC-2 training set. However, some labs have said they don't train LLMs on the training sets anyway.)</p>
]]></description><pubDate>Sun, 08 Feb 2026 10:30:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=46933061</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=46933061</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46933061</guid></item><item><title><![CDATA[New comment by versteegen in "OpenAI could reportedly run out of cash by mid-2027"]]></title><description><![CDATA[
<p>Paid/API LLM inference is profitable, though. For example, DeepSeek R1 had "a cost profit margin of 545%" [1] (ignoring free users, and using a placeholder $2/hour figure per H800 GPU, which seems in the ballpark of real costs to me given Chinese electricity subsidies). Dario has said each Anthropic model is profitable over its lifetime. (And looking at ccusage stats and concluding that Anthropic is losing thousands per Claude Code user is nonsense; API prices aren't their real costs. That's why opencode gives free access to GLM 4.7 and other models: it was far cheaper than they expected, due to the excellent cache hit rates.) If anyone ran out of money, they would stop spending on experiments/research and training runs and be profitable... until their models became obsolete. But it's impossible for everyone to go bankrupt.<p>[1] <a href="https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md" rel="nofollow">https://github.com/deepseek-ai/open-infra-index/blob/main/20...</a></p>
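As a sanity check on that figure, here is the conversion from a cost-based profit margin (profit as a multiple of cost, which is how DeepSeek's write-up defines its 545%) into profit as a share of revenue. This is just illustrative arithmetic on their published theoretical number, not data from any other source:

```python
def profit_share_of_revenue(cost_profit_margin: float) -> float:
    """Convert margin = (revenue - cost) / cost into profit / revenue."""
    return cost_profit_margin / (1.0 + cost_profit_margin)

# DeepSeek's reported (theoretical) cost profit margin of 545%
share = profit_share_of_revenue(5.45)
print(f"{share:.1%}")  # roughly 84.5% of revenue would be profit
```

Note this is the theoretical margin assuming all usage were billed at R1 API prices; free-tier traffic and off-peak discounts would lower the realized share.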
]]></description><pubDate>Sun, 18 Jan 2026 01:13:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=46663852</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=46663852</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46663852</guid></item><item><title><![CDATA[New comment by versteegen in "Ask HN: How can we solve the loneliness epidemic?"]]></title><description><![CDATA[
<p>It's excellent that you're working on loneliness! Somehow. What does your startup actually do?</p>
]]></description><pubDate>Fri, 16 Jan 2026 11:00:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=46645184</link><dc:creator>versteegen</dc:creator><comments>https://news.ycombinator.com/item?id=46645184</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46645184</guid></item></channel></rss>