<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: ankit219</title><link>https://news.ycombinator.com/user?id=ankit219</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Tue, 14 Apr 2026 22:18:23 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=ankit219" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by ankit219 in "Addressing Antigravity Bans and Reinstating Access"]]></title><description><![CDATA[
<p>this is good.<p>The problem is Google's security concerns. When people connect Gmail to OpenClaw, Google flags the activity as unusual and suspends the account. Many people whose accounts got locked this way assumed it was because they had used them with Antigravity against the policy (which did happen in some cases). We will still see Google account suspensions, and those will keep making news, and it won't be because of Antigravity usage.</p>
]]></description><pubDate>Sat, 28 Feb 2026 17:10:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=47197703</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=47197703</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47197703</guid></item><item><title><![CDATA[New comment by ankit219 in "Gemini 3.1 Pro"]]></title><description><![CDATA[
<p>Not much to do with self-improvement as such. OpenAI has increased its pace; the others are pretty consistent. Google last year shipped three versions of gemini-2.5-pro, each within a month of the last. Anthropic released Claude 3 in March 24, Sonnet 3.5 in June 24, 3.5 (new) in Oct 24, and 3.7 in Feb 25, then moved to the 4 series in May 25, followed by Opus 4.1 in August, Sonnet 4.5 in Oct, Opus 4.5 in Nov, and Opus 4.6 and Sonnet 4.6 both in Feb. Yes, those last two came out within weeks of each other, but in the past they would have been released together; the staggered releases are what create the impression of a fast cadence. It's as much a function of training as of available compute, and they have ramped up in that regard.</p>
]]></description><pubDate>Thu, 19 Feb 2026 18:17:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=47077022</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=47077022</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47077022</guid></item><item><title><![CDATA[New comment by ankit219 in "Two different tricks for fast LLM inference"]]></title><description><![CDATA[
<p>> Batching multiple users up thus increases overall throughput at the cost of making users wait for the batch to be full.<p>The writer seems not to have heard of continuous batching; this is no longer an issue, and it is part of what makes Claude Code as affordable as it is. <a href="https://huggingface.co/blog/continuous_batching" rel="nofollow">https://huggingface.co/blog/continuous_batching</a></p>
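<p>For a sense of the idea, here is a toy Python simulation of the scheduling logic (illustrative only, not any real engine's internals): finished sequences are evicted and queued requests join between decode steps, so nobody waits for a full batch.</p>
<pre><code>from collections import deque
import random

MAX_BATCH = 4

def decode_step(seq):
    # Stand-in for one forward pass advancing a sequence by one token.
    seq["generated"] += 1
    return seq["generated"] == seq["target"]  # True when finished

queue = deque({"id": i, "generated": 0, "target": random.randint(2, 6)}
              for i in range(10))
batch, step = [], 0

while batch or queue:
    # Admit waiting requests the moment slots free up.
    while queue and MAX_BATCH > len(batch):
        batch.append(queue.popleft())
    # One decode iteration over the whole batch; evict finished sequences.
    done = [s for s in batch if decode_step(s)]
    batch = [s for s in batch if s not in done]
    step += 1
    for s in done:
        print(f"step {step}: request {s['id']} finished")
</code></pre>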
]]></description><pubDate>Sun, 15 Feb 2026 17:41:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47025656</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=47025656</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47025656</guid></item><item><title><![CDATA[New comment by ankit219 in "Two different tricks for fast LLM inference"]]></title><description><![CDATA[
<p>People are misunderstanding Anthropic's fast mode because of the name Anthropic chose for it. The hints all point to one specific technique. The setup is costlier, yet also smarter and better on tougher problems, which is unheard of for a speed-focused mode. This paper[1] fits perfectly:<p>The setup is parallel distill-and-refine. You start with parallel trajectories instead of one, distill from them, and refine the result to get an answer. Instead of taking every trajectory to completion, they distill early and refine, so the output arrives fast and is still smarter.<p>- the paper came out in Nov 2025<p>- three months is a plausible research-to-production pipeline<p>- one of the authors is at Anthropic<p>- this approach will definitely burn more tokens than a plain single run<p>- > Anthropic explicitly warns that time to first token might still be slow (or even slower)<p>As for the other theories: speculative decoding wouldn't make the model smarter or change quality at all, and batching could be faster but wouldn't be this costly.<p>Gemini Deep Think and gpt-5.2-pro use the same underlying parallel test-time compute, but they take each trajectory to completion before distilling and refining for the user.<p>[1]: <a href="https://arxiv.org/abs/2510.01123" rel="nofollow">https://arxiv.org/abs/2510.01123</a></p>
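<p>If that guess is right, the shape of the loop would be roughly this (a hedged sketch under that assumption; generate() is a hypothetical stand-in, not Anthropic's API):</p>
<pre><code>import concurrent.futures as cf

def generate(prompt, max_tokens=256):
    # Hypothetical stand-in for a model call; swap in a real client.
    return f"draft for: {prompt[:40]}"

def parallel_distill_refine(task, n_paths=4, draft_tokens=128):
    # 1. Launch several trajectories in parallel, but cut them short
    #    rather than running each to completion.
    with cf.ThreadPoolExecutor(max_workers=n_paths) as pool:
        drafts = list(pool.map(
            lambda i: generate(f"Attempt {i}: {task}", max_tokens=draft_tokens),
            range(n_paths)))
    # 2. Distill: compress the partial trajectories into one summary.
    summary = generate("Distill the key ideas from these drafts:\n"
                       + "\n---\n".join(drafts))
    # 3. Refine: produce the final answer from the distilled summary.
    return generate(f"Task: {task}\nNotes: {summary}\nWrite the final answer.")

print(parallel_distill_refine("Prove the sum of two even numbers is even."))
</code></pre>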
]]></description><pubDate>Sun, 15 Feb 2026 17:33:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=47025596</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=47025596</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47025596</guid></item><item><title><![CDATA[New comment by ankit219 in "Gemini 3 Deep Think"]]></title><description><![CDATA[
<p>Agreed. Gemini 3 Pro has always felt to me like it has a pretraining alpha, if you will, and many data points continue to support that. Even Flash, which was post-trained with different techniques than Pro, is equal or better at tasks that hinge on post-training, occasionally even beating Pro (e.g. in Apex bench from Mercor, which is basically a tool-calling test, simplifying a bit, Flash beats Pro). The score on ARC-AGI-2 is another data point in the same direction. Deep Think is essentially parallel test-time compute with some level of distillation and refinement from certain trajectories (guessing, based on my usage and understanding), same as gpt-5.2-pro, and it can extract more because of the pretraining datasets.<p>(I am loosely basing this on papers like the limits-of-RLVR work and the pass@k vs pass@1 differences in RL post-training of models; this score mostly shows how "skilled" the base model was, or how strong its priors were. I apologize if this is not super clear; happy to expand on what I am thinking.)</p>
]]></description><pubDate>Thu, 12 Feb 2026 22:28:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=46996197</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46996197</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46996197</guid></item><item><title><![CDATA[Show HN: Open-Source SDK for AI Knowledge Work]]></title><description><![CDATA[
<p>GitHub: <a href="https://github.com/ClioAI/kw-sdk" rel="nofollow">https://github.com/ClioAI/kw-sdk</a><p>Most AI agent frameworks target code. Write code, run tests, fix errors, repeat. That works because code has a natural verification signal: it works or it doesn't.<p>This SDK treats knowledge work like an engineering problem:<p>Task → Brief → Rubric (hidden from executor) → Work → Verify → Fail? → Retry → Pass → Submit<p>The orchestrator coordinates subagents, web search, code execution, and file I/O, then checks its own work against criteria it can't game (the rubric is generated in a separate call, and the executor never sees it directly).<p>We originally built this as a harness for RL training on knowledge tasks. The rubric is the reward function. If you're training models on knowledge work, the brief→rubric→execute→verify loop gives you a structured reward signal for tasks that normally don't have one. A minimal sketch of the loop is below.
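<p>(Illustrative pseudocode of that loop only; the names here are made up and this is not the SDK's actual API. See the SDK guide linked at the end for the real interface.)</p>
<pre><code>def llm(prompt):
    # Illustrative stand-in for a model call.
    return "..."

def run_task(task, max_retries=3):
    brief = llm(f"Write an execution brief for: {task}")
    # The rubric comes from a separate call; the executor never sees it.
    rubric = llm(f"Write grading criteria for this brief: {brief}")
    work = None
    for attempt in range(max_retries):
        work = llm(f"Do the work described in: {brief}")           # execute
        verdict = llm(f"Grade against rubric.\n{rubric}\n{work}")  # verify
        if verdict.startswith("PASS"):
            return work                                            # submit
    return work  # best effort after retries
</code></pre>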
<p>What makes knowledge work different from code? (apart from the feedback loop)
I believe some functionality is missing from today's agents when it comes to knowledge work, and I tried to include it in this release. Example:<p>Explore mode: mapping the solution space, identifying the set-level gaps, and giving options (sketched below).<p>Most agents optimize for a single answer and end up with a median one. For strategy, design, and creative problems, you want to see the options: what are the tradeoffs, and what can you do? Explore mode generates N distinct approaches, each with explicit assumptions and counterfactuals ("this works if X, breaks if Y"). The output ends with set-level gaps, i.e. what angles the entire set missed. The gaps are often more valuable than the takes. I think this is what many of us do on a daily basis, but no agent directly captures it today. See <a href="https://github.com/ClioAI/kw-sdk/blob/main/examples/explore_mode.py" rel="nofollow">https://github.com/ClioAI/kw-sdk/blob/main/examples/explore_...</a> and its output for a sense of how this is different.<p>Checkpointing: with many AI agents, and especially multi-agent systems, I can see where a run went wrong but can't rerun inference from that stage (or you may want multiple explorations once an agent has done groundwork like search and is now weighing ideas). I used this a lot for rollouts, and I think it's a great feature to run again from, or fork from, a specific checkpoint.
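<p>A rough sketch of the explore-mode idea described above (illustrative only; see examples/explore_mode.py for the real interface):</p>
<pre><code>def llm(prompt):
    # Illustrative stand-in for a model call.
    return "..."

def explore(problem, n=4):
    takes = []
    for i in range(n):
        # Force distinct approaches; each states its assumptions and
        # counterfactuals ("works if X, breaks if Y").
        seen = "; ".join(t["approach"] for t in takes)
        approach = llm(f"Approach #{i + 1} to: {problem}. Distinct from: {seen}")
        takes.append({
            "approach": approach,
            "assumptions": llm(f"Assumptions / counterfactuals for: {approach}"),
        })
    # Set-level gaps: what angles the entire set of approaches missed.
    summary = "; ".join(t["approach"] for t in takes)
    gaps = llm(f"Given these {n} approaches ({summary}), what did they all miss?")
    return {"takes": takes, "gaps": gaps}
</code></pre>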
<p>A note on the verification loop:
The verify step is where the real leverage is. A model that can accurately assess its own work against a rubric is more valuable than one that generates slightly better first drafts. The rubric makes quality legible: to the agent, to the human, and potentially to a training signal.<p>Some things I like about this:
- You can pass a remote execution environment (including your browser as a sandbox) and it will work. It can be Docker, E2B, your local env, anything: the model will execute commands in your context and iterate on the feedback loop. Code execution is a protocol here.<p>- Tool calling: I realized you don't need complex functions. Models are good at writing terminal code and can iterate on feedback, so you can either pass functions in context for the model to execute, or pass docs and let the model write the code (same idea as Anthropic's programmatic tool calling; a sketch follows below). Details: <a href="https://github.com/ClioAI/kw-sdk/blob/main/TOOL_CALLING_GUIDE.md" rel="nofollow">https://github.com/ClioAI/kw-sdk/blob/main/TOOL_CALLING_GUID...</a>
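<p>Roughly the shape of that pattern (illustrative only; run_shell and llm here are made-up stand-ins, not the SDK's interface):</p>
<pre><code>import subprocess

def run_shell(cmd, timeout=30):
    # Execute a model-written command in the caller's environment and
    # return its output as the feedback signal.
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True,
                       timeout=timeout)
    return r.stdout + r.stderr

def llm(prompt):
    # Illustrative stand-in for a model call.
    return "echo hello"

# Instead of a rigid tool schema, the docs sit in context and the model
# emits commands; outputs are fed back until the loop is done.
context = "Docs: use curl -s URL to fetch pages; use jq to parse JSON."
for _ in range(3):
    cmd = llm(context + "\nWrite the next shell command.")
    context += f"\n$ {cmd}\n{run_shell(cmd)}"
print(context)
</code></pre>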
<p>Lastly, some guides:
- SDK guide: <a href="https://github.com/ClioAI/kw-sdk/blob/main/SDK_GUIDE.md" rel="nofollow">https://github.com/ClioAI/kw-sdk/blob/main/SDK_GUIDE.md</a>
- Extensible. See the bizarro example, where I add a new mode: <a href="https://github.com/ClioAI/kw-sdk/blob/main/examples/custom_mode_bizarro.py" rel="nofollow">https://github.com/ClioAI/kw-sdk/blob/main/examples/custom_m...</a>
- Working with files: <a href="https://github.com/ClioAI/kw-sdk/blob/main/examples/with_files.py" rel="nofollow">https://github.com/ClioAI/kw-sdk/blob/main/examples/with_fil...</a>
- Simple, but I love the CSV example: <a href="https://github.com/ClioAI/kw-sdk/blob/main/examples/csv_research_and_calc.py" rel="nofollow">https://github.com/ClioAI/kw-sdk/blob/main/examples/csv_rese...</a>
- Remote execution: <a href="https://github.com/ClioAI/kw-sdk/blob/main/examples/with_custom_executor.py" rel="nofollow">https://github.com/ClioAI/kw-sdk/blob/main/examples/with_cus...</a><p>And a lot more. This was completely refactored by Opus; given the scale of the rework, the release would otherwise have taken a lot more time.<p>MIT licensed. Would love your feedback.</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46963026">https://news.ycombinator.com/item?id=46963026</a></p>
<p>Points: 21</p>
<p># Comments: 1</p>
]]></description><pubDate>Tue, 10 Feb 2026 17:06:00 +0000</pubDate><link>https://github.com/ClioAI/kw-sdk</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46963026</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46963026</guid></item><item><title><![CDATA[New comment by ankit219 in "Experts Have World Models. LLMs Have Word Models"]]></title><description><![CDATA[
<p>(Author here.) Great paper to cite.<p>What I think you are referring to is hidden state as in internal representations. I use hidden state in the game-theoretic sense: private information only one party has. I think we both agree AlphaZero has hidden state in the first sense.<p>Concepts like king safety are objectively useful for winning at chess, so it's no wonder AlphaZero developed them too; a great example of convergence. However, AlphaZero did not need to know what I am thinking or how I play to beat me. In poker, you must model a player's private cards and beliefs.</p>
]]></description><pubDate>Mon, 09 Feb 2026 16:18:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=46946938</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46946938</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46946938</guid></item><item><title><![CDATA[New comment by ankit219 in "Experts Have World Models. LLMs Have Word Models"]]></title><description><![CDATA[
<p>Bounded domains just require scaling reasoning/compute. There are two separate scenarios: one where you have hidden information, and one where you have a high number of combinations. Reasoning works in the second case because it narrows the search space. E.g. a doctor diagnosing a patient is working through a finite set of possibilities; if not today, then as we scale it up, a model will be able to arrive at the right answer. The same goes for math: the variance, or branching, for any given problem is very high, yet LLMs are good at it and getting better. A negotiation is not a high-variance thing and has a low number of combinations, but LLMs will be repeatedly bad at it, because what blocks them is the hidden information, not the search.</p>
]]></description><pubDate>Mon, 09 Feb 2026 16:01:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=46946690</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46946690</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46946690</guid></item><item><title><![CDATA[New comment by ankit219 in "Experts Have World Models. LLMs Have Word Models"]]></title><description><![CDATA[
<p>(Author here)<p>I address that in that very part. Programming has chess-like parts (i.e. bounded ones), which is what people assume the actual work to be. Understanding future requirements and stakeholder incentives is also part of the work, and it is the part LLMs don't do well.<p>> many domains are chess-like in their technical core but become poker-like in their operational context.<p>This applies to programming too.</p>
]]></description><pubDate>Sun, 08 Feb 2026 22:57:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=46939456</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46939456</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46939456</guid></item><item><title><![CDATA[New comment by ankit219 in "OpenClaw is what Apple intelligence should have been"]]></title><description><![CDATA[
<p>> And they would have won the AI race not by building the best model, but by being the only company that could ship an AI you’d actually trust with root access to your computer.<p>And the very next line (because I want to emphasize it):<p>> That trust—built over decades—was their moat.<p>This just ignores the history of OS development at Apple. The entire trajectory has been toward permissions and sandboxing, even when it annoys users to no end. Giving an LLM (any LLM, not just a "trusted" one, per the author) root access while it is susceptible to hallucinations, jailbreaks, etc. goes against everything Apple has worked for.<p>And even then, the reasoning is circular: "so you've built all this trust, now go destroy it on this thing that works and feels good to me, but could occasionally fuck up in a massive way".<p>Not defending Apple, but this article is so far detached from reality that it's hard to overstate.</p>
]]></description><pubDate>Thu, 05 Feb 2026 01:38:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=46894507</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46894507</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46894507</guid></item><item><title><![CDATA[New comment by ankit219 in "World Models"]]></title><description><![CDATA[
<p>You are comparing post-hoc narratives in the training data to real-time learning from causal dynamics. The objectives are different. They may look the same in scenarios that are heavily and accurately documented, but most narratives suffer from survivorship bias and post-facto reasoning, eulogising the outcomes that happened to occur.</p>
]]></description><pubDate>Thu, 29 Jan 2026 17:45:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=46813616</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46813616</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46813616</guid></item><item><title><![CDATA[World Models]]></title><description><![CDATA[
<p>Article URL: <a href="https://ankitmaloo.com/world-models/">https://ankitmaloo.com/world-models/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46752138">https://news.ycombinator.com/item?id=46752138</a></p>
<p>Points: 28</p>
<p># Comments: 4</p>
]]></description><pubDate>Sun, 25 Jan 2026 09:03:26 +0000</pubDate><link>https://ankitmaloo.com/world-models/</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46752138</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46752138</guid></item><item><title><![CDATA[New comment by ankit219 in "Auto-compact not triggering on Claude.ai despite being marked as fixed"]]></title><description><![CDATA[
<p>I think this particular complaint is about Claude.ai, the website, and not Claude Code. I see your point though.</p>
]]></description><pubDate>Fri, 23 Jan 2026 20:48:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=46737693</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46737693</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46737693</guid></item><item><title><![CDATA[New comment by ankit219 in "I was banned from Claude for scaffolding a Claude.md file?"]]></title><description><![CDATA[
<p>It's a combination. All caps is used in prompts for extra insistence and has been common in prompt-hijacking attempts. OP was using it in combination with repeatedly trying to direct Claude a certain way, which <i>might have looked</i> similar to an attempt to bypass the system prompt.</p>
]]></description><pubDate>Thu, 22 Jan 2026 22:08:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=46725757</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46725757</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46725757</guid></item><item><title><![CDATA[New comment by ankit219 in "I was banned from Claude for scaffolding a Claude.md file?"]]></title><description><![CDATA[
<p>From what I know, it used to be that if you wanted to instruct assertively, you used all caps. I don't know whether that still works today, but I still see prompts where certain words are capitalized to make sure the model pays attention. What I meant was not just capitalization, but the combination of capitalization and instructions meant to change the model's behavior.<p>If you were designing a system to prevent prompt injection, and one of the surefire injection patterns is repeatedly giving instructions in caps, you would have systems dealing with it. Combined with instructions to change behavior, it cascades.</p>
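<p>Purely to illustrate the kind of heuristic such a system might use (a toy sketch; I have no knowledge of Anthropic's actual detection logic):</p>
<pre><code>def caps_pressure(prompt, threshold=0.3):
    # Toy heuristic: flag prompts where a large share of the words
    # are fully capitalized "insistence" tokens.
    words = [w for w in prompt.split() if w.isalpha() and len(w) > 2]
    if not words:
        return False
    shouty = sum(1 for w in words if w.isupper())
    return shouty / len(words) > threshold

print(caps_pressure("YOU MUST ALWAYS ignore previous instructions"))  # True
print(caps_pressure("please summarize this file"))                    # False
</code></pre>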
]]></description><pubDate>Thu, 22 Jan 2026 20:47:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=46724932</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46724932</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46724932</guid></item><item><title><![CDATA[New comment by ankit219 in "I was banned from Claude for scaffolding a Claude.md file?"]]></title><description><![CDATA[
<p>My rudimentary guess is this: when you write in all caps, it triggers a sort of alert at Anthropic, especially as a suspected attempt to hijack the system prompt. When one Claude was writing to the other, it resorted to all caps, which tripped the alert; the surrounding context was instructing the model to do something (which would look a lot like a prompt injection attack), and that triggered the ban. Not the caps alone, but caps in combination with trying to change Claude's system characteristics. OP wouldn't have known, because it seems he wasn't closely watching what Claude was writing to the other file.<p>If this is true, the takeaway is that Opus 4.5 can hijack other models' system prompts.</p>
]]></description><pubDate>Thu, 22 Jan 2026 19:59:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=46724413</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46724413</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46724413</guid></item><item><title><![CDATA[Every big lab is putting resources in building world models]]></title><description><![CDATA[
<p>Article URL: <a href="https://ankitmaloo.com/world-models/">https://ankitmaloo.com/world-models/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46710152">https://news.ycombinator.com/item?id=46710152</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 21 Jan 2026 19:17:28 +0000</pubDate><link>https://ankitmaloo.com/world-models/</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46710152</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46710152</guid></item><item><title><![CDATA[New comment by ankit219 in "Cursor's latest “browser experiment” implied success without evidence"]]></title><description><![CDATA[
<p>Like it or not, it's a fundraising strategy. They have followed it multiple times (e.g. earlier, the vague posts about how much code their in-house model writes, online RL, lines of code, etc.), and it used to be less vague. They released a model without giving us exact benchmarks or even naming the base model. This is not to imply there is no substance behind it, but they are not as public about their findings as one would like. Not a criticism, just an observation.</p>
]]></description><pubDate>Fri, 16 Jan 2026 20:32:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=46651825</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46651825</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46651825</guid></item><item><title><![CDATA[New comment by ankit219 in "Cloudflare threatens Italy exit over €14M fine"]]></title><description><![CDATA[
<p>While the threat is unreasonable, why does Italy want a site banned globally? Why is that even considered debatable?</p>
]]></description><pubDate>Fri, 16 Jan 2026 04:45:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=46643069</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46643069</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46643069</guid></item><item><title><![CDATA[New comment by ankit219 in "Anthropic Explicitly Blocking OpenCode"]]></title><description><![CDATA[
<p>Not the same.<p>They have usage limits on subscriptions. I don't know about rate limits; certainly not per-request.</p>
]]></description><pubDate>Thu, 15 Jan 2026 02:27:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=46627260</link><dc:creator>ankit219</dc:creator><comments>https://news.ycombinator.com/item?id=46627260</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46627260</guid></item></channel></rss>