Hacker News: veselin

New comment by veselin in "GLM 5.2 beats Claude in our benchmarks"

veselin — Sun, 28 Jun 2026 20:03:43 +0000

Here, it appears they compare a single prompt, "find IDOR", against a multi-agent system. However, one can also start far more sophisticated skills that spin up subagents and mostly do the same in Claude Code, Codex, OpenCode, Pi, etc.

Which I guess makes what Semgrep sells obsolete. Unless they have built a Pareto-optimal point in terms of capabilities and token usage, maybe?

New comment by veselin in "Why current LLM costs are not sustainable"

veselin — Fri, 26 Jun 2026 09:12:21 +0000

The more I think about the problem, the more I believe this will be solved with US interventions. And the interventions will increase inflation by a lot, so prices will not go down.

The alternative, LLMs becoming more expensive in an Uber-like move, may not work because there is a lot of competition. I also don't think usage will increase 10x. I don't always have coding tasks for an LLM despite it being good.

My reasons to believe so are outside of what interests the HN community, and I am neither endorsing this behavior nor do I think it is that simple. But the US also has a huge debt that it must service. Wouldn't it be convenient if it was suddenly halved in actual value?

New comment by veselin in "MAI-Code-1-Flash"

veselin — Tue, 02 Jun 2026 20:06:59 +0000

Claude Code itself spins up a lot of its subagents with Haiku. The model has a low hallucination rate, so it is great for exploration tasks. I guess this will be the best use of this model as well. And that is a lot of tokens - many tasks spin up multiple exploration agents before the planning or fixing, which is then just a few tool calls.

New comment by veselin in "The bootstrapper's EU stack for under €10 per month"

veselin — Mon, 25 May 2026 19:42:06 +0000

I would argue that with AI, this becomes less of an issue. Connect N services, deploy to bare metal. Granted, AI is an additional cost now, whether local or remote. But so is the MacBook people use to develop their software.

New comment by veselin in "Gemini 3.5 Flash"

veselin — Tue, 19 May 2026 19:52:51 +0000

Exactly our experience too. Effectively, we catch these status codes and send the request to OpenAI instead. Retrying the same query in Gemini has a high chance of giving more or less the same status code.
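Roughly what I mean, as a minimal sketch and not our actual code. `ProviderError`, `FALLBACK_STATUSES` and `complete_with_fallback` are names I made up for illustration, and the providers are plain callables standing in for the real Gemini and OpenAI clients. The set of status codes is also an assumption; use whatever you actually observe.

```python
from typing import Callable, Iterable


class ProviderError(Exception):
    """Hypothetical error that carries the HTTP status a provider returned."""

    def __init__(self, status: int, message: str = ""):
        super().__init__(message or f"provider returned HTTP {status}")
        self.status = status


# Assumed set of statuses worth failing over on; tune to your own logs.
FALLBACK_STATUSES = frozenset({429, 500, 503})


def complete_with_fallback(
    prompt: str,
    providers: Iterable[Callable[[str], str]],
    fallback_statuses: frozenset = FALLBACK_STATUSES,
) -> str:
    """Try each provider once, in order, instead of retrying the same one."""
    last_error = None
    for call in providers:
        try:
            return call(prompt)
        except ProviderError as err:
            # Errors outside the fallback set (e.g. a bad request) are not
            # going to be fixed by another provider, so surface them.
            if err.status not in fallback_statuses:
                raise
            last_error = err
    if last_error is None:
        raise ValueError("no providers given")
    raise last_error
```

You would pass something like `[call_gemini, call_openai]`, where both are your own thin wrappers that raise `ProviderError` with the status code they got.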

New comment by veselin in "Does coding with LLMs mean more microservices?"

veselin — Mon, 06 Apr 2026 10:38:31 +0000

I think this is a promise, probably also for spec-driven development. You write the spec, the whole thing can be reimplemented in Rust tomorrow. Make small modules or libraries.

One colleague describes monolith vs microservices as "the grass is greener on the other side".

In the end, the problem with having microservices is that the release process becomes much harder. Every feature spans 3 services at least, with possible incompatibility between some of their versions. Precisely the work you cannot easily automate with LLMs.

New comment by veselin in "Caveman: Why use many token when few token do trick"

veselin — Sun, 05 Apr 2026 11:48:52 +0000

This is an experiment that, although not to this extreme, was tested by OpenAI. Their Responses API allows you to control verbosity:

https://developers.openai.com/api/reference/resources/respon...

I don't know their internal eval, but I think I have heard it neither hurts nor improves performance. But at least this parameter may affect how many comments are in the code.

New comment by veselin in "AutoKernel: Autoresearch for GPU Kernels"

veselin — Wed, 11 Mar 2026 10:09:10 +0000

I guess we will get a lot more benefit if we can get this to work on something like llama.cpp - it has a lot of kernels for different quantizations, a lot of home users, and high hardware diversity - so it is likely the place with the highest bang for the buck.

I guess they can be a contributor there.

New comment by veselin in "Anthropic announces proof of distillation at scale by MiniMax, DeepSeek, Moonshot"

veselin — Mon, 23 Feb 2026 19:23:41 +0000

I think they are signaling two things:

* Likely they will seek regulation that would ban some models. Not sure this can work, but they will certainly try.

* Likely they will not release some of their next models in the API.

New comment by veselin in "Gemini 3.1 Pro"

veselin — Thu, 19 Feb 2026 18:16:19 +0000

I am actually going to complain about this: that all of the Gemini models are still preview ones.

Anthropic seems the best in this. Everything is in the API on day one. OpenAI tends to push you toward a subscription first, but the API gets there a week or a few later. Now, Gemini 3 is not for production use and this is already the previous iteration. So, does Google even intend to release this model?

New comment by veselin in "GLM-4.7-Flash"

veselin — Mon, 19 Jan 2026 20:48:14 +0000

What is the state of using quants? For chat models, a few errors or lost intelligence may matter a little. But what is happening to tool calling in coding agents? Does it fail catastrophically after a few steps in the agent?

I am interested in whether I can run it on a 24GB RTX 4090.

Also, would vLLM be a good option?

New comment by veselin in "How to code Claude Code in 200 lines of code"

veselin — Sat, 10 Jan 2026 15:43:17 +0000

I am talking about SWE-bench-style problems, where Todo doesn't help, except for more parallelism.

New comment by veselin in "How to code Claude Code in 200 lines of code"

veselin — Fri, 09 Jan 2026 06:48:45 +0000

I run evals and the Todo tool doesn't help most of the time. Usually models on high thinking would maintain Todo/state in their thinking tokens. Where Todo helps is in cases like Anthropic models, which then run more parallel tool calls. If there is a Todo list call, then some of the actions after it are more efficient.

What you need to do is to match the distribution of how the models were RL-ed. So you are right to say that "do X in 200 lines" is a very small part of the job to be done.

New comment by veselin in "Gemini 3 Pro Model Card [pdf]"

veselin — Tue, 18 Nov 2025 14:38:43 +0000

I work a lot on testing, also on SWE-bench Verified. In my opinion, this benchmark is now good for catching whether you got some regression on the agent side.

However, above 75%, the results are likely about the same. The remaining instances are likely underspecified despite the effort of the authors that made the benchmark "verified". From what I have seen, these are often cases where the problem statement says implement X for Y, but the agent has to simply guess whether to implement the same for another case Y' - which leads to losing or winning an instance.

New comment by veselin in "Qwen3-Coder: Agentic coding in the world"

veselin — Wed, 23 Jul 2025 10:59:06 +0000

Does anybody know if one can find an inference provider that offers input token caching? It should be almost required for agentic use - first for speed, but also because almost all conversations start where the previous one ended, so cost may end up quite a bit higher with no caching.

I would have expected good providers like Together, Fireworks, etc. to support it, but I can't find it, except if I run vLLM myself on self-hosted instances.
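To make the cost point concrete, here is a back-of-the-envelope sketch. All numbers are made up: the prices are hypothetical per-million-token rates, not any provider's pricing, and `agent_input_cost` is just a helper I wrote for this comment. It assumes every request resends the whole history and that the cached prefix is billed at a discounted rate.

```python
from typing import Optional


def agent_input_cost(
    turns: int,
    tokens_per_turn: int,
    price_per_mtok: float,
    cached_price_per_mtok: Optional[float] = None,
) -> float:
    """Total input cost when each request resends the full conversation.

    Prices are per million input tokens. If cached_price_per_mtok is given,
    the already-seen history is billed at that rate and only the newly added
    tokens pay the full price.
    """
    total = 0.0
    history = 0  # tokens already sent in earlier requests
    for _ in range(turns):
        if cached_price_per_mtok is None:
            total += (history + tokens_per_turn) * price_per_mtok
        else:
            total += history * cached_price_per_mtok
            total += tokens_per_turn * price_per_mtok
        history += tokens_per_turn
    return total / 1_000_000


# Hypothetical agent run: 50 tool-call turns, 2000 new tokens per turn.
no_cache = agent_input_cost(50, 2000, price_per_mtok=1.0)
with_cache = agent_input_cost(50, 2000, price_per_mtok=1.0, cached_price_per_mtok=0.1)
print(f"no cache: {no_cache:.3f}  with cache: {with_cache:.3f}")
```

With these invented numbers the uncached run comes out around 2.55 against about 0.345 cached, so roughly 7x, and the gap grows with the number of turns because the resent history grows quadratically.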

New comment by veselin in "I'm dialing back my LLM usage"

veselin — Wed, 02 Jul 2025 16:07:55 +0000

I think that people are just too quick to assume this is amazing, before it is there. Which doesn't mean it won't get there.

Somehow if I take the best models and agents, most hard coding benchmarks are at below 50% and even SWE-bench Verified is at like 75, maybe 80%. Not 95. Assuming agents just solve most problems is incorrect, despite them being really good at first prototypes.

Also, in my experience agents are great to a point and then fall off a cliff. Not gradually. The types of errors you get past that point are so diverse that one cannot even explain them.

New comment by veselin in "Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison"

veselin — Mon, 31 Mar 2025 12:53:27 +0000

I noticed a similar trend in selling on X. Make a claim, peg it on some product A with good sales - Cursor, Claude, Gemini, etc. Then say that the best way to use A is with our product or guide, be it an MCP or something else.

For some of these I see something like 15k followers on X, but then no LinkedIn page, for example. The website always belongs to a company you cannot contact, and they do everything.

New comment by veselin in "AMD 3D V-Cache teardown shows majority of the Ryzen 7 9800X3D is dummy silicon"

veselin — Wed, 18 Dec 2024 16:58:26 +0000

Yes. The article is clickbait. With such a title I would have expected the majority of the area to be dummy, but it is just structurally more silicon, exactly like a picture may be mostly wood by mass.

New comment by veselin in "The lifecycle of a code AI completion"

veselin — Mon, 08 Apr 2024 05:58:17 +0000

I used them both.

I ended up disabling Copilot. The reason is that the completions do not always integrate with the rest of the code, in particular with non-matching brackets. Often it just repeats some other part of the code. I had far fewer cases of this with Cody. But, arguably, the difference is not huge. But then add the choice of models on top of this.

New comment by veselin in "Why AWS Supports Valkey"

veselin — Sat, 06 Apr 2024 09:40:19 +0000

It seems recent years have given us a lot of new licenses, first for core infra software and now for LLMs. They all basically say, in heavy legalese: these top 5-10 tech companies will not compete fairly with us, thus they are banned from using the software. The rest are welcome to use everything.

I wonder, if US monopoly regulation actually starts to work well, which I see some signs of happening, will all these licenses revert back to fully open source?