<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: Zababa</title><link>https://news.ycombinator.com/user?id=Zababa</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 09 Apr 2026 10:31:02 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=Zababa" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by Zababa in "Are LLM merge rates not getting better?"]]></title><description><![CDATA[
<p>I think it is important to try to find more rigorous things to test than the general sentiment of the people using the tools, if only because the more benchmarks we have, the more we can improve models without regressions. METR is asking a really interesting question here: "are models improving at making one-shot PRs?" The answer seems to be yes, but slower than benchmarks suggest, if you look at the pass rates of different versions of Claude Sonnet. A reasonable objection is "you're not supposed to use them by making one-shot PRs", but then ideally we would need some kind of standardized test for the ability of models to incorporate feedback and evolve PRs.</p>
]]></description><pubDate>Thu, 12 Mar 2026 13:17:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=47350155</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=47350155</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47350155</guid></item><item><title><![CDATA[New comment by Zababa in "Are LLM merge rates not getting better?"]]></title><description><![CDATA[
<p>From the METR study (<a href="https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/" rel="nofollow">https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs...</a>):<p>>To study how agent success on benchmark tasks relates to real-world usefulness, we had 4 active maintainers from 3 SWE-bench Verified repositories review 296 AI-generated pull requests (PRs). We had maintainers (hypothetically) accept or request changes for patches as well as provide the core reason they were requesting changes: core functionality failure, patch breaks other code or code quality issues.<p>I would also advise taking a look at the rejection reasons for the PRs. For example, Figure 5 shows two rejections for "code quality" because of (and I quote) "looks like a useless AI slop comment." This is something models still do, but it is also very easily fixable. I think in that case the issue is that the level of commenting wanted hasn't been properly formalized in the repo, and the model hasn't been able to deduce it from the context it had.<p>As for the article, I think mixing all models together doesn't make sense. For example, maybe a slope describes the improvement across Claude Sonnet versions better than a step function.</p>
]]></description><pubDate>Thu, 12 Mar 2026 13:11:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=47350095</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=47350095</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47350095</guid></item><item><title><![CDATA[New comment by Zababa in "3D-Knitting: The Ultimate Guide"]]></title><description><![CDATA[
<p>> surely the solution to fast fashion is just to not buy and throw away so many clothes?<p>"Just don't do X" has basically never worked; it is not a serious solution to any problem.</p>
]]></description><pubDate>Thu, 12 Mar 2026 11:03:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=47349030</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=47349030</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47349030</guid></item><item><title><![CDATA[New comment by Zababa in "The optimal age to freeze eggs is 19"]]></title><description><![CDATA[
<p>I agree that revealed preferences are stronger signals than stated ones. <a href="https://funds.effectivealtruism.org/" rel="nofollow">https://funds.effectivealtruism.org/</a> shows 52000 donors for $110M, <a href="https://www.givingwhatwecan.org/" rel="nofollow">https://www.givingwhatwecan.org/</a> says more than 10000 donors and more than $490M given.</p>
]]></description><pubDate>Tue, 10 Mar 2026 12:26:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=47322287</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=47322287</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47322287</guid></item><item><title><![CDATA[New comment by Zababa in "The optimal age to freeze eggs is 19"]]></title><description><![CDATA[
<p>That is true but also a bit unfair; they've also been oddly preoccupied with topics like trying to help the most people, and they frequently promote giving money to effective charities that fight malaria and vitamin A deficiency and help vaccinate children in very poor countries.</p>
]]></description><pubDate>Tue, 10 Mar 2026 09:44:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=47321036</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=47321036</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47321036</guid></item><item><title><![CDATA[New comment by Zababa in "Hetzner Prices increase 30-40%"]]></title><description><![CDATA[
<p>This image comes from running the different versions of the benchmark games programs. Some of the differences between languages may actually just be algorithmic differences, and those programs are in general not representative of most real-world software.</p>
]]></description><pubDate>Mon, 23 Feb 2026 13:43:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=47122232</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=47122232</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47122232</guid></item><item><title><![CDATA[New comment by Zababa in "Tesla has to pay historic $243M judgement over Autopilot crash, judge says"]]></title><description><![CDATA[
<p>I have no tolerance for bystanders being killed in general. If the science experiments kill fewer bystanders on average, I'm all for them; if they don't, they should be stopped until made safer.</p>
]]></description><pubDate>Fri, 20 Feb 2026 20:36:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=47093557</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=47093557</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47093557</guid></item><item><title><![CDATA[New comment by Zababa in "Rolling your own serverless OCR in 40 lines of code"]]></title><description><![CDATA[
<p>HathiTrust (<a href="https://en.wikipedia.org/wiki/HathiTrust" rel="nofollow">https://en.wikipedia.org/wiki/HathiTrust</a>) has 6.7 million volumes in the public domain, in PDF from what I understand. That would be around a billion pages, if we consider that a volume is ~200 pages. That's 5000 days to go through with an A100-40G at 200k pages a day. That is one way to interpret what they say as being legal. I don't have any information on what happens at DeepSeek, so I can't say if it's true or not.</p>
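<p>A minimal sketch of the back-of-envelope arithmetic above (the ~200 pages/volume and 200k pages/day figures are assumptions taken from the comment, not measurements):</p>

```python
volumes = 6_700_000        # public-domain volumes in HathiTrust, per the comment
pages_per_volume = 200     # rough assumption from the comment
throughput = 200_000       # pages per day on one A100-40G, per the comment

pages = volumes * pages_per_volume   # ~1.3 billion pages in total
days = pages / throughput            # days of OCR at that throughput
print(f"{pages:,} pages -> {days:,.0f} days")
```

Rounding the page count down to an even billion gives the ~5000-day figure used in the comment.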
]]></description><pubDate>Mon, 16 Feb 2026 14:56:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=47035798</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=47035798</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47035798</guid></item><item><title><![CDATA[New comment by Zababa in "Qwen3.5: Towards Native Multimodal Agents"]]></title><description><![CDATA[
<p>Has the difference between performance in "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? Like if a model is great in regular benchmarks and terrible in ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?</p>
]]></description><pubDate>Mon, 16 Feb 2026 14:48:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=47035685</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=47035685</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47035685</guid></item><item><title><![CDATA[New comment by Zababa in "English professors double down on requiring printed copies of readings"]]></title><description><![CDATA[
<p>>Sure, you can have my little assessment at the end if you like, but I work for the students, not for the companies.<p>Most of the students are here because they want to be in the companies, not for the joy of learning.</p>
]]></description><pubDate>Mon, 02 Feb 2026 13:09:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=46855617</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=46855617</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46855617</guid></item><item><title><![CDATA[New comment by Zababa in "English professors double down on requiring printed copies of readings"]]></title><description><![CDATA[
<p>>Last semester, professor Pamela Newton, who also teaches the course, allowed students to bring readings either on tablets or in printed form. While laptops felt like a “wall” in class, Newton said, students could use iPads to annotate readings and lie them flat on the table during discussions. However, Newton said she felt “paranoid” that students could be texting during class.<p>>This semester, Newton has removed the option to bring iPads to class, except for accessibility needs, as a part of the general movement in the “Reading and Writing the Modern Essay” seminars to “swim against the tide of AI use,” reduce “the infiltration of tech,” and “go back to pen and paper,” she said.<p>Is this about teaching efficiency or managing the teacher's feelings? If "the infiltration of tech" allowed for better learning, would this teacher even be open to it?</p>
]]></description><pubDate>Mon, 02 Feb 2026 13:06:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=46855595</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=46855595</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46855595</guid></item><item><title><![CDATA[New comment by Zababa in "How AI assistance impacts the formation of coding skills"]]></title><description><![CDATA[
<p>>It's pretty insidious to think that these AI labs want you become so dependent on them so that once the VC-gravy-train stops they can hike the token price 10x and you'll still pay because you have no other choice.<p>I don't think that's true? From what I understand most labs are making money from subscription users (maybe not if you include training costs, but still, they're not selling at a loss).<p>>(thankfully market dynamics and OSS alternatives will probably stop this but it's not a guarantee, you need like at least six viable firms before you usually see competitive behavior)<p>OpenAI is very aggressive with the volume of usage you can get from Codex, Google/DeepMind with Gemini. Anthropic reduced the token price with the latest Opus release (4.5).</p>
]]></description><pubDate>Fri, 30 Jan 2026 18:58:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=46828398</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=46828398</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46828398</guid></item><item><title><![CDATA[New comment by Zababa in "Trinity large: An open 400B sparse MoE model"]]></title><description><![CDATA[
<p>>To get to the point of executing a successful training run like that, you have to count every failed experiment and experiment that gets you to the final training run.<p>I get the sentiment, but then, do you count all the other experiments that were done by that company before specifically trying to train this model? All the experiments done by people in that company while they were at other companies, since they rely on that experience to train models?<p>You could say "count everything that has been done since the last model release", but then for the same amount of effort/GPU, if you release 3 models, does that divide each model's cost by 3?<p>Genuinely curious about how you think about this. I think saying "the model cost is the final training run" is fine, as it seems standard ever since DeepSeek V3, but I'd be interested if you have alternatives. Possibly "actually don't even talk about model cost as it will always be misleading and you can never really spend the same amount of money to get the same model"?</p>
]]></description><pubDate>Thu, 29 Jan 2026 08:41:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=46807407</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=46807407</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46807407</guid></item><item><title><![CDATA[New comment by Zababa in "Trinity large: An open 400B sparse MoE model"]]></title><description><![CDATA[
<p>>E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.<p>I think in that specific case that says more about LMArena than about the newer models. Remember that GPT-4o was so specifically loved by people that when GPT-5 replaced it, there was a lot of backlash against OpenAI.<p>One of the popular benchmarks right now is METR's, which shows some real improvement with newer models, like Opus 4.5. Another way of getting data is anecdotes: lots of people are really impressed with Opus 4.5 and Codex 5.2 (but those are hard to disentangle from people getting better with those tools, the scaffolding (Claude Code, Codex) getting better, and lots of other stuff). SWE-bench is still not saturated (less than 75%, I think).</p>
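<p>The ~70% figure in the quoted comment follows from the standard Elo expected-score formula (logistic in base 10 with a 400-point scale); a quick sketch, using the ratings from the quote:</p>

```python
def elo_win_prob(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# gemini-3-pro (1488) vs gpt-4o-2024-05-13 (1346), per the quoted lmarena numbers
print(f"{elo_win_prob(1488, 1346):.0%}")  # -> 69%, i.e. roughly the 70% cited
```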
]]></description><pubDate>Thu, 29 Jan 2026 08:35:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=46807353</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=46807353</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46807353</guid></item><item><title><![CDATA[New comment by Zababa in "Water 'Bankruptcy' Era Has Begun for Billions, Scientists Say"]]></title><description><![CDATA[
<p>Why would you leave the question of whether it's true or not aside? If it's false, isn't it a good thing that not many people are ready to admit something false?</p>
]]></description><pubDate>Mon, 26 Jan 2026 14:28:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=46766053</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=46766053</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46766053</guid></item><item><title><![CDATA[New comment by Zababa in "Scaling long-running autonomous coding"]]></title><description><![CDATA[
<p>>My digital thermometer doesn't think. Imbibing LLM's with thought will start leading to some absurd conclusions.<p>What kind of absurd conclusions? And what kind of non-absurd conclusions can you make when you follow your, let's call it "mechanistic", view?<p>>It's an algorithm and a completely mechanical process which you can quite literally copy time and time again. Unless of course you think 'physical' computers have magical powers that a pen and paper Turing machine doesn't?<p>I don't, just like I don't think a human or animal brain has any magical power that imbues it with "intelligence" and "reasoning".<p>>A cursory read of basic philosophy would help elucidate why casually saying LLM's think, reason etc is not good enough.<p>I'm not saying they do or they don't; I'm saying that, from what I've seen, having a strong opinion about whether they think or not seems to lead people to weird places.<p>>What is thinking? What is intelligence? What is consciousness? These questions are difficult to answer. There is NO clear definition.<p>You seem pretty certain that whatever those three things are, an LLM isn't doing them, a paper and pencil aren't doing them even when manipulated by a human, and the system of a human manipulating a paper and pencil isn't doing them.</p>
]]></description><pubDate>Tue, 20 Jan 2026 14:11:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=46691985</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=46691985</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46691985</guid></item><item><title><![CDATA[New comment by Zababa in "Scaling long-running autonomous coding"]]></title><description><![CDATA[
<p>Can you give examples of how that "LLM's do not think, understand, reason, reflect, comprehend and they never shall" or that "completely mechanical process" helps you understand better when LLMs work and when they don't?<p>Many people are throwing around that they don't "think", that they aren't "conscious", that they don't "reason", but I don't see those people sharing interesting heuristics for using LLMs well. The "they don't reason" people tend to, in my opinion/experience, underestimate them by a lot, often claiming that they will never be able to do <thing that LLMs have been able to do for a year>.<p>To be fair, the "they reason/are conscious" people tend to, in my opinion/experience, overestimate how much an LLM being able to "act" a certain way in a certain situation says about the LLM/LLMs as a whole ("act" is not a perfect word here; another way of looking at it is that they visit only the coast of a country and conclude that the whole country must be sailors with a sailing culture).</p>
]]></description><pubDate>Tue, 20 Jan 2026 13:46:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=46691787</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=46691787</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46691787</guid></item><item><title><![CDATA[New comment by Zababa in "Giving university exams in the age of chatbots"]]></title><description><![CDATA[
<p>>If you have this great resource available to you (an LLM) you better show that you read and checked its output. If there's something in the LLM output you do not understand or check to be true, you better remove it.<p>You could say the same about what people find on the web, yet LLMs are penalized more than web search.<p>>If you do not use LLMs and just misunderstood something, you will have an (flawed) justification for why you wrote this. If there's something flawed in an LLM, the likelihood that you do not have any justification except for "the LLM said so" is quite high and should thus be penalized higher.<p>Swap "LLMs" for "websites" and you could say the exact same thing.<p>The author has this in their conclusions:<p>>One clear conclusion is that the vast majority of students do not trust chatbots. If they are explicitly made accountable for what a chatbot says, they immediately choose not to use it at all.<p>This is not true. What is true is that if the students are more accountable for their use of LLMs than their use of websites, they prefer using websites. What is "more" here? We have no idea; the author didn't say. It could be that an error from a website or your own mind is -1 point and from an LLM is -2, so LLMs would have to make half as many mistakes as websites and your own mind. It could be -1 and -1.25. It could be -1 and -10.<p>The author even says themselves:<p>>In retrospect, my instructions were probably too harsh and discouraged some students from using chatbots.<p>But they don't note the bias their grading introduced against LLMs.</p>
]]></description><pubDate>Tue, 20 Jan 2026 12:37:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=46691174</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=46691174</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46691174</guid></item><item><title><![CDATA[New comment by Zababa in "Giving university exams in the age of chatbots"]]></title><description><![CDATA[
<p>> Mistakes made by chatbots will be considered more important than honest human mistakes, resulting in the loss of more points.<p>>I thought this was fair. You can use chatbots, but you will be held accountable for it.<p>So you're actually held more accountable for the output? I'd be interested in how many students would choose to use LLMs if mistakes weren't penalized more.</p>
]]></description><pubDate>Tue, 20 Jan 2026 11:02:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=46690508</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=46690508</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46690508</guid></item><item><title><![CDATA[Vibe Kanban]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.vibekanban.com/">https://www.vibekanban.com/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46678998">https://news.ycombinator.com/item?id=46678998</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Mon, 19 Jan 2026 13:54:36 +0000</pubDate><link>https://www.vibekanban.com/</link><dc:creator>Zababa</dc:creator><comments>https://news.ycombinator.com/item?id=46678998</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46678998</guid></item></channel></rss>