Hacker News: data_maan

New comment by data_maan in "Harness engineering: Leveraging Codex in an agent-first world"

data_maan — Sun, 07 Jun 2026 11:38:34 +0000

Was this in the GPT2 paper?

New comment by data_maan in "ML promises to be profoundly weird"

data_maan — Thu, 09 Apr 2026 15:11:30 +0000

If LLMs lie as much as the OP claims in the article, why can they then solve Olympiad math problems they never saw during training, consistently?

The aimoprize.com competition on Kaggle, for example, shows this.

New comment by data_maan in "Why the US Navy won't blast the Iranians and 'open' Strait of Hormuz"

data_maan — Thu, 02 Apr 2026 04:35:23 +0000

More "American minds": https://en.wikipedia.org/wiki/Hartmut_Esslinger

Chief designer at Apple was German.

New comment by data_maan in "Why the US Navy won't blast the Iranians and 'open' Strait of Hormuz"

data_maan — Wed, 01 Apr 2026 11:47:57 +0000

> built with American capital and mostly American minds.

I would say "built with American agency and commercial spirit", not minds.

Most of the things that we have were first built elsewhere (Germany being a prime supplier here, with the MP3 or the Zuse computers), but commercializing them was the contribution that came from America.

New comment by data_maan in "Why the US Navy won't blast the Iranians and 'open' Strait of Hormuz"

data_maan — Wed, 01 Apr 2026 11:37:46 +0000

To be fair, Iran is not pretentious either, killing a few thousand people because they dared to protest.

There are no good guys in this conflict.

New comment by data_maan in "Why the US Navy won't blast the Iranians and 'open' Strait of Hormuz"

data_maan — Wed, 01 Apr 2026 11:34:16 +0000

https://www.worldatlas.com/us-history/wars-the-united-states...

New comment by data_maan in "Epoch confirms GPT5.4 Pro solved a frontier math open problem"

data_maan — Tue, 24 Mar 2026 04:40:56 +0000

A model whose internals we can't access solved a problem that we didn't know was in its datasets. Great, I'm impressed.

New comment by data_maan in "Israel is running critically low on interceptors, US officials say"

data_maan — Sun, 15 Mar 2026 11:43:41 +0000

Strategic? Yes.

Moral? Hm. From a moral POV this would be about who has the right to terrorize the Iranian population: the Iranian government or the US and Israeli governments.

New comment by data_maan in "Tell HN: I'm 60 years old. Claude Code has re-ignited a passion"

data_maan — Sat, 07 Mar 2026 12:13:25 +0000

Opinions differ: hobby coders love it, but domain experts secretly despise it because it narrows the gap between the skills they spent years honing and the average Claude, I mean Joe, who just uses this mental exoskeleton.

New comment by data_maan in "Tech employment now significantly worse than the 2008 or 2020 recessions"

data_maan — Sat, 07 Mar 2026 11:10:22 +0000

> The people getting pushed out are the intermediates and seniors who aren't high performers.

Also the people who can't market themselves. There are very average programmers with a large following on X who seem to do very well.

New comment by data_maan in "Danish government agency to ditch Microsoft software (2025)"

data_maan — Wed, 25 Feb 2026 11:33:46 +0000

I love these posts that are so on the edge that I can't tell if they're sarcastic or for real :)

New comment by data_maan in "First Proof"

data_maan — Tue, 10 Feb 2026 19:24:11 +0000

> What do you mean ? These are top-notch mathematicians

Yes. I didn't dispute that. My point was that they are NOT top-notch ML specialists and have made one of the worst benchmarks of 2025-2026. Benchmarks like these might have worked in early 2024 at the latest. The field has moved on significantly since.

And yes, many many other benchmarks don't use toy problems -- their names are just a prompt away.

> You are kidding right ? FrontierMath benchmark [1] is produced by a startup whose incentives are dubious to say the least.

They 1) open-sourced some of their datapoints (on a similar order of magnitude) and 2) carried out detailed evals. There is much to learn from their blog posts, much more than from the current dataset.

But fair. If you don't like them, have a look at IMProofBench. Have a look at the AIMO competition. Have a look at HardMath. It's quite a landscape of datasets already.

> Unlike the AI hypesters, these are real mathematicians trying to inject some realism and really test the boundaries of these tools

As mentioned above, realistic benchmarks that are bigger and better exist. Unfortunately, from a benchmarking POV, these mathematicians are the hypesters, with a preprint that wouldn't even make it into the AI & Math workshops at ICML or NeurIPS.

New comment by data_maan in "First Proof"

data_maan — Tue, 10 Feb 2026 19:17:52 +0000

If it's the latter case (which it has to be), it seems that attention credit (via, e.g., articles in NY Times) is very unfairly distributed.

None of the people who advanced the state of benchmarking and did the hard work on much bigger benchmarks got any, but a ridiculous benchmark of 10 questions scored big.

New comment by data_maan in "First Proof"

data_maan — Tue, 10 Feb 2026 19:15:22 +0000

> We will learn if the magical capabilities attributed to these tools are really true or not.

They're not. We already know that: FrontierMath, Yu Tsumura's 553rd problem, the RealMath benchmark. The list goes on. As I have said many times in this thread, there is nothing novel in this benchmark.

The fact that this benchmark is so hyped shows that the community knows nothing, NOTHING, about prior work in this space, which makes me sad.

New comment by data_maan in "First Proof"

data_maan — Tue, 10 Feb 2026 19:12:46 +0000

> These problems are representative of the types of subproblems research mathematicians have to solve to get a “research result”. They are finding that LLMs aren’t that useful for mathematical research because they can’t crush these problems along the way. And I assume they put this doc together because they want that to change :)

Same holds true for IMProofBench problems. This dataset shows nothing new.

New comment by data_maan in "First Proof"

data_maan — Tue, 10 Feb 2026 19:11:46 +0000

But everything has been explored in other datasets already.

If only a bunch of mathematicians learn something, why are so many people talking about this, why is the NY Times posting about this?

This is the attention economy at its worst.

New comment by data_maan in "First Proof"

data_maan — Sun, 08 Feb 2026 08:28:33 +0000

It's not angst. It's intense frustration that 1) they are not doing the science correctly, and 2) others (e.g. FrontierMath) already did everything they claim to be doing, so we won't learn anything new here, yet somehow 1stproof gets all the credit.

New comment by data_maan in "First Proof"

data_maan — Sun, 08 Feb 2026 08:26:23 +0000

If you want to do this rigorously, you should run it as a competition like the guys at the AI-MO Prize are doing on Kaggle.

That way you get all the necessary data.

I still think this is bro science.

New comment by data_maan in "First Proof"

data_maan — Sun, 08 Feb 2026 08:24:48 +0000

> There are some experiments which cannot be carried out more than once

Yes, in which case a very detailed methodology is required: which hardware, runtimes, token counts etc.

This does none of that.

New comment by data_maan in "First Proof"

data_maan — Sun, 08 Feb 2026 08:23:22 +0000

It wasn't like this in any way.

CASP relies on a robust benchmark (not just 10 random proteins), and has clear participation criteria, objective metrics for how the eval plays out, etc.

So I stand by my claim: This isn't scientific. If CASP is Japan, a highly organized & civilized society, this is a banana republic.