<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: shawntan</title><link>https://news.ycombinator.com/user?id=shawntan</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 04 May 2026 17:26:19 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=shawntan" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by shawntan in "Less is more: Recursive reasoning with tiny networks"]]></title><description><![CDATA[
<p>The question I keep coming back to is whether ARC-AGI is intended to evaluate generalisation to the task at hand. That would mean the test data has a meaningful distribution shift from the training data, and only a model that can perform said generalisation can do well.<p>This would all go out the window if the model being evaluated can _see_ the type of distribution shift it would encounter at test time. And it's unclear whether the shift is the same in the hidden set.<p>The performance of the large models against the smaller ones raises questions about the evaluations, especially given the ablation studies.
Are the large models trained on the same data as these tiny models? Should they be? If they shouldn't, then why are we allowing the small models access to that data during training?</p>
]]></description><pubDate>Tue, 14 Oct 2025 02:06:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=45575496</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=45575496</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45575496</guid></item><item><title><![CDATA[New comment by shawntan in "Less is more: Recursive reasoning with tiny networks"]]></title><description><![CDATA[
<p>This would not help if no proper constraints are established on what data can and cannot be trained on. It might also help to first figure out what the goal of the benchmark is.<p>If the goal is to test generalisation capability, then what data the evaluated model was trained on is crucial to drawing any conclusions.<p>Look at the construction of this synthetic dataset, for example: <a href="https://arxiv.org/pdf/1711.00350" rel="nofollow">https://arxiv.org/pdf/1711.00350</a></p>
]]></description><pubDate>Tue, 14 Oct 2025 02:00:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=45575458</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=45575458</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45575458</guid></item><item><title><![CDATA[New comment by shawntan in "Less is more: Recursive reasoning with tiny networks"]]></title><description><![CDATA[
<p>You can have benchmarks with specifically constructed train-test splits for task-specific models. Train only on the training set; what you report should be your results on the test set.<p>You can still game those benchmarks (tune your hyperparameters after looking at test results), but that setting measures generalisation to the test set _given_ the specified training set. Using any additional data goes against the benchmark rules, and such results should not be compared on the same footing.</p>
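<p>A minimal sketch of that protocol (assuming a scikit-learn style setup; the dataset and model here are just stand-ins). The point is that the test split is touched exactly once, after all training decisions are made:<p><pre><code>from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# The benchmark fixes this split; training may only ever see the train portion.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The test set is used once, at the end, and this is the number reported.
print("reportable score:", model.score(X_test, y_test))
</code></pre></p>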
]]></description><pubDate>Tue, 14 Oct 2025 01:55:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=45575425</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=45575425</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45575425</guid></item><item><title><![CDATA[New comment by shawntan in "Less is more: Recursive reasoning with tiny networks"]]></title><description><![CDATA[
<p>This is a point I wish more people would recognise.</p>
]]></description><pubDate>Thu, 09 Oct 2025 17:50:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=45530855</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=45530855</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45530855</guid></item><item><title><![CDATA[New comment by shawntan in "Less is more: Recursive reasoning with tiny networks"]]></title><description><![CDATA[
<p>I should probably also add: it's long been known that Universal / Recursive Transformers can solve _simple_ synthetic tasks that vanilla Transformers cannot.<p>Just check out the original UT paper, or some of its follow-ups: Neural Data Router, <a href="https://arxiv.org/abs/2110.07732" rel="nofollow">https://arxiv.org/abs/2110.07732</a>; Sparse Universal Transformers (SUT), <a href="https://arxiv.org/abs/2310.07096" rel="nofollow">https://arxiv.org/abs/2310.07096</a>.<p>There is even theoretical justification for why: <a href="https://arxiv.org/abs/2503.03961" rel="nofollow">https://arxiv.org/abs/2503.03961</a><p>The challenge is actually scaling them up to be useful as LLMs as well (I describe why this is hard in the SUT paper).<p>Given the way ARC-AGI is allowed to be evaluated, it's hard to say whether this is actually what is at play. My gut tells me, given the type of data that's been allowed into the training set, that some leakage of the evaluation has happened in both HRM and TRM.<p>But because we as a field have given up on carefully ensuring that training and test data don't contaminate each other, we just decide it's fine and that the effect is minimal. For LLMs especially, a test-set example leaking into the training data is merely a drop in the bucket (I don't believe we should dismiss it this way, but that's a whole 'nother conversation).<p>With challenge-targeted models like these, that leakage becomes a much larger proportion of what influences the model's behaviour, especially when the open evaluation sets are there for everyone to look at and to simply generate more of. Now we don't know if we're generalising or memorising.</p>
]]></description><pubDate>Wed, 08 Oct 2025 04:23:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=45512041</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=45512041</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45512041</guid></item><item><title><![CDATA[New comment by shawntan in "Less is more: Recursive reasoning with tiny networks"]]></title><description><![CDATA[
<p>> Do you mean that HRM and TRM are specifically trained on a small dataset of ARC-AGI samples, while LLMs are not? Or which difference exactly do you hint at?<p>Yes, precisely this. The question is really: what is ARC-AGI evaluating for?<p>1. If the goal is to see whether models can generalise to the ARC-AGI evals, then models being evaluated on it should not be trained on the tasks. Especially IF the ARC-AGI evaluations are constructed to be OOD from the ARC-AGI training data; I don't know if they are. Further, in the HRM case the few-shot examples in the evals seem to have been used to construct more training data. TRM may do something similar in its training data via other means.<p>2. If the goal is that these evaluations should still be difficult even _having seen_ the training examples, and having created more training examples (after having peeked at the test set), then the ablations show that you can get pretty far without universal/recurrent Transformers.<p>If 1, then I think the ARC Prize organisers should lay out better rules for the challenge. From the blog post, I do wonder how far people will push the boundary (how much can I look at the test data to 'augment' my training data?) before the organisers say "This is explicitly not allowed for this challenge."<p>If 2, the organisers should have evaluated how much of a challenge it would actually be once extreme 'data augmentation' is allowed, and maybe they would have realised it wasn't much of a challenge to begin with.<p>Given the outcomes of both HRM and this paper, I tend to agree that the ARC-AGI folks do seem to allow this setting, _and_ that the task isn't as "AGI complete" as it sets out to be.</p>
]]></description><pubDate>Wed, 08 Oct 2025 03:44:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=45511840</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=45511840</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45511840</guid></item><item><title><![CDATA[New comment by shawntan in "Less is more: Recursive reasoning with tiny networks"]]></title><description><![CDATA[
<p>Right. There should really be a vanilla Transformer baseline.<p>As for recurrence: the idea has been around for a while: <a href="https://arxiv.org/abs/1807.03819" rel="nofollow">https://arxiv.org/abs/1807.03819</a><p>There are reasons why it hasn't really been picked up at scale, even though the method tends to do well on synthetic tasks.</p>
]]></description><pubDate>Tue, 07 Oct 2025 20:08:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=45508175</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=45508175</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45508175</guid></item><item><title><![CDATA[New comment by shawntan in "Less is more: Recursive reasoning with tiny networks"]]></title><description><![CDATA[
<p>That analysis worded its evaluation of HRM and its contributions very diplomatically. The comparison with a recursive / universal Transformer in the same settings is telling:<p>"These results suggest that the performance on ARC-AGI is not an effect of the HRM architecture. While it does provide a small benefit, a replacement baseline transformer in the HRM training pipeline achieves comparable performance."</p>
]]></description><pubDate>Tue, 07 Oct 2025 20:00:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=45508077</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=45508077</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45508077</guid></item><item><title><![CDATA[New comment by shawntan in "Less is more: Recursive reasoning with tiny networks"]]></title><description><![CDATA[
<p>I think everyone should read the post from the ARC-AGI organisers about HRM carefully: <a href="https://arcprize.org/blog/hrm-analysis" rel="nofollow">https://arcprize.org/blog/hrm-analysis</a><p>With the same data augmentation / 'test time training' setting, vanilla Transformers do pretty well, close to the "breakthrough" HRM reported. From a brief skim, this paper uses similar settings for its ARC-AGI comparisons.<p>I, too, want to believe in smaller models with excellent reasoning performance. But we should first understand what ARC-AGI tests for, what the general setting is -- the one commercial LLMs use to compare against each other -- and what specialised setting HRM and this paper use for evaluation.<p>The naming of that benchmark lends itself to hype, as we've seen with both HRM and this paper.</p>
]]></description><pubDate>Tue, 07 Oct 2025 19:51:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=45507970</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=45507970</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45507970</guid></item><item><title><![CDATA[New comment by shawntan in "LLM-Deflate: Extracting LLMs into Datasets"]]></title><description><![CDATA[
<p>Not sure if you mean in general, but I'll answer both branches of the question.<p>In general:
Depending on the method, compression can be lossy or lossless. Using 7zip on a bunch of text files compresses that data losslessly. Briefly: you compute the statistics of the data you want to compress (the dictionary), then make the commonly recurring chunks describable with fewer bits (the encoding). The compressed file basically contains the dictionary and the encoding.<p>For LLMs:
There are ways to use an LLM (or any statistical model of text) to compress text data, but the techniques use a setup similar to the above, with a dictionary and an encoding, the LLM playing the role of the dictionary. When "extracting" data from the dictionary alone, you're basically sampling from the dictionary's distribution.<p>Quantitatively, the "loss" in "lossy" being described here is literally the number of bits used for the encoding.<p>I wrote a brief description here of techniques from an undergrad CS course that can be used: <a href="https://blog.wtf.sg/posts/2023-06-05-yes-its-just-doing-compression.-no-its-not-the-diss-you-think-it-is./" rel="nofollow">https://blog.wtf.sg/posts/2023-06-05-yes-its-just-doing-comp...</a></p>
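<p>As a toy illustration of the dictionary/encoding split (an assumption-laden sketch: a unigram character model stands in for the statistical model, and I compute the ideal Shannon code length rather than running a real entropy coder):<p><pre><code>import math
from collections import Counter

def ideal_code_length_bits(text: str) -> float:
    """Shannon code length of `text` under its own unigram statistics."""
    counts = Counter(text)          # the "dictionary": symbol statistics
    total = sum(counts.values())
    # An ideal entropy coder spends -log2(p) bits per symbol.
    return sum(-math.log2(counts[ch] / total) for ch in text)

text = "the quick brown fox jumps over the lazy dog " * 20
bits = ideal_code_length_bits(text)
print(f"raw: {len(text) * 8} bits, ideal unigram encoding: {bits:.0f} bits")
</code></pre><p>A sharper model (up to and including an LLM) assigns higher probabilities to the actual data and so needs fewer bits for the encoding; sampling from the model without that bitstream reproduces the statistics, not the original data.</p>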
]]></description><pubDate>Sat, 20 Sep 2025 17:00:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=45315073</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=45315073</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45315073</guid></item><item><title><![CDATA[New comment by shawntan in "LLM-Deflate: Extracting LLMs into Datasets"]]></title><description><![CDATA[
<p>The compression is lossy.</p>
]]></description><pubDate>Sat, 20 Sep 2025 16:50:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=45314969</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=45314969</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45314969</guid></item><item><title><![CDATA[New comment by shawntan in "Launch HN: Channel3 (YC S25) – A database of every product on the internet"]]></title><description><![CDATA[
<p>Sup!</p>
]]></description><pubDate>Thu, 21 Aug 2025 13:02:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=44972249</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=44972249</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44972249</guid></item><item><title><![CDATA[New comment by shawntan in "Launch HN: Channel3 (YC S25) – A database of every product on the internet"]]></title><description><![CDATA[
<p>2nd employee at Semantics3 here. Considering all the AI available today, I think things like product disambiguation become wayyy easier. We tried many tricks and heuristics to identify the same product across sites.</p>
]]></description><pubDate>Thu, 21 Aug 2025 03:06:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=44968668</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=44968668</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44968668</guid></item><item><title><![CDATA[New comment by shawntan in "Compiling LLMs into a MegaKernel: A path to low-latency inference"]]></title><description><![CDATA[
<p>Systems might want to anticipate changes in LLM architectures (even small changes can make a big difference kernel-wise), so it's good not to "bake" too much in ahead of time.<p>That said, at some point it just depends where the costs lie, and it might make sense to hire some GPU engineers to do what they did here for whatever architecture you're optimising for.<p>Not as low-hanging as you might imagine.</p>
]]></description><pubDate>Thu, 19 Jun 2025 20:48:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=44322413</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=44322413</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44322413</guid></item><item><title><![CDATA[New comment by shawntan in "Gemini Diffusion"]]></title><description><![CDATA[
<p>I'm curious how the speed is achieved if this is the technique used. I would generally expect this "masked language model" technique to be far slower, since the full vocab projection needs to be computed every iteration.<p>I always thought the eventual technique would be some form of diffusion in continuous space, then decoding into the discrete tokens.<p>Also, I'm guessing this is a "best guess" at how Gemini Diffusion is done?</p>
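<p>For what it's worth, my mental model of the masked-LM technique is roughly the loop below (a hedged sketch, not Gemini Diffusion's actual implementation; "model" here is a hypothetical masked LM returning logits for every position):<p><pre><code>import torch

def masked_decode(model, seq_len, mask_id, steps=8):
    tokens = torch.full((seq_len,), mask_id)
    for _ in range(steps):
        remaining = tokens == mask_id
        if not remaining.any():
            break
        logits = model(tokens)          # full vocab projection, every iteration
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)  # a real decoder would exclude mask_id here
        conf[~remaining] = -1.0         # never revisit committed positions
        k = max(1, int(remaining.sum()) // 2)
        idx = conf.topk(k).indices      # unmask the k most confident positions
        tokens[idx] = pred[idx]
    return tokens

# Stand-in model with random logits, just to exercise the loop.
out = masked_decode(lambda t: torch.randn(t.shape[0], 100),
                    seq_len=16, mask_id=99)
</code></pre><p>Each pass prices out the whole vocabulary at every position, which is why I'd expect it to be slow unless several tokens are committed per iteration.</p>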
]]></description><pubDate>Thu, 22 May 2025 17:08:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=44064148</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=44064148</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44064148</guid></item><item><title><![CDATA[New comment by shawntan in "RWKV Language Model"]]></title><description><![CDATA[
<p>The formulations in "Attention as an RNN" have issues similar to RWKV's. Fundamentally, it's a question of what we call an RNN.<p>Personally, I think it's important not to call some of these recent architectures RNNs, because they have theoretical properties that do not match (read: are worse than) what we've "classically" called RNNs.<p>Ref: <a href="https://arxiv.org/abs/2404.08819" rel="nofollow">https://arxiv.org/abs/2404.08819</a><p>As a rule of thumb: you generally don't get parallelism for free; you pay for it with poorer expressivity.</p>
]]></description><pubDate>Thu, 02 Jan 2025 19:04:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=42577558</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=42577558</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42577558</guid></item><item><title><![CDATA[New comment by shawntan in "RWKV Language Model"]]></title><description><![CDATA[
<p>Although marketed as such, RWKV isn't really an RNN.<p>In the recent RWKV7 incarnation you could argue it's a type of linear RNN, but past versions took their previous state from a lower layer, which allows for parallelism but makes the computation closer to a convolution than to a recurrence.<p>As for 1), I'd like to believe so, but it's hard to get people away from the addictive drug that is the easily parallelised Transformer. For 2), (actual) RNNs and attention mechanisms seem to me fairly powerful (expressivity-wise) and perhaps the most acceptable to the community.</p>
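<p>To make the distinction concrete, here's a toy sketch (scalar states and an arbitrary tanh update; this is not RWKV's actual update rule) of a true recurrence versus the lower-layer state-passing described above:<p><pre><code>import numpy as np

T, L = 8, 3
x = np.random.randn(T)
f = lambda prev, inp: np.tanh(prev + inp)

# True RNN: h[l][t] = f(h[l][t-1], h[l-1][t]).
# Depending on the SAME layer's previous step forces a serial loop over t.
h = x.copy()
for l in range(L):
    state, out = 0.0, np.empty(T)
    for t in range(T):
        state = f(state, h[t])
        out[t] = state
    h = out

# Lower-layer state: g[l][t] = f(g[l-1][t-1], g[l-1][t]).
# Everything a layer needs is already computed, so t is fully parallel.
g = x.copy()
for l in range(L):
    prev = np.concatenate(([0.0], g[:-1]))  # lower layer, shifted one step
    g = f(prev, g)                          # vectorised over all t at once
</code></pre><p>After L layers the second variant's g[t] can only depend on x[t-L:t+1], a depth-bounded receptive field, which is the sense in which it is closer to a convolution than to a recurrence.</p>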
]]></description><pubDate>Thu, 02 Jan 2025 14:55:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=42574924</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=42574924</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42574924</guid></item><item><title><![CDATA[New comment by shawntan in "Chain of Thought empowers transformers to solve inherently serial problems"]]></title><description><![CDATA[
<p>> The actual result of the paper is that any poly-time computable function can be computed with poly-many tokens.<p>You're right.<p>Re: NAND of two inputs. Isn't this doable even by a single-layer (no hidden layers) neural network?<p>Re: polynomial-time computable functions. I'm assuming this makes no assumption of constant depth.<p>Because my entire point was that the result of this paper is not actually impressive AND is covered by a previous paper. Hopefully that's clearer.</p>
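<p>(On the NAND point: yes, it is linearly separable, so a single threshold unit with no hidden layer suffices. A quick check, with weights I picked by hand:)<p><pre><code>def nand_unit(x1: int, x2: int) -> int:
    # One weight per input plus a bias; no hidden layer anywhere.
    w1, w2, bias = -1.0, -1.0, 1.5
    return int(w1 * x1 + w2 * x2 + bias > 0)

for a in (0, 1):
    for b in (0, 1):
        assert nand_unit(a, b) == int(not (a and b))
print("a single-layer unit computes NAND on all four inputs")
</code></pre></p>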
]]></description><pubDate>Tue, 17 Sep 2024 03:43:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=41563892</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=41563892</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41563892</guid></item><item><title><![CDATA[New comment by shawntan in "Chain of Thought empowers transformers to solve inherently serial problems"]]></title><description><![CDATA[
<p>If a "problem we care about" is not stated as a formal language, does it mean it does not exist in the hierarchy of formal languages? Or is it just as yet unclassified?</p>
]]></description><pubDate>Tue, 17 Sep 2024 02:44:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=41563527</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=41563527</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41563527</guid></item><item><title><![CDATA[New comment by shawntan in "Chain of Thought empowers transformers to solve inherently serial problems"]]></title><description><![CDATA[
<p>Using CoT implicitly increases the depth of the circuit: each generated token is fed back as input, so emitting T intermediate tokens gives you T additional sequential passes through the network. But yes, poorly worded.</p>
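<p>A toy way to see it (nothing transformer-specific here, just an iterated constant-depth step standing in for one decoding pass):<p><pre><code>def step(state: int, token: int) -> int:
    # One decoding step: a constant-depth operation (here, a single XOR).
    return state ^ token

def run_with_cot(bits):
    state = 0
    for tok in bits:             # each iteration = one emitted token
        state = step(state, tok)
    return state

assert run_with_cot([1, 0, 1, 1]) == 1   # parity of the input bits
</code></pre><p>The per-step function is shallow and fixed, but chaining T emitted tokens yields an effective circuit whose depth grows with T.</p>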
]]></description><pubDate>Tue, 17 Sep 2024 02:35:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=41563474</link><dc:creator>shawntan</dc:creator><comments>https://news.ycombinator.com/item?id=41563474</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41563474</guid></item></channel></rss>