Hacker News: psb217

New comment by psb217 in "Munich 1991: The Roots of the Current AI Boom"

psb217 — Mon, 22 Jun 2026 11:06:21 +0000

"if you haven't read them you also shouldn't cite them" -- this is wildly incorrect in an academic context. If I'm using ResNets, I should cite the original ResNet paper, even if I haven't read it. If I'm using Transformers, I should cite the original Transformer paper, even if I haven't read it. If my work is a direct extension of method B, and method B is a direct extension of method A, I should cite the source of A, even if I haven't read it.

You can't claim independence from past work simply because you didn't look directly at it. The job of an academic researcher is to know the landscape of relevant ideas, where they come from, where they're going, and to hopefully contribute a few new good ones.

Citation chains should extend back from your work, along a reasonable line of conceptual inheritance, to a reasonable point of origin. Schmidhuber has different definitions for both of these reasonables than the bulk of the ML research community, to a point that makes him difficult to satisfy.

New comment by psb217 in "Natural Language Autoencoders: Turning Claude's Thoughts into Text"

psb217 — Fri, 08 May 2026 10:32:40 +0000

It seems like they're doing RL to minimize the reconstruction error when going through the loop: activation -> encoder -> "verbal" description of activation -> decoder -> reconstructed activation. Depending on how aggressively they optimize the weights of the AV and AR, they could move well away from the initial base LLM and learn an arbitrary encoding scheme.

If the RL is brief and limited to a small subset of parameters, the AV will produce reasonable language since it inherits that from the base LLM, and it will produce descriptions aligned with the input to the base LLM that produced the autoencoded activations, since the AR is still close to the base LLM (and could reconstruct the activations perfectly if fed the full context which produced them).

New comment by psb217 in "Anthropic's original take home assignment open sourced"

psb217 — Wed, 21 Jan 2026 10:56:09 +0000

Yeah, I assume it was partly chosen since the problem structure provides some convenient hooks for selectively introducing subtle and less subtle inefficiencies in the baseline algorithm that match common optimization patterns.

New comment by psb217 in "The Q, K, V Matrices"

psb217 — Thu, 08 Jan 2026 10:48:34 +0000

Per your point 4, some current hyped work is pushing hard in this direction [1, 2, 3]. The basic idea is to think of attention as a way of implementing an associative memory. Variants like SDPA or gated linear attention can then be derived as methods for optimizing this memory online such that a particular query will return a particular value. Different attention variants correspond to different ways of defining how the memory produces a value in response to a query, and how we measure how well the produced value matches the desired value.

Some of the attention-like ops proposed in this new work are most simply described as implementing the associative memory with a hypernetwork that maps keys to values with weights that are optimized at test time to minimize value retrieval error. Like you suggest, designing these hypernetworks to permit efficient implementations is tricky.

It's a more constrained interpretation of attention than you're advocating for, since it follows the "attention as associative memory" perspective, but the general idea of test-time optimization could be applied to other mechanisms for letting information interact non-linearly across arbitrary nodes in the compute graph.

[1] https://arxiv.org/abs/2501.00663

[2] https://arxiv.org/abs/2504.13173

[3] https://arxiv.org/abs/2505.23735
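
To make the "attention as associative memory" view concrete, here is a minimal sketch, assuming the simplest possible memory: a single matrix M read out as M @ query and trained online with a squared retrieval error. It is not taken from any of the linked papers, and the names (`matvec`, `write`, `retrieval_error`) are made up for illustration. Each write is one gradient step on the retrieval error for a (key, value) pair, which is the classic delta-rule update.

```python
def matvec(M, x):
    """Read from the memory: multiply matrix M (list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def write(M, key, value, lr=1.0):
    """One gradient step on 0.5 * ||M @ key - value||^2 with respect to M."""
    error = [p - v for p, v in zip(matvec(M, key), value)]
    return [[m - lr * e * kj for m, kj in zip(row, key)]
            for row, e in zip(M, error)]

def retrieval_error(M, key, value):
    """Squared distance between the retrieved value and the desired value."""
    return sum((p - v) ** 2 for p, v in zip(matvec(M, key), value))

keys = [[1.0, 0.0], [0.0, 1.0]]              # orthonormal keys: no interference
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

M = [[0.0, 0.0] for _ in range(3)]           # empty memory: 3-D values, 2-D keys
for k, v in zip(keys, values):
    M = write(M, k, v)                       # optimize the memory as pairs arrive

for k, v in zip(keys, values):
    print(matvec(M, k), retrieval_error(M, k, v))
```

With unit-norm, orthogonal keys and lr=1.0, each query returns its stored value exactly. Swapping in a different read-out function or a different error measure is where the different attention variants would come from in this framing.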

New comment by psb217 in "DeepSeek OCR"

psb217 — Wed, 22 Oct 2025 09:34:28 +0000

Yes, you can get good compression of a long sequence of "base" text tokens into a shorter sequence of "meta" text tokens, where each meta token represents the information from multiple base tokens. But, grouping a fixed number of base tokens into each meta token isn't ideal, since that won't align neatly with sensible semantic boundaries, like words, phrases, sentences, etc. So, the trick is how to decide which base tokens should be grouped into each meta token....

This sort of "dynamic chunking" of low-level information, perhaps down to the level of raw bytes, into shorter sequences of meta tokens for input to some big sequence processing model is an active area of research. Eg, one neat paper exploring this direction is: "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling" [1], from one of the main guys behind Mamba and other major advances in state-space models.

[1] - https://arxiv.org/abs/2507.07955
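
A toy contrast between fixed-size grouping and boundary-aware grouping, as a rough sketch only. The boundary rule here is a hand-written whitespace regex standing in for whatever a model would learn; it is not the method from the paper above, and both function names are hypothetical.

```python
import re

def fixed_chunks(text, size):
    """Group every `size` characters, ignoring word boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def boundary_chunks(text):
    """Group characters up to and including the whitespace after each word."""
    return re.findall(r"\S+\s*", text)

text = "dynamic chunking works"
print(fixed_chunks(text, 4))
print(boundary_chunks(text))
```

The fixed-size version splits words mid-way, while the boundary-aware version produces fewer chunks that line up with words.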

New comment by psb217 in "DeepSeek OCR"

psb217 — Mon, 20 Oct 2025 08:24:19 +0000

The trick is that the vision tokens are continuous valued vectors, while the text tokens are elements from a small discrete set (which are converted into continuous valued vectors by a lookup table). So, vision tokens can convey significantly more bits per token than text tokens. This allows them to pack the content of multiple text tokens into a single vision token.
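
A back-of-the-envelope version of that comparison, with made-up numbers: a vocabulary of 100,000 text tokens and a 1024-dimensional vision token stored as 16-bit floats. Neither figure comes from the DeepSeek OCR work. Both results are upper bounds on raw capacity, and the usable information in a learned vector will be far below the raw bit count.

```python
import math

def max_bits_discrete(vocab_size):
    """Upper bound on bits carried by one token drawn from a finite vocabulary."""
    return math.log2(vocab_size)

def max_bits_continuous(dim, bits_per_dim):
    """Raw storage capacity of one vector with `dim` entries at a fixed precision."""
    return dim * bits_per_dim

text_bits = max_bits_discrete(100_000)       # assumed vocabulary size
vision_bits = max_bits_continuous(1024, 16)  # assumed width and 16-bit floats
print(f"text token:   <= {text_bits:.1f} bits")
print(f"vision token: <= {vision_bits} bits of raw capacity")
```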

New comment by psb217 in "The maths you need to start understanding LLMs"

psb217 — Sat, 06 Sep 2025 17:09:54 +0000

That past work will pay off even more when you start looking into diffusion and flow-based models for generating images, videos, and sometimes text.

New comment by psb217 in "Fei-Fei Li: Spatial intelligence is the next frontier in AI [video]"

psb217 — Thu, 03 Jul 2025 09:07:50 +0000

I think there's an implicit assumption here that interaction with the world is critical for effective learning. In that case, you're bottlenecked by the speed of the world... when learning with a single agent. One neat thing about artificial computational agents, in contrast to natural biological agents, is that they can share the same brain and share lived experience, so the "speed of reality" bottleneck is much less of an issue.

New comment by psb217 in "The Death of the Middle-Class Musician"

psb217 — Sun, 29 Jun 2025 09:36:05 +0000

But, if empirically our current system for net wealth creation tends to also produce wealth concentration, it makes sense to consider ways of modifying the system to mitigate some of the wealth concentration while maintaining as much of the wealth creation as possible.

New comment by psb217 in "Sam Altman says Meta offered OpenAI staffers $100M bonuses"

psb217 — Wed, 18 Jun 2025 17:09:18 +0000

Most of the people pursued in these "AI talent wars" are folks deeply involved in training or developing infrastructure for training LLMs at whatever level is currently state-of-the-art. Due to the resources required for projects that can provide this sort of experience, the pool of folks with this experience is limited to those with significant clout in orgs with money to burn on LLM projects. These people are expensive to hire, and can kind of run through a loop of jumping from company to company in an upward compensation spiral.

Ie, the skills aren't particularly complicated in principle, but the conditions needed to acquire them aren't widely available, so the pool of people with the skills is limited.

New comment by psb217 in "Meta invests $14.3B in Scale AI to kick-start superintelligence lab"

psb217 — Sat, 14 Jun 2025 01:20:28 +0000

Comparing the process of research to tending a garden or raising children is fairly common. This is an iteration on that theme. One thing I find interesting about this analogy is that there's a strong sense of the model's autoregressiveness here in that the model commits early to the gardening analogy and then finds a way to make it work (more or less).

The sorts of useful analogies I was mostly talking about are those that appear in scientific research involving actionable technical details. Eg, diffusion models came about when folks with a background in statistical physics saw some connections between the math for variational autoencoders and the math for non-equilibrium thermodynamics. Guided by this connection, they decided to train models to generate data by learning to invert a diffusion process that gradually transforms complexly structured data into a much simpler distribution -- in this case, a basic multidimensional Gaussian.

I feel like these sorts of technical analogies are harder to stumble on than more common "linguistic" analogies. The latter can be useful tools for thinking, but tend to require some post-hoc interpretation and hand waving before they produce any actionable insight. The former are more direct bridges between domains that allow direct transfer of knowledge about one class of problems to another.
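
For the diffusion example, the "gradually transform structured data into a simple Gaussian" half can be sketched in a few lines. This is a minimal illustration, assuming a Gaussian noising process with a linear variance schedule (1000 steps, betas from 1e-4 to 0.02, values chosen for illustration) and toy 1-D data with two clusters. The learned model that inverts the process is not shown.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of signal variance left after t noising steps (linear beta schedule)."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def diffuse(x0, a_bar, rng):
    """Sample x_t given x_0 in closed form for a Gaussian diffusion."""
    return math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)

rng = random.Random(0)
data = [rng.choice([-2.0, 2.0]) for _ in range(5000)]  # toy two-cluster "data"

a_final = alpha_bar(1000)
noised = [diffuse(x, a_final, rng) for x in data]

mean = sum(noised) / len(noised)
var = sum((x - mean) ** 2 for x in noised) / len(noised)
print(f"alpha_bar at the last step: {a_final:.2e}")
print(f"noised mean={mean:.3f}, variance={var:.3f}")
```

By the last step almost none of the original signal is left, so the two clusters have been washed out into something close to a standard Gaussian.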

New comment by psb217 in "Meta Invests $14.3B in Scale AI to Kick-Start Superintelligence Lab"

psb217 — Fri, 13 Jun 2025 19:30:13 +0000

I think you misunderstood what I meant about setting a high bar. First, passing the bar is a necessary but not sufficient condition for superintelligence. Second, by "fair for" I meant it's fair to set a high bar, not that this particular bar is the one fair bar for measuring intelligence. It's obvious that the usefulness of an analogy generator is a matter of degree. Eg, a uniform random string generator is guaranteed to eventually produce all possible insightful analogies, but would not be considered useful or intelligent.

I think you're basically agreeing with me. Ie, current models are not superintelligent. Even though they can "think" super fast, they don't pass a minimum bar of producing novel and useful connections between domains without significant human intervention. And, our evaluation of their abilities is clouded by the way in which their intelligence differs from our own.

New comment by psb217 in "Meta invests $14.3B in Scale AI to kick-start superintelligence lab"

psb217 — Fri, 13 Jun 2025 17:44:27 +0000

I don't think current models are capable of making abstract links across domains. They can latch onto superficial similarities, but I have yet to see an instance of a model making an unexpected and useful analogy. It's a high bar, but I think that's fair for declaring superintelligence.

In general, I agree that these models are in some sense extremely knowledgeable, which suggests they are ripe for producing productive analogies if only we can figure out what they're missing compared to human-style thinking. Part of what makes it difficult to evaluate the abilities of these models is that they are wildly superhuman in some ways and quite dumb in others.

New comment by psb217 in "Meta invests $14.3B in Scale AI to kick-start superintelligence lab"

psb217 — Fri, 13 Jun 2025 16:02:33 +0000

I'd say superintelligence is more about producing deeper insight, making more abstract links across domains, and advancing the frontiers of knowledge than about doing stuff faster. Thinking speed correlates with intelligence to some extent, but at the higher end the distinction between speed and quality becomes clear.

New comment by psb217 in "Why Bell Labs Worked"

psb217 — Fri, 06 Jun 2025 17:31:18 +0000

You wouldn't get 5 years to noodle -- maybe 1 or 2 at best. You're competing for your next thing against other smart folks who are going hard on maximizing publication rate and grant winning in their current thing. To continue with your riskier, bigger thinking you'd have to be ready to bet that: (i) you'll produce a highly impactful result before you start applying for your next thing and (ii) the high impactfulness of that result will be recognized in time to support your applications.

The most successful folks tend to mix talent and hard work with a bit of luck in terms of early gold striking to gain a quick boost of credibility that helps them draw other people into their fold (eg, grad students in a big lab) who can handle a lot of the metric maxxing to free up some (still not enough) time for more ambitious thinking.

New comment by psb217 in "ReasoningGym: Reasoning Environments for RL with Verifiable Rewards"

psb217 — Mon, 02 Jun 2025 18:55:05 +0000

One challenge with this line of argument is that the base model assigns non-zero probability to all possible sequences if we ignore truncation due to numerical precision. So, in a sense you could say any performance improvement is due to shifting probability mass towards good reasoning behaviors and away from bad ones that were already present in the base model.

I agree with your general point though. Ie, we need more thorough empirical investigation of how reasoning behavior evolves during RL training starting from the base model. And, current RL training results seem more like "amplifying existing good behavior" than "inducing emergent good behavior".
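
A toy illustration of the "shifting probability mass" point, assuming three candidate behaviors instead of token sequences. The softmax gives every behavior non-zero probability, and reweighting by exp(reward / beta), which is the form of the optimum for KL-regularized RL, changes how much mass each behavior gets without changing which behaviors are possible. The logits and rewards are made up.

```python
import math

def softmax(logits):
    """Convert logits to probabilities; every entry is strictly positive."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tilt(base_probs, rewards, beta=1.0):
    """Reweight the base distribution by exp(reward / beta) and renormalize."""
    weights = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

base = softmax([5.0, 0.0, -5.0])              # third behavior is rare but possible
tuned = tilt(base, rewards=[0.0, 0.0, 10.0])  # reward only the rare behavior

print("base: ", [f"{p:.5f}" for p in base])
print("tuned:", [f"{p:.5f}" for p in tuned])
```

The rewarded behavior goes from rare to common, yet it was already "present" in the base distribution in the trivial sense that its probability was never zero.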

New comment by psb217 in "Why walking is the most underrated form of exercise (2017)"

psb217 — Wed, 21 May 2025 19:16:05 +0000

Yeah. It's easy to get over 3000 total daily calories if you have, eg, an hour of cycle commute per day and then add some purposeful gym or running on top.

New comment by psb217 in "Why walking is the most underrated form of exercise (2017)"

psb217 — Wed, 21 May 2025 15:33:44 +0000

The best way to hit 3000 is cycling. A reasonably fit (70kg-100kg) cyclist should burn 600-800 cal/hr riding at a moderate pace, so 3000 is a 4-5hr ride. It wouldn't be unusual for an enthusiastic amateur cyclist to hit that 1-2x/week.

New comment by psb217 in "How linear regression works intuitively and how it leads to gradient descent"

psb217 — Thu, 08 May 2025 18:08:42 +0000

To be fair, the "trick" part of the kernel trick involves implicitly transforming the data into a higher dimensional space and then fitting a linear function in that space. Ie, you're transforming the inputs so that a linear function from inputs to outputs fits better than if you didn't do the transform.

The "trick" allows you to fit a linear function in that higher dimensional space without any potentially costly explicit computation in the higher dimensional space based on the observation that the optimal solution's parameters can be represented as a sum of the higher dimensional representations of points in the training set.
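
A small sketch of both halves of that observation, assuming 2-D inputs and the degree-2 polynomial kernel k(x, z) = (x . z)^2, whose explicit feature map is 3-dimensional. The coefficients `alphas` are arbitrary numbers for illustration, not a fitted model. The point is only that predictions through the explicit weight vector and predictions through kernel evaluations agree.

```python
import math

def phi(x):
    """Explicit feature map for the kernel (x . z)^2 on 2-D inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def kernel(x, z):
    """Same inner product as dot(phi(x), phi(z)), computed in the input space."""
    return dot(x, z) ** 2

train = [(1.0, 2.0), (3.0, 4.0), (-1.0, 0.5)]
alphas = [0.5, -0.25, 2.0]  # arbitrary illustrative coefficients, not a fitted model
query = (2.0, -1.0)

# Explicit route: build w in the 3-D feature space, then take w . phi(query).
w = [sum(a * f for a, f in zip(alphas, col)) for col in zip(*map(phi, train))]
explicit = dot(w, phi(query))

# Kernel route: never form phi or w, only kernel evaluations against training points.
implicit = sum(a * kernel(x, query) for a, x in zip(alphas, train))

print(explicit, implicit)
```

For this kernel the explicit space is tiny, so nothing is saved. The savings show up when the feature space is very high dimensional, or infinite as with the RBF kernel, while the kernel itself stays cheap to evaluate.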

New comment by psb217 in "Does RL Incentivize Reasoning in LLMs Beyond the Base Model?"

psb217 — Tue, 22 Apr 2025 18:34:26 +0000

Offhand, I don't know any specific examples for LLMs. In general though, if you google something like "automated curriculum design for reinforcement learning", you should find some relevant references.

Some straightforward scenarios are in, eg, robotics where one can design sequences of increasingly difficult instances of a task like moving objects from one storage bin to another. The basic idea is that the agent would have no reward or learning signal if it jumped straight into the full version of the task, so you let it develop competence on simpler variants and gradually increase difficulty until the agent can get useful learning signal on the full task.
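
A deliberately tiny sketch of that idea, assuming a made-up agent that only learns from attempts that succeed and can only succeed one level beyond its current skill. The `ToyAgent` class and both training functions are hypothetical; real automated curriculum methods pick the next task in more sophisticated ways.

```python
class ToyAgent:
    """Hypothetical agent: it only learns from attempts that succeed."""

    def __init__(self):
        self.skill = 0

    def attempt(self, difficulty):
        # Success is possible only slightly beyond current competence.
        success = difficulty <= self.skill + 1
        if success:
            self.skill = max(self.skill, difficulty)  # reward gives a learning signal
        return success

def train_direct(agent, target, episodes):
    """Always attempt the full task; with no successes there is nothing to learn from."""
    for _ in range(episodes):
        agent.attempt(target)
    return agent.skill

def train_with_curriculum(agent, target, episodes):
    """Start easy and raise the difficulty each time the current level is solved."""
    difficulty = 1
    for _ in range(episodes):
        if agent.attempt(difficulty) and difficulty < target:
            difficulty += 1
    return agent.skill

direct_skill = train_direct(ToyAgent(), target=5, episodes=20)
curriculum_skill = train_with_curriculum(ToyAgent(), target=5, episodes=20)
print(direct_skill, curriculum_skill)
```

The agent thrown straight at the full task never gets any reward and stays at skill 0, while the curriculum-trained agent reaches the target difficulty.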