<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: psb217</title><link>https://news.ycombinator.com/user?id=psb217</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 05 Jun 2026 23:33:43 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=psb217" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by psb217 in "Natural Language Autoencoders: Turning Claude's Thoughts into Text"]]></title><description><![CDATA[
<p>It seems like they're doing RL to minimize the reconstruction error when going through the loop: activation -> encoder -> "verbal" description of activation -> decoder -> reconstructed activation. Depending on how aggressively they optimize the weights of the AV and AR, they could move well away from the initial base LLM and learn an arbitrary encoding scheme.<p>If the RL is brief and limited to a small subset of parameters, the AV will produce reasonable language since it inherits that from the base LLM, and it will produce descriptions aligned with the input to the base LLM that produced the autoencoded activations, since the AR is still close to the base LLM (and could reconstruct the activations perfectly if fed the full context which produced them).</p>
]]></description><pubDate>Fri, 08 May 2026 10:32:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=48061192</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=48061192</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48061192</guid></item><item><title><![CDATA[New comment by psb217 in "Anthropic's original take home assignment open sourced"]]></title><description><![CDATA[
<p>Yeah, I assume it was partly chosen since the problem structure provides some convenient hooks for selectively introducing subtle and less subtle inefficiencies in the baseline algorithm that match common optimization patterns.</p>
]]></description><pubDate>Wed, 21 Jan 2026 10:56:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=46703822</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=46703822</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46703822</guid></item><item><title><![CDATA[New comment by psb217 in "The Q, K, V Matrices"]]></title><description><![CDATA[
<p>Per your point 4, some current hyped work is pushing hard in this direction [1, 2, 3]. The basic idea is to think of attention as a way of implementing an associative memory. Variants like SDPA or gated linear attention can then be derived as methods for optimizing this memory online such that a particular query will return a particular value. Different attention variants correspond to different ways of defining how the memory produces a value in response to a query, and how we measure how well the produced value matches the desired value.<p>Some of the attention-like ops proposed in this new work are most simply described as implementing the associative memory with a hypernetwork that maps keys to values with weights that are optimized at test time to minimize value retrieval error. Like you suggest, designing these hypernetworks to permit efficient implementations is tricky.<p>It's a more constrained interpretation of attention than you're advocating for, since it follows the "attention as associative memory" perspective, but the general idea of test-time optimization could be applied to other mechanisms for letting information interact non-linearly across arbitrary nodes in the compute graph.<p>[1] <a href="https://arxiv.org/abs/2501.00663" rel="nofollow">https://arxiv.org/abs/2501.00663</a><p>[2] <a href="https://arxiv.org/abs/2504.13173" rel="nofollow">https://arxiv.org/abs/2504.13173</a><p>[3] <a href="https://arxiv.org/abs/2505.23735" rel="nofollow">https://arxiv.org/abs/2505.23735</a></p>
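<p>A minimal sketch of the "memory optimized online" idea, assuming the simplest possible case: a linear memory matrix M updated with one gradient step per key/value pair (the classic delta rule). This is an illustration only, not the method of any of the papers cited above; the names delta_rule_update, keys, and values are hypothetical, and the keys are made orthonormal so that retrieval is exact.</p>

```python
import numpy as np

def delta_rule_update(M, k, v, lr=1.0):
    """One gradient step on 0.5 * ||M @ k - v||^2 with respect to M."""
    error = M @ k - v
    return M - lr * np.outer(error, k)

rng = np.random.default_rng(0)
d_k, d_v, n = 8, 4, 3

# Assumption for the demo: orthonormal keys, so writes do not interfere.
keys = np.linalg.qr(rng.normal(size=(d_k, n)))[0].T  # shape (n, d_k)
values = rng.normal(size=(n, d_v))

# Write each (key, value) pair into the memory, one online step at a time.
M = np.zeros((d_v, d_k))
for k, v in zip(keys, values):
    M = delta_rule_update(M, k, v)

# Querying with a stored key returns the value written for it.
retrieved = M @ keys[1]
```

<p>Swapping the squared error for a different retrieval loss, or the linear map for a small network, gives other members of the same family; with non-orthogonal keys the retrieval is only approximate.</p>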
]]></description><pubDate>Thu, 08 Jan 2026 10:48:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=46539558</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=46539558</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46539558</guid></item><item><title><![CDATA[New comment by psb217 in "DeepSeek OCR"]]></title><description><![CDATA[
<p>Yes, you can get good compression of a long sequence of "base" text tokens into a shorter sequence of "meta" text tokens, where each meta token represents the information from multiple base tokens. But, grouping a fixed number of base tokens into each meta token isn't ideal, since that won't align neatly with sensible semantic boundaries, like words, phrases, sentences, etc. So, the trick is how to decide which base tokens should be grouped into each meta token...<p>This sort of "dynamic chunking" of low-level information, perhaps down to the level of raw bytes, into shorter sequences of meta tokens for input to some big sequence processing model is an active area of research. Eg, one neat paper exploring this direction is: "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling" [1], from one of the main guys behind Mamba and other major advances in state-space models.<p>[1] - <a href="https://arxiv.org/abs/2507.07955" rel="nofollow">https://arxiv.org/abs/2507.07955</a></p>
]]></description><pubDate>Wed, 22 Oct 2025 09:34:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=45666709</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=45666709</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45666709</guid></item><item><title><![CDATA[New comment by psb217 in "DeepSeek OCR"]]></title><description><![CDATA[
<p>The trick is that the vision tokens are continuous valued vectors, while the text tokens are elements from a small discrete set (which are converted into continuous valued vectors by a lookup table). So, vision tokens can convey significantly more bits per token than text tokens. This allows them to pack the content of multiple text tokens into a single vision token.</p>
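<p>A back-of-the-envelope sketch of the capacity gap, under assumed numbers: the vocabulary size (65,536), embedding width (1,024), and 16-bit storage below are hypothetical and not taken from the DeepSeek OCR paper. The vector figure is a loose ceiling set by storage precision, not the amount of information a trained model actually packs in.</p>

```python
import math

def text_token_bits(vocab_size: int) -> float:
    """Upper bound on bits carried by one token drawn from a discrete vocabulary."""
    return math.log2(vocab_size)

def vector_token_bits(dim: int, bits_per_dim: int) -> int:
    """Crude ceiling for a continuous vector stored at finite precision."""
    return dim * bits_per_dim

# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
text_bits = text_token_bits(65536)
vision_bits = vector_token_bits(1024, 16)
ratio = vision_bits / text_bits

print(f"text token: at most {text_bits:.0f} bits")
print(f"vector token: at most {vision_bits} bits")
print(f"ceiling ratio: {ratio:.0f}x")
```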
]]></description><pubDate>Mon, 20 Oct 2025 08:24:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=45641253</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=45641253</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45641253</guid></item><item><title><![CDATA[New comment by psb217 in "The maths you need to start understanding LLMs"]]></title><description><![CDATA[
<p>That past work will pay off even more when you start looking into diffusion and flow-based models for generating images, videos, and sometimes text.</p>
]]></description><pubDate>Sat, 06 Sep 2025 17:09:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=45151021</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=45151021</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45151021</guid></item><item><title><![CDATA[New comment by psb217 in "Fei-Fei Li: Spatial intelligence is the next frontier in AI [video]"]]></title><description><![CDATA[
<p>I think there's an implicit assumption here that interaction with the world is critical for effective learning. In that case, you're bottlenecked by the speed of the world... when learning with a single agent. One neat thing about artificial computational agents, in contrast to natural biological agents, is that they can share the same brain and share lived experience, so the "speed of reality" bottleneck is much less of an issue.</p>
]]></description><pubDate>Thu, 03 Jul 2025 09:07:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=44453087</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=44453087</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44453087</guid></item><item><title><![CDATA[New comment by psb217 in "The Death of the Middle-Class Musician"]]></title><description><![CDATA[
<p>But, if empirically our current system for net wealth creation tends to also produce wealth concentration, it makes sense to consider ways of modifying the system to mitigate some of the wealth concentration while maintaining as much of the wealth creation as possible.</p>
]]></description><pubDate>Sun, 29 Jun 2025 09:36:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=44411625</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=44411625</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44411625</guid></item><item><title><![CDATA[New comment by psb217 in "Sam Altman says Meta offered OpenAI staffers $100M bonuses"]]></title><description><![CDATA[
<p>Most of the people pursued in these "AI talent wars" are folks deeply involved in training or developing infrastructure for training LLMs at whatever level is currently state-of-the-art. Due to the resources required for projects that can provide this sort of experience, the pool of folks with this experience is limited to those with significant clout in orgs with money to burn on LLM projects. These people are expensive to hire, and can kind of run through a loop of jumping from company to company in an upward compensation spiral.<p>Ie, the skills aren't particularly complicated in principle, but the conditions needed to acquire them aren't widely available, so the pool of people with the skills is limited.</p>
]]></description><pubDate>Wed, 18 Jun 2025 17:09:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=44311603</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=44311603</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44311603</guid></item><item><title><![CDATA[New comment by psb217 in "Meta invests $14.3B in Scale AI to kick-start superintelligence lab"]]></title><description><![CDATA[
<p>Comparing the process of research to tending a garden or raising children is fairly common. This is an iteration on that theme. One thing I find interesting about this analogy is that there's a strong sense of the model's autoregressiveness here in that the model commits early to the gardening analogy and then finds a way to make it work (more or less).<p>The sorts of useful analogies I was mostly talking about are those that appear in scientific research involving actionable technical details. Eg, diffusion models came about when folks with a background in statistical physics saw some connections between the math for variational autoencoders and the math for non-equilibrium thermodynamics. Guided by this connection, they decided to train models to generate data by learning to invert a diffusion process that gradually transforms complexly structured data into a much simpler distribution -- in this case, a basic multidimensional Gaussian.<p>I feel like these sorts of technical analogies are harder to stumble on than more common "linguistic" analogies. The latter can be useful tools for thinking, but tend to require some post-hoc interpretation and hand waving before they produce any actionable insight. The former are more direct bridges between domains that allow direct transfer of knowledge about one class of problems to another.</p>
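<p>For concreteness, a minimal sketch of the forward (noising) half of that diffusion process, assuming a variance-preserving update with a constant noise level. The values of beta, the step count, and the toy bimodal data are all hypothetical. The generative model would then be trained to invert these steps, which is not shown here.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "structured" data: two tight clusters around -2 and +2.
x = rng.choice([-2.0, 2.0], size=10_000) + 0.1 * rng.normal(size=10_000)

beta = 0.02  # assumed constant per-step noise level
for _ in range(1000):
    noise = rng.normal(size=x.shape)
    # Shrink the signal slightly and add a little Gaussian noise.
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise

# After many steps the samples are close to a standard Gaussian.
final_mean, final_var = x.mean(), x.var()
```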
]]></description><pubDate>Sat, 14 Jun 2025 01:20:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=44273567</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=44273567</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44273567</guid></item><item><title><![CDATA[New comment by psb217 in "Meta Invests $14.3B in Scale AI to Kick-Start Superintelligence Lab"]]></title><description><![CDATA[
<p>I think you misunderstood what I meant about setting a high bar. First, passing the bar is a necessary but not sufficient condition for superintelligence. Secondly, by "fair for" I meant it's fair to set a high bar, not that this particular bar is the one fair bar for measuring intelligence. It's obvious that the usefulness of an analogy generator is a matter of degree. Eg, a uniform random string generator is guaranteed to produce all possible insightful analogies, but would not be considered useful or intelligent.<p>I think you're basically agreeing with me. Ie, current models are not superintelligent. Even though they can "think" super fast, they don't pass a minimum bar of producing novel and useful connections between domains without significant human intervention. And, our evaluation of their abilities is clouded by the way in which their intelligence differs from our own.</p>
]]></description><pubDate>Fri, 13 Jun 2025 19:30:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=44271498</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=44271498</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44271498</guid></item><item><title><![CDATA[New comment by psb217 in "Meta invests $14.3B in Scale AI to kick-start superintelligence lab"]]></title><description><![CDATA[
<p>I don't think current models are capable of making abstract links across domains. They can latch onto superficial similarities, but I have yet to see an instance of a model making an unexpected and useful analogy. It's a high bar, but I think that's fair for declaring superintelligence.<p>In general, I agree that these models are in some sense extremely knowledgeable, which suggests they are ripe for producing productive analogies if only we can figure out what they're missing compared to human-style thinking. Part of what makes it difficult to evaluate the abilities of these models is that they are wildly superhuman in some ways and quite dumb in others.</p>
]]></description><pubDate>Fri, 13 Jun 2025 17:44:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=44270551</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=44270551</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44270551</guid></item><item><title><![CDATA[New comment by psb217 in "Meta invests $14.3B in Scale AI to kick-start superintelligence lab"]]></title><description><![CDATA[
<p>I'd say superintelligence is more about producing deeper insight, making more abstract links across domains, and advancing the frontiers of knowledge than about doing stuff faster. Thinking speed correlates with intelligence to some extent, but at the higher end the distinction between speed and quality becomes clear.</p>
]]></description><pubDate>Fri, 13 Jun 2025 16:02:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=44269748</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=44269748</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44269748</guid></item><item><title><![CDATA[New comment by psb217 in "Why Bell Labs Worked"]]></title><description><![CDATA[
<p>You wouldn't get 5 years to noodle -- maybe 1 or 2 at best. You're competing for your next thing against other smart folks who are going hard on maximizing publication rate and grant winning in their current thing. To continue with your riskier, bigger thinking you'd have to be ready to bet that: (i) you'll produce a highly impactful result before you start applying for your next thing and (ii) the high impactfulness of that result will be recognized in time to support your applications.<p>The most successful folks tend to mix talent and hard work with a bit of luck in terms of early gold striking to gain a quick boost of credibility that helps them draw other people into their fold (eg, grad students in a big lab) who can handle a lot of the metric maxxing to free up some (still not enough) time for more ambitious thinking.</p>
]]></description><pubDate>Fri, 06 Jun 2025 17:31:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=44203159</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=44203159</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44203159</guid></item><item><title><![CDATA[New comment by psb217 in "ReasoningGym: Reasoning Environments for RL with Verifiable Rewards"]]></title><description><![CDATA[
<p>One challenge with this line of argument is that the base model assigns non-zero probability to all possible sequences if we ignore truncation due to numerical precision. So, in a sense you could say any performance improvement is due to shifting probability mass towards good reasoning behaviors and away from bad ones that were already present in the base model.<p>I agree with your general point though. Ie, we need more thorough empirical investigation of how reasoning behavior evolves during RL training starting from the base model. And, current RL training results seem more like "amplifying existing good behavior" than "inducing emergent good behavior".</p>
]]></description><pubDate>Mon, 02 Jun 2025 18:55:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=44161860</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=44161860</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44161860</guid></item><item><title><![CDATA[New comment by psb217 in "Why walking is the most underrated form of exercise (2017)"]]></title><description><![CDATA[
<p>Yeah. It's easy to get over 3000 total daily calories if you have, eg, an hour of cycle commute per day and then add some purposeful gym or running on top.</p>
]]></description><pubDate>Wed, 21 May 2025 19:16:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=44055136</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=44055136</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44055136</guid></item><item><title><![CDATA[New comment by psb217 in "Why walking is the most underrated form of exercise (2017)"]]></title><description><![CDATA[
<p>The best way to hit 3000 is cycling. A reasonably fit (70kg-100kg) cyclist should burn 600-800 cal/hr riding at a moderate pace, so 3000 is a 4-5hr ride. It wouldn't be unusual for an enthusiastic amateur cyclist to hit that 1-2x/week.</p>
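<p>The arithmetic behind that estimate, as a quick sketch (the function name ride_hours is just for illustration):</p>

```python
def ride_hours(target_cal: float, burn_cal_per_hr: float) -> float:
    """Hours of riding needed to burn target_cal at a steady burn rate."""
    return target_cal / burn_cal_per_hr

longest = ride_hours(3000, 600)   # slower end of the quoted burn range
shortest = ride_hours(3000, 800)  # faster end of the quoted burn range
print(f"{shortest:.2f} to {longest:.2f} hours")
```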
]]></description><pubDate>Wed, 21 May 2025 15:33:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=44052596</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=44052596</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44052596</guid></item><item><title><![CDATA[New comment by psb217 in "How linear regression works intuitively and how it leads to gradient descent"]]></title><description><![CDATA[
<p>To be fair, the "trick" part of the kernel trick involves implicitly transforming the data into a higher dimensional space and then fitting a linear function in that space. Ie, you're transforming the inputs so that a linear function from inputs to outputs fits better than if you didn't do the transform.<p>The "trick" allows you to fit a linear function in that higher dimensional space without any potentially costly explicit computation in the higher dimensional space, based on the observation (the representer theorem) that the optimal solution's parameters can be represented as a sum of the higher dimensional representations of points in the training set.</p>
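<p>A small sketch of that equivalence, assuming ridge regression with a degree-2 polynomial kernel on 1-D inputs (the data, the regularization strength lam, and the function names are hypothetical). The kernel route only ever evaluates k(x, z); the explicit route builds the higher dimensional features directly. Both give the same predictions.</p>

```python
import numpy as np

def poly_kernel(a, b):
    """Degree-2 polynomial kernel k(x, z) = (1 + x*z)^2 for 1-D inputs."""
    return (1.0 + np.outer(a, b)) ** 2

def explicit_features(x):
    """Feature map phi with phi(x) . phi(z) = (1 + x*z)^2."""
    return np.stack([np.ones_like(x), np.sqrt(2.0) * x, x ** 2], axis=1)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=20)
y_train = x_train ** 2 + 0.1 * rng.normal(size=20)
x_test = np.linspace(-1.0, 1.0, 5)
lam = 0.1  # ridge regularization strength (assumed)

# Kernel route: one coefficient per training point, no explicit features.
K = poly_kernel(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)
pred_kernel = poly_kernel(x_test, x_train) @ alpha

# Explicit route: ordinary ridge regression in the 3-D feature space.
Phi = explicit_features(x_train)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y_train)
pred_explicit = explicit_features(x_test) @ w

# The weights are a sum of training-point features, weighted by alpha.
w_from_alpha = Phi.T @ alpha
```

<p>Here the feature space is tiny, so the explicit route is cheap; the kernel route pays off when the feature space is very large or infinite-dimensional, as with an RBF kernel.</p>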
]]></description><pubDate>Thu, 08 May 2025 18:08:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=43929313</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=43929313</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43929313</guid></item><item><title><![CDATA[New comment by psb217 in "Does RL Incentivize Reasoning in LLMs Beyond the Base Model?"]]></title><description><![CDATA[
<p>Offhand, I don't know any specific examples for LLMs. In general though, if you google something like "automated curriculum design for reinforcement learning", you should find some relevant references.<p>Some straightforward scenarios are in, eg, robotics where one can design sequences of increasingly difficult instances of a task like moving objects from one storage bin to another. The basic idea is that the agent would have no reward or learning signal if it jumped straight into the full version of the task, so you let it develop competence on simpler variants and gradually increase difficulty until the agent can get useful learning signal on the full task.</p>
]]></description><pubDate>Tue, 22 Apr 2025 18:34:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=43764975</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=43764975</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43764975</guid></item><item><title><![CDATA[New comment by psb217 in "Does RL Incentivize Reasoning in LLMs Beyond the Base Model?"]]></title><description><![CDATA[
<p>That depends a bit on the length of the RL training and the distribution of problems you're training on. You're correct that RL won't get any "traction" (via positive rewards) on problems where good behavior isn't already in the model's behavior distribution.<p>However, if you're training on many problems, it's possible in principle that if you have traction on _any_ of the problems, then the learning signal you get from success on those problems will have a positive effect on the model's behavior on other problems. Ie, the learning that you do on problems where the model is already producing positive reward behavior will nudge the model towards producing positive reward behavior on problems where it wasn't previously doing so.</p>
]]></description><pubDate>Tue, 22 Apr 2025 14:36:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=43762725</link><dc:creator>psb217</dc:creator><comments>https://news.ycombinator.com/item?id=43762725</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43762725</guid></item></channel></rss>