<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: tysam_and</title><link>https://news.ycombinator.com/user?id=tysam_and</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Tue, 07 Apr 2026 08:08:53 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=tysam_and" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by tysam_and in "Grokked Transformers Are Implicit Reasoners"]]></title><description><![CDATA[
<p>I sort of wish that we would move on from the "grokking" terminology in the way the field generally uses it (a magical kind of generalization that may or may not suddenly happen if you train for a really long time).<p>I generally regard grokking as a failure mode in a lot of cases -- it's oftentimes not really a good thing. It tends to indicate that the combination of your network, task, and data is poorly suited for learning {XYZ} thing. There are emergent traits which I think the network can learn in a healthy manner over training, and those tend to fall under the 'generalization' umbrella.<p>Though I'd strongly prefer to call it 'transitive' rather than 'compositional' generalization, as transitive is the formal term most disciplines use for such things; compositional has a different, more general meaning entirely. Similarly, I'd replace 'parametric' and 'non-parametric' with 'internal' and 'external', etc. Slogging through the definition salad (this paper alone takes up roughly half of the top Kagi hits for 'parametric memory') makes actually interpreting an argument more difficult.<p>One reinterpretation of the problem is -- of course external-memory models will have trouble generalizing to certain things in the way that models relying on internal memory do! This is because, in part, models with internal memory will have much more 'experience' integrating the examples that they've seen, whereas an external-memory model like a typical RAG setup encounters retrieved content comparatively fresh at inference time.<p>But, that being said, I don't think you can necessarily isolate this to the type of memory alone, i.e., I don't think you can clearly say, even in a direct comparison between the two motifs, that it's the kind of memory itself (internal vs. external) that is to blame. 
I think that might end up leading down some unfruitful research paths if so.<p>That said, one positive about this paper is that they seem to have found a general circuit that forms for their task, and they analyze it. I believe that has value, but (and I know I tend to be harsh on papers generally) the rest of the paper seems to be more of a distraction.<p>Definitional salad buffets and speculation about the 'in' topics are going to be the things that make the headlines, but in order to make real progress, focusing on the fundamentals is really what's necessary here, I think. They may seem 'boring' a lot of the time, but they've certainly helped me quite a bit in my research. <3 :'))))</p>
]]></description><pubDate>Tue, 28 May 2024 12:16:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=40499949</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=40499949</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40499949</guid></item><item><title><![CDATA[New comment by tysam_and in "I couldn't escape poison oak, so I started eating it"]]></title><description><![CDATA[
<p>Nope! Plenty of people get drowsiness from non-drowsy antihistamines. It is different for everyone (though, again, I am not a doctor!)</p>
]]></description><pubDate>Thu, 23 May 2024 18:29:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=40458137</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=40458137</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40458137</guid></item><item><title><![CDATA[New comment by tysam_and in "Llama3 implemented from scratch"]]></title><description><![CDATA[
<p>Heyo! Have been doing this for a while. SSMs certainly are flashy (most popular topics-of-the-year are), and it would be nice to see if they hit a point of competitive performance with transformers (and if they stand the test of time!)<p>There are certainly tradeoffs to both. The general transformer motif scales very well on a number of axes, so it may be the dominant algorithm for a while to come, though almost certainly it will change and evolve as time goes along (and who knows? something else may come along as well <3 :')))) ).</p>
]]></description><pubDate>Sun, 19 May 2024 21:33:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=40409913</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=40409913</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40409913</guid></item><item><title><![CDATA[New comment by tysam_and in "I couldn't escape poison oak, so I started eating it"]]></title><description><![CDATA[
<p>This is incorrect enough to be dangerous (IMPE, I am not a doctor). They are non-drowsy because they do not cross the blood-brain barrier effectively, as I understand it. Second- and third-generation antihistamines are fantastic.</p>
]]></description><pubDate>Sun, 19 May 2024 15:50:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=40407697</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=40407697</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40407697</guid></item><item><title><![CDATA[New comment by tysam_and in "Show HN: Mazelit - My wife and I released our first game"]]></title><description><![CDATA[
<p>Or even $7.99; that or $8.99 is sort of a nice line between signaling "very cheap game" and "potentially short but enjoyable experience for the evening, worth the gamble to find out if so".<p>I can't speak to it in general, that's just my 2c on the matter without really knowing too terribly much about the game (or game development in general, really, I'm just a consumer here! XD). <3 :'))))</p>
]]></description><pubDate>Fri, 12 Apr 2024 22:40:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=40018409</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=40018409</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40018409</guid></item><item><title><![CDATA[New comment by tysam_and in "Schedule-Free Learning – A New Way to Train"]]></title><description><![CDATA[
<p>yeah it's been crazy to see how things have changed and i'm really glad that there's still interest in optimizing things for these benchmarks. ;P keller's pretty meticulous and has put in a lot of work for this from what i understand. i'm not sure where david's code came from originally, but it definitely impacted my code as i referenced it heavily when writing mine, and keller rewrote a lot of my code with his style + the improvements that he made in turn. hopefully the pedigree of minimal code can continue as a tradition, it really has a surprising impact<p>96 legitimately is pretty hard, i struggled doing it even in 2 minutes, so seeing it in 45 seconds is crazy. it definitely gets exponentially harder for every fraction of a percent, so i think that's a pretty big achievement to hit :D</p>
]]></description><pubDate>Sun, 07 Apr 2024 05:44:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=39958422</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39958422</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39958422</guid></item><item><title><![CDATA[New comment by tysam_and in "Schedule-Free Learning – A New Way to Train"]]></title><description><![CDATA[
<p>Yeah, I saw the work from @Sree_Harsha_N, though that accuracy plot on the Adam/SGD side of things is very untuned; it was about what one could expect from an afternoon of working with it, but as far as baselines go, most people in the weeds with optimizers would recognize that it's pretty not-good for comparison (not to dump on the reproduction efforts).<p>Hence I think it might be hard to accurately compare them; likely SGD and Adam/AdamW are going to have better potential top ends but are going to get more thrashed in public comparisons vs an optimizer that seems to perform more flatly overall. Aaron works at FAIR so I am assuming that he knows this. I reached out with some concerns on my end a little bit before he published the optimizer but didn't hear back, unfortunately.</p>
]]></description><pubDate>Sun, 07 Apr 2024 03:25:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=39957939</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39957939</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39957939</guid></item><item><title><![CDATA[New comment by tysam_and in "Schedule-Free Learning – A New Way to Train"]]></title><description><![CDATA[
<p>hey, don't forget about david, me, and keller (he is currently the champ and has good pareto configs for not just 94 but also 95 and 96%: <a href="https://github.com/KellerJordan/cifar10-airbench">https://github.com/KellerJordan/cifar10-airbench</a>)</p>
]]></description><pubDate>Sun, 07 Apr 2024 02:57:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=39957785</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39957785</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39957785</guid></item><item><title><![CDATA[New comment by tysam_and in "Schedule-Free Learning – A New Way to Train"]]></title><description><![CDATA[
<p>This is a pretty hyped-up optimizer that seems to have okay-ish performance in practice, but there are a number of major red flags here. For one, the baselines are decently sandbagged, but the twitter posts sharing them (which are pretty hype-y) directly say that the baselines are "highly tuned" and that there's no benchmark trickery (which is flat-out wrong). If someone has not had experience with said benchmarks, it is a plausible statement; having worked with some of these datasets very closely, I can say that some of the baselines are simply terrible, and I don't know where they came from.<p>Additionally, the optimizer does actually appear to have a kind of momentum, despite claims directly saying the contrary, and uses it with a nesterov-like step (line 2 of 3 in the inner loop). Finally, it is 'schedule-free' because the schedule is actually hardcoded into the algorithm itself -- 1./steps_taken, which is hardly a rare learning-rate schedule. This is a decently robust but sometimes suboptimal schedule, and I find it sketchy to claim that it is 'schedule-free'. This also cripples the optimizer by tying performance to the number of steps taken -- which is potentially a problem if you are using any batchsize+lr scaling strategies, as I understand.<p>There is a mixture of hype and substance here, and I wish the author was more straightforward with their approach and claims. I think there is the potential for a good "bolts-included" optimizer with some of the ideas being presented here, but the amount of overhyping and deception makes me not want to trust any of the follow-up work.<p>Unfortunately, hype is what sells best on Twitter, and some of the claims being made here appear to be at the very best deceptive, and at the very worst, untrue. I could be wrong -- these are just my personal opinions from my own experience, but I do occasionally find myself distraught about the things that tend to catch wind in the technical news cycle.<p>-Fern</p>
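<p>To make the structural critique concrete, here is a rough sketch (my own variable names and toy setup, not the authors' code) of the update pattern I'm describing -- note the interpolation acting like a nesterov-style momentum step, and the hardcoded 1/steps_taken averaging weight:

```python
import numpy as np

def schedule_free_sgd(grad_fn, x0, lr=0.1, beta=0.9, steps=100):
    """Sketch of the schedule-free update as I read it (names are mine)."""
    z = x0.copy()  # fast iterate sequence
    x = x0.copy()  # averaged sequence (what you actually evaluate)
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x  # momentum-like interpolation; gradient taken here
        z = z - lr * grad_fn(y)
        c = 1.0 / t                    # the baked-in 1/steps_taken "schedule"
        x = (1 - c) * x + c * z        # running average of the z iterates
    return x

# toy usage: minimize f(w) = ||w||^2, whose gradient is 2w
w = schedule_free_sgd(lambda w: 2 * w, np.array([3.0, -2.0]))
```

Because c depends on t, the trajectory depends on the total step count -- which is the coupling I mean when I say performance is tied to the number of steps taken.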
]]></description><pubDate>Sat, 06 Apr 2024 20:16:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=39955126</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39955126</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39955126</guid></item><item><title><![CDATA[New comment by tysam_and in "Vesuvius Challenge 2023 Grand Prize awarded: we can read the first scroll"]]></title><description><![CDATA[
<p>Funding is a huge one as well. Funding is the wheel that drives the project (source: I have been hanging around the project people for a little while).<p>If you know anyone that would help chip in for Phase 2 of the project (scaling up), please let Nat know! (I'm not directly affiliated with the project management team, just pointing to him as a great contact for that.... <3 :')))) )</p>
]]></description><pubDate>Tue, 06 Feb 2024 05:49:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=39271221</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39271221</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39271221</guid></item><item><title><![CDATA[New comment by tysam_and in "Beyond self-attention: How a small language model predicts the next token"]]></title><description><![CDATA[
<p>I get the feeling you may not have read the paper as closely as you could have! Section 8 followed by Section 2 may look a tiny bit different if you consider it from this particular perspective.... ;)</p>
]]></description><pubDate>Tue, 06 Feb 2024 00:38:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=39269257</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39269257</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39269257</guid></item><item><title><![CDATA[New comment by tysam_and in "Beyond self-attention: How a small language model predicts the next token"]]></title><description><![CDATA[
<p>Yes! This is a consequence of empirical risk minimization via maximum likelihood estimation. To have a model not reproduce the density of the data it trained on would be like trying to get a horse and buggy to work well at speed, "now just without the wheels this time". It would generally not go all that well, I think! :'D</p>
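<p>A toy sketch of the point (my own construction, not from the article): the maximum-likelihood fit of a categorical model is exactly the empirical frequency of each symbol, so sampling from the "trained" model reproduces the training density by design.

```python
from collections import Counter
import random

data = list("aaabbc")
n = len(data)
# MLE for a categorical distribution: just the empirical frequencies
mle = {sym: count / n for sym, count in Counter(data).items()}

# sampling from the fitted model reproduces the training proportions
random.seed(0)
samples = random.choices(list(mle), weights=list(mle.values()), k=10_000)
```

Same idea at scale: a language model fit by maximum likelihood is doing this over sequences, so reproducing the training density is the goal, not a bug.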
]]></description><pubDate>Tue, 06 Feb 2024 00:34:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=39269231</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39269231</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39269231</guid></item><item><title><![CDATA[New comment by tysam_and in "Beyond self-attention: How a small language model predicts the next token"]]></title><description><![CDATA[
<p>I wish that this worked out in the long run! However, watching the field spin its wheels in the mud over and over with silly pet theories and local results makes it pretty clear that a lot of people are just chasing the butterfly, then after a few years grow disenchanted and sort of just give up.<p>The bridge comes when people connect concepts to those that are well known and well understood, and that is good. It is all well and good to say in theory that rediscovering things is bad -- it is not, necessarily! But when it becomes groundhog day for years on end without significant theoretical change, that is an indicator that something is amiss in how we learn and interpret information in the field.<p>Of course, this is just my crotchety young opinion coming up on 9 years in the field, so please take that with a grain of salt and all that.</p>
]]></description><pubDate>Tue, 06 Feb 2024 00:33:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=39269216</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39269216</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39269216</guid></item><item><title><![CDATA[New comment by tysam_and in "Beyond self-attention: How a small language model predicts the next token"]]></title><description><![CDATA[
<p>I appreciate the effort that went into this visualization; however, as someone who has worked with neural networks for 9 years, I found it far more confusing than helpful. I believe that was due to trying to present all items at once instead of deferring to abstract concepts, but I am not entirely sure. <3 :'))))</p>
]]></description><pubDate>Mon, 05 Feb 2024 01:57:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=39256447</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39256447</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39256447</guid></item><item><title><![CDATA[New comment by tysam_and in "Beyond self-attention: How a small language model predicts the next token"]]></title><description><![CDATA[
<p>Some of the topics in the parent post should not be a major surprise to anyone who has read <a href="https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf" rel="nofollow">https://people.math.harvard.edu/~ctm/home/text/others/shanno...</a> ! If we have not read the foundations of the field that we are in, we are doomed to be mystified by unexplained phenomena which arise pretty naturally as consequences of already-distilled work!<p>That said, the experiments seem very thorough on a first pass/initial cursory examination; I appreciate the amount of detail that seemed to go into them.<p>The tradeoff between learning existing theory and attempting to re-derive it from scratch is, I think, a hard one: not having the traditional foundation allows for the discovery of new things, but having it allows for a deeper understanding of certain phenomena. There is a tradeoff either way.<p>I've seen several people here in the comments seemingly shocked that a model that maximizes the log likelihood of a sequence given the data somehow does not magically deviate from that behavior when run in inference. It's a density estimation model; do you want it to magically recite Shakespeare from the void?<p>Please! Let's stick to the basics; it will help experiments like this make much more sense, as there already is a very clear mathematical foundation which explains them (and said emergent phenomena).<p>If you want more specifics, there are several layers; Shannon's treatment of ergodic systems is a good start (though there is some minor deviation from that here, it is likely a 'close enough' match to what's happening to be properly instructive to the reader about the general dynamics of what is going on, overall).</p>
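<p>For the curious, Shannon's n-gram "approximations" are easy to reproduce; here is a minimal character-bigram sampler in that spirit (the text and all names are my own toy choices):

```python
import random
from collections import defaultdict

text = "the theory of communication concerns the transmission of information"

# empirical next-character "distribution", stored as a bag of successors
model = defaultdict(list)
for a, b in zip(text, text[1:]):
    model[a].append(b)

# sample a chain: this reproduces the bigram statistics of the source,
# which is exactly what a density estimator is supposed to do
random.seed(1)
ch, out = "t", ["t"]
for _ in range(40):
    ch = random.choice(model[ch])
    out.append(ch)
generated = "".join(out)
```

The output is locally plausible gibberish, which is the instructive part: matching the local statistics of the training data is precisely what the model was asked to do.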
]]></description><pubDate>Mon, 05 Feb 2024 01:55:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=39256432</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39256432</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39256432</guid></item><item><title><![CDATA[New comment by tysam_and in "'Stupid,' 'shameful:' Tech workers on Y Combinator CEO Garry Tan's rant"]]></title><description><![CDATA[
<p>This message confused me on a few dimensions, so I translated it a bit:<p>"State subjective perspective as objective fact. Cast shame upon the OP for not pre-aligning with said belief. Put the responsibility on the OP to prove that they are not deserving of shame."<p>I grew up in an environment where this kind of communication was sort of the default, hence why I was curious and wanted to drill down a bit and give it some thought. Of course, many people agree that Twitter is more unhealthy than healthy. But that's not entirely the point here, I think.</p>
]]></description><pubDate>Fri, 02 Feb 2024 05:27:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=39225389</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39225389</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39225389</guid></item><item><title><![CDATA[New comment by tysam_and in "Self-Consuming Generative Models Go MAD"]]></title><description><![CDATA[
<p>This is, among other things, a very natural consequence of some of the equations surrounding Shannon's original noisy-channel capacity theorem, where the noise is (in many ways) conditioned upon the structure of the model itself.<p>It is not necessarily surprising from a purely high-level perspective, but I do think it is good to have the analysis. From a purely professional standpoint, I do not believe it is unique or distinctive enough as an individual method to need its own separate name for day-to-day use. From a personal perspective, however, I thought the mad cow disease reference was hilarious and applaud whoever came up with the acronym.<p>I find benefit in the analysis, and the concerns presented about generated data being present in the training data make sense to me (and if in sufficient quantity, it would make sense that this biases the models improperly in a rather significant way).<p>I particularly enjoyed the humor of this line, the tongue-in-cheek nature is very funny/nice to me here:<p>"Ascertaining whether an autophagous loop has gone MAD or not (recall Definition 2.1) requires that we measure how far the synthesized data distribution Gt has drifted from the true data distribution Pr over the generations t."<p>I like their use of color in the paper; I saw a similar orange/green color scheme earlier today and enjoyed it very much as an annotation method.<p>"A fixed real dataset only slows generative model degradation" is again also a natural consequence of Shannon's noisy-channel capacity theorem: one can say with nearly perfect certainty that a limited neural network will not be able to perfectly fit the distribution of the data that it is training on, thus it will have bias, variance, or some combination of both, limited ultimately by the model's capacity itself. This deviation, w.r.t. the original dataset, is noise, and we can choose between collapse or recursively encoding the noise patterns of the previous model (which might happen to have an additive effect, or maybe not! Who knows! I have not figured this one out myself yet).<p>W.r.t. the real data slowing down degradation: if we are sampling I.I.D., then of course we should still see some degradation in proportion, as this is the nature of empirical risk minimization via maximum likelihood estimation. It is still good that they have shown this, however, I think.<p>The fresh data loop, I believe, would actually be an example of a kind of noise in and of itself, w.r.t. the original input dataset, and as long as this 'noise' (from the perspective of the model) has a higher SNR than the (potentially slow) collapse of the model's output distribution, then it should (in some kind of proportion at least) be constantly playing 'keep-up' with the fresh data.<p>"First, we find that—regardless of the performance of early generations—the performance of later generations converges to a point that depends only on the amounts of real and synthetic data in the training loop." -- there we are (I saw this after making the SNR point; it makes sense within this framework of interpretation, then).<p>All in all, I found this paper very aware of itself and what it was studying; it was well laid out and accessible, and while the points are not necessarily earth-shattering (though I still have to read through some of it, I think), having clear empirical evidence about this phenomenon, detailing it, and cutting through the forest of (at-least-seemingly) untested battlegrounds is something that I appreciated.<p>Curious to hear what others think about this one. <3 :'))))</p>
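<p>The collapse dynamic shows up even in a one-dimensional toy (my own sketch, not the paper's setup): repeatedly fit a Gaussian to samples drawn from the previous generation's fit, and the estimated spread drifts toward collapse, while mixing fresh real data back in each generation holds it near the truth.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=1000)  # the "true" data, std = 1

def generations(mix_real, n_gen=300, n_samp=20):
    """Fit -> sample -> refit loop; optionally mix real data back in."""
    mu, sigma = real.mean(), real.std()
    for _ in range(n_gen):
        synth = rng.normal(mu, sigma, size=n_samp)
        if mix_real:
            synth = np.concatenate([synth, rng.choice(real, size=n_samp)])
        mu, sigma = synth.mean(), synth.std()
    return sigma

pure_loop = generations(mix_real=False)   # fully self-consuming loop
mixed_loop = generations(mix_real=True)   # fresh real data each generation
```

In the pure loop the fitted std random-walks downward (each finite-sample fit tends to under-represent the spread), while the mixed loop stays pinned near the real distribution -- the same qualitative story as the fixed-real-data result, in miniature.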
]]></description><pubDate>Thu, 18 Jan 2024 07:34:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=39039025</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39039025</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39039025</guid></item><item><title><![CDATA[New comment by tysam_and in "FAQ on Leaving Google"]]></title><description><![CDATA[
<p>Yes! Playing through the rote action exchange can be rather exhausting, especially if I've already bridged that connection and know the person -- there's not much reason for it!<p>Unfortunately, with where my past is, a whole lot of my family too has the idea that I'm living a distorted life, and that this needs to be corrected (almost as a first-priority thing). There's almost an Animal-Farm-istic "All sins are equal, but some sins are more equal than others" kind of thing going on there, if you catch what I mean.<p>Intellectually, I think many of them can understand how this is not really the most rational thing given the on-paper beliefs, but emotionally, it's a very different story, and the emotions seem to win out on that front.<p>Answering the basics isn't too terrible for me, though it definitely can be a problem if it's the only focus (and if the conversation inevitably keeps looping around to that singular topic. I am a freaking human being, darnitall!!)</p>
]]></description><pubDate>Thu, 18 Jan 2024 07:13:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=39038904</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39038904</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39038904</guid></item><item><title><![CDATA[New comment by tysam_and in "FAQ on Leaving Google"]]></title><description><![CDATA[
<p>I mean, again, that's not really the point that I was making. I'm talking about the foundational emotional need of connection; not everyone connects well in that manner, and the quality of the response to the question doesn't always have a huge bearing on that.<p>Like, sure, sometimes it is good bonding, and sometimes it's not; it's very much context dependent.<p>If not being in the hotseat answering questions from family members is necessary for an amount of emotional security on the OP's part, then I would consider that to be a good strategy. It might not be what you would do in that scenario, which is okay, as you and the OP are different and might have different methods of addressing and meeting your respective emotional needs.</p>
]]></description><pubDate>Thu, 18 Jan 2024 01:13:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=39036169</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39036169</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39036169</guid></item><item><title><![CDATA[New comment by tysam_and in "FAQ on Leaving Google"]]></title><description><![CDATA[
<p>Well, they can find alternative methods then that are less frazzling; there are few things worse than not feeling seen because the interaction is only ever answering questions!<p>I know it can be good, but sometimes the questions can legitimately get in the way of connection and spending quality time, and not everyone wants to have the hard conversation while being in the hotseat (especially not over, and over, and over again. I am transgender, for example, and while having one mildly hostile family member would be somewhat of a problem, most of my extended family only wants to talk about that one thing with me, to the point where it effectively creates a wall. That at least is my experience of the issue; it's not quite the same, but I've definitely experienced the "questions dynamic" within other, much milder scenarios, and generally, IMPE, I really dislike it unless I'm actively getting something interesting out of it, which I'm oftentimes not! It can be very isolating, as far as my personal experience goes.)<p>So, not really a terrible solution, I think! <3 :'))))</p>
]]></description><pubDate>Wed, 17 Jan 2024 23:33:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=39035174</link><dc:creator>tysam_and</dc:creator><comments>https://news.ycombinator.com/item?id=39035174</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39035174</guid></item></channel></rss>