I think this means that when training a cat detector it's better to have more bobcats and lynx and fewer dogs.
Grokking is fascinating! It seems tied to how neural networks hit critical points in generalization. Could this concept also enhance efficiency in models dealing with non-linearly separable data?
Grokking is so cool. What does it even mean that grokking exhibits similarities to criticality? As in, what are the philosophical ramifications of this?
Wow, fascinating stuff and "grokking" is news to me. Thanks for sharing! In typical HN fashion, I'd like to come in as an amateur and nitpick the terminology/philosophy choices of this nascent-yet-burgeoning subfield:
We begin by examining the optimal generalizing solution, that indicates the network has properly learned the task... the network should put all points in R^d on the same side of the separating hyperplane, or in other words, push the decision boundary to infinity... Overfitting occurs when the hyperplane is only far enough from the data to correctly classify all the training samples.
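To make sure I'm parsing that setup right, here's the toy picture I have in mind (my own notation and guesses, not necessarily the paper's exact formulation): a linear classifier on points that all carry the same label, trained with logistic loss.

```latex
% Toy version of the setup as I read it (my notation, not necessarily the paper's):
% n points x_i in R^d, every label equal to +1, linear model w.x + b, logistic loss
\mathcal{L}(w, b) \;=\; \frac{1}{n}\sum_{i=1}^{n} \log\!\left(1 + e^{-(w \cdot x_i + b)}\right)
% Generalizing way to drive the loss to 0:  w = 0,\ b \to +\infty
%   (every x in R^d gets w.x + b > 0, i.e. the decision boundary sits "at infinity")
% Overfitting way to drive the loss to 0:   \|w\| \to \infty along a direction with
%   w.x_i + b > 0 for the *training* points only (a hyperplane just past the samples)
```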
This setup is such a dumb idea at first glance; I'm so impressed that they pushed past that and used it for serious insights. It truly is a kind of atomic/fundamental/formalized/simplified way to explore overfitting on its own.

Ultimately their thesis, as I understand it from the top of page 5, is roughly these two steps (with some slight rewording):
[I.] We call a training set separable if there exists a vector [that divides the data, like a 2D vector from the origin dividing two sets of 2D points]... The training set is almost surely separable [when there are fewer than twice as many points as dimensions, and almost surely inseparable otherwise]...
Again, a dumb observation that's obvious in hindsight, which makes it all the more impressive that they found a use for it (I sanity-check it with a quick simulation below). This is how paradigm shifts happen! An alternate title for the paper could've been "A Vector Is All You Need (to understand grokking)". OK, but assuming I understood the setup right, here's the actual finding: [II.] [Given infinite training time,] the model will always overfit for separable training sets[, and] for inseparable training sets the model will always generalize perfectly. However, when the training set is on the verge of separability... dynamics may take arbitrarily long times to reach the generalizing solution [rather than overfitting].
**This is the underlying mechanism of grokking in this setting**.
Or, in other words from Appendix B: grokking occurs near critical points in which solutions exchange stability and dynamics are generically slow.
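Here's the quick sanity check I mentioned for claim [I.]. To be clear, this is entirely my own sketch (numpy/scipy, an LP feasibility test for "some vector puts every point on its positive side"), not the authors' code:

```python
# Sanity check of claim [I.] -- my own sketch, not the authors' code.
# "Separable" here means: some vector w satisfies w . x_i > 0 for every point x_i.
import numpy as np
from scipy.optimize import linprog

def is_separable(X):
    """LP feasibility: does some w satisfy X @ w >= 1? (Equivalent, up to scaling, to > 0.)"""
    n, d = X.shape
    res = linprog(
        c=np.zeros(d),               # no objective -- pure feasibility check
        A_ub=-X,                     # -X w <= -1  <=>  X w >= 1
        b_ub=-np.ones(n),
        bounds=[(None, None)] * d,   # w unconstrained
        method="highs",
    )
    return res.success

rng = np.random.default_rng(0)
d, trials = 50, 100
for ratio in (1.0, 1.5, 2.0, 2.5, 3.0):   # ratio = n_points / d
    n = int(ratio * d)
    hits = sum(is_separable(rng.standard_normal((n, d))) for _ in range(trials))
    print(f"n/d = {ratio:3.1f}: separable in {hits / trials:4.0%} of trials")
```

If I'm reading [I.] right, the separable fraction should fall from roughly 100% toward roughly 0% as n/d crosses 2, hovering near 50% right at the edge, which is exactly where [II.] says the interesting slow dynamics live.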
Assuming I understood that all correctly, this finally brings me to my philosophical critique of "grokking", which ends up being a compliment to this paper: grokking is just a modal transition in algorithmic structure, which is exactly why it's seemingly related to topics as diverse as physical phase changes and the sudden appearance of large language models. I don't blame the statisticians for not recognizing it, but IMO they're capturing something far more fundamental than a behavioral quirk in some mathematical tool.

Non-human animals (and maybe some really smart plants) obviously are capable of "learning" in some human-like way, but it rarely surpasses the basics of Pavlovian conditioning: they delineate quantitative objects in their perceptive field (as do unconscious particles when they mechanically interact with each other), computationally attach qualitative symbols to them based on experience (as do plants), and then calculate relations/groups of that data based on some evolutionarily tuned algorithms (again, a capability I believe to be unique to animals and weird plants). Humans, on the other hand, not only perform calculations about our immediate environment, but also freely engage in meta-calculations -- this is why our smartest primate relatives are still incapable of posing questions, yet humans pose them naturally from an extremely young age.
Details aside, my point is that different orders of cognition differ not just in some quantitative way, like an increase in linear efficiency, but rather in a qualitative--or, to use the hot lingo, emergent--way. In my non-credentialed opinion, this paper is a beautiful formalization of that phenomenon, even though it is necessarily stuck at the bottom of the stack, so to speak, describing the switch in cognitive capacity from direct quantification to symbolic qualification.
It's very possible I'm clouded by the need to confirm my priors, but if not, I hope this paper sees wide use among ML researchers as a clean, simplified exposition of what we're all really trying to do here on a fundamental level. A generalization of generalization, if you will!
Alon, Noam, and Yohai, if you're in here, congrats on devising such a dumb paper; it's all the more useful & insightful because of it. I'd love to hear your hot takes on the connections between grokking, cognition, and physics too, if you have any that didn't make the cut!
I feel super confused about this paper.
Apparently their training goal is for the model to ignore all input values and output a constant. Sure.
But then they outline some kind of equation for when grokking will or won't happen, and... I don't get it?
For a goal that simple, won't any neural network with any amount of weight decay eventually converge to a stack of all-zeros matrices (plus a single bias)?
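Concretely, I'm picturing something like this (my own toy example, nothing from the paper):

```python
# My own toy illustration of the "all-zeros weights plus a single bias" solution
# I mean (not anything from the paper): a two-layer net whose output ignores its input.
import numpy as np

d, h = 10, 32                      # input dim, hidden width (arbitrary)
W1 = np.zeros((h, d))              # suppose weight decay has driven every weight to 0
W2 = np.zeros((1, h))
b = 5.0                            # ...except one surviving output bias

def f(x):
    hidden = np.maximum(W1 @ x, 0)                  # ReLU of all zeros is all zeros
    return 1 / (1 + np.exp(-(W2 @ hidden + b)))     # sigmoid(5.0) ~ 0.993, for any x

for x in np.random.randn(3, d):
    print(f(x))                                     # identical output for every input
```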
What is this paper even saying, on an empirical level?