Could you expand on grokking [1]? I superficially understand what it means, but it seems more important than the article conveys.
Particularly:
> Grokking can be understood as a phase transition during the training process. While grokking has been thought of as largely a phenomenon of relatively shallow models, grokking has been observed in deep neural networks and non-neural models and is the subject of active research.
Does that paper add more insights?
[1] https://en.wikipedia.org/wiki/Grokking_(machine_learning)
This is probably the most interesting (and insightful) paper on grokking I’ve read recently: https://arxiv.org/abs/2402.15555
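For anyone who wants to see the "phase transition" framing concretely, here is a minimal sketch of the kind of setup in which grokking is typically reported (small network, modular addition, full-batch gradient descent with weight decay, as in the original grokking experiments). All specifics here (modulus, hidden size, learning rate, step count) are illustrative choices, not taken from either linked source, and a short run like this only sets up the experiment; the delayed jump in test accuracy is reported to appear only after much longer training.

```python
import numpy as np

# Illustrative grokking-style setup: learn (a + b) mod p from one-hot
# inputs with a two-layer ReLU network. Weight decay is commonly cited
# as important for the delayed-generalization effect.
rng = np.random.default_rng(0)
p = 23                       # small modulus so the script runs quickly
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p

# One-hot encode the pair (a, b) as a 2p-dimensional input.
X = np.zeros((len(pairs), 2 * p))
X[np.arange(len(pairs)), pairs[:, 0]] = 1.0
X[np.arange(len(pairs)), p + pairs[:, 1]] = 1.0

# Random 50/50 train/test split over all p^2 pairs.
perm = rng.permutation(len(pairs))
half = len(pairs) // 2
tr, te = perm[:half], perm[half:]

hidden = 64
W1 = rng.normal(0, 0.5, (2 * p, hidden))
W2 = rng.normal(0, 0.5, (hidden, p))
lr, wd = 0.1, 1e-4           # hypothetical hyperparameters

def accuracy(idx):
    h = np.maximum(X[idx] @ W1, 0.0)          # ReLU hidden layer
    return float(np.mean((h @ W2).argmax(1) == labels[idx]))

for step in range(2000):
    # Forward pass and softmax cross-entropy gradient on the train split.
    h = np.maximum(X[tr] @ W1, 0.0)
    logits = h @ W2
    z = logits - logits.max(1, keepdims=True)  # stabilize the softmax
    probs = np.exp(z) / np.exp(z).sum(1, keepdims=True)
    grad = probs.copy()
    grad[np.arange(len(tr)), labels[tr]] -= 1.0
    grad /= len(tr)
    # Backprop through both layers, adding the weight-decay term.
    gW2 = h.T @ grad + wd * W2
    gh = grad @ W2.T
    gh[h <= 0] = 0.0
    gW1 = X[tr].T @ gh + wd * W1
    W1 -= lr * gW1
    W2 -= lr * gW2

# Tracking these two numbers over many more steps is how the train/test
# accuracy gap that defines grokking is usually plotted.
print(f"train acc {accuracy(tr):.2f}  test acc {accuracy(te):.2f}")
```

The point of the sketch is just that "phase transition" refers to the trajectory of the test curve: train accuracy saturates early, while test accuracy can stay near chance for a long plateau before jumping.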