Grokking is fascinating! It seems tied to how neural networks hit critical points in generalization. Could this concept also enhance efficiency in models dealing with non-linearly separable data?
Could you expand on grokking [1]? I understand it superficially, but it seems more important than the article conveys.
Particularly:
> Grokking can be understood as a phase transition during the training process. While grokking has been thought of as largely a phenomenon of relatively shallow models, grokking has been observed in deep neural networks and non-neural models and is the subject of active research.
Does that paper add more insights?
[1] https://en.wikipedia.org/wiki/Grokking_(machine_learning)?wp...
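To make the question concrete: the canonical grokking demonstration is, roughly, training a small network on modular arithmetic with strong weight decay, where train accuracy saturates early and test accuracy only jumps much later. Below is a minimal sketch of that setup; the architecture and hyperparameters are my own illustrative guesses, not taken from the article or any particular paper, and whether the delayed jump actually appears depends on the train/test split, the weight decay, and how long you train.

```python
# Sketch of a grokking-style experiment: learn (a + b) mod P with a small MLP.
# Train accuracy usually saturates long before test accuracy rises;
# that late rise is the "phase transition" the quote refers to.
import torch
import torch.nn as nn

P = 97                       # modulus for (a + b) mod P (illustrative choice)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Enumerate all (a, b) pairs and hold out half of them for testing.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_idx, test_idx = perm[:split], perm[split:]

model = nn.Sequential(
    nn.Embedding(P, 128),    # shared embedding for both operands
    nn.Flatten(start_dim=1), # concatenate the two operand embeddings
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, P),
).to(DEVICE)

# Weight decay is the ingredient most commonly associated with grokking.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        logits = model(pairs[idx].to(DEVICE))
        return (logits.argmax(-1) == labels[idx].to(DEVICE)).float().mean().item()

for step in range(1, 20001):
    opt.zero_grad()
    logits = model(pairs[train_idx].to(DEVICE))
    loss = loss_fn(logits, labels[train_idx].to(DEVICE))
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(f"step {step:6d}  train acc {accuracy(train_idx):.3f}  "
              f"test acc {accuracy(test_idx):.3f}")
```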