The "neural network" they are using is linear: matrix * data + bias. It's expressing a decision plane. There are two senses in which it can learn the constant classification: by pushing the bias very far away and by contorting the matrix around to rotate all the training data to the same side of the decision plane. Pushing the bias outwards generalizes well to data outside the training set, but contorting the matrix (rotating the decision plane) doesn't.
They discover that the training process tends to "overfit" using the matrix when the data is too sparse for its convex hull to cover the origin, but tends to push the bias outwards when the training data surrounds the origin. It turns out that the probability of this convex-hull problem occurring goes from 0 to 1 in a brief transition as the ratio of dimensions to data points crosses 1/2, i.e. once there are fewer than about two data points per dimension.
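That threshold matches Wendel's classical result for origin-symmetric data. Here's a quick Monte Carlo sketch (my own, assuming Gaussian training points) that estimates how often the origin falls outside the convex hull by checking whether the points can be strictly separated from it:

```
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def origin_outside_hull(X):
    """True iff the origin is NOT in the convex hull of the rows of X,
    i.e. some w strictly separates every point from the origin (X @ w >= 1)."""
    n, d = X.shape
    res = linprog(c=np.zeros(d), A_ub=-X, b_ub=-np.ones(n), bounds=(None, None))
    return res.success

d, trials = 40, 200
for ratio in (1.0, 1.5, 2.0, 2.5, 3.0):   # n / d
    n = int(ratio * d)
    hits = sum(origin_outside_hull(rng.standard_normal((n, d))) for _ in range(trials))
    print(f"n/d = {ratio:.1f}: P(origin outside hull) ~ {hits / trials:.2f}")
# Expected: ~1 for n/d well below 2, ~0 well above 2, about 1/2 right at n/d = 2.
```

Strict separability from the origin is equivalent, for a finite point set, to the origin lying outside the convex hull, which is why the LP feasibility test works.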
They then attempt to draw an analogy between that and the tendency of sparsely trained NNs to overfit until they have a magic amount of data, at which point they spontaneously seem to "get" whatever it is they're being trained on, gaining the ability to generalize.
Their examples are likely the simplest models to exhibit a transition from overfitting to generalization when the amount of training data crosses a threshold, but it remains to be seen if they exhibit it for similar reasons to the big networks, and if so what the general theory would be. The paper is remarkable for using analytic tools to predict the result of training, normally only obtained through numerical experiments.
> It's expressing a decision plane. There are two senses in which it can learn the constant classification: by pushing the bias very far away and by contorting the matrix around to rotate all the training data to the same side of the decision plane.
You're saying that like you expect it to be intuitively understandable to the average HN audience. It really isn't.
What do you mean by "rotate all the training data to the same side of the decision plane"? As in, the S matrix rotates all input vectors to output vectors that are "on the right side of the plane"? That... doesn't make sense to me; as you point out, the network is linear, there's no ReLU, so the network isn't trying to get data on "the same side of the plane", it's trying to get data "on the plane". (And it's not rotating anything; it's a scalar product, not a matmul. S is one-dimensional.)
(Also, I think their target label is zero anyway, given how they're presenting their loss function?)
But in any case, linear transformation or not, I'd still expect the weight and bias matrices to converge to zero given any amount of weight decay whatsoever. That's the core insight of the original grokking papers: even once you've overfit, you can still generalize if you do weight decay.
It's weird that the article doesn't mention weight decay at all.
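For what it's worth, a minimal way to test that expectation (my own sketch, not the paper's setup or the grokking papers' exact protocol) is to train the linear model on all-positive labels with AdamW, decay the weights but not the bias, and watch ||w|| versus b:

```
import torch

torch.manual_seed(0)
d, n = 50, 10                  # sparse regime: far fewer points than dimensions
X = torch.randn(n, d)
y = torch.ones(n)              # every training point belongs to the positive class

w = torch.zeros(d, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
# Decay the weights but not the bias, mimicking the common practice of
# excluding biases from weight decay.
opt = torch.optim.AdamW([{"params": [w], "weight_decay": 1e-2},
                         {"params": [b], "weight_decay": 0.0}], lr=1e-2)

loss_fn = torch.nn.BCEWithLogitsLoss()
for step in range(10_001):
    opt.zero_grad()
    logits = X @ w + b         # the linear "network": w.x + b
    loss = loss_fn(logits, y)
    loss.backward()
    opt.step()
    if step % 2_000 == 0:
        # If decay pushes the model toward the bias solution, ||w|| should stay
        # small relative to b as training proceeds.
        print(f"step {step:6d}  loss {loss.item():.4f}  "
              f"|w| {w.norm().item():.3f}  b {b.item():.3f}")
```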
As the origin is special, instead of training in a linear space, what would training in an affine space do?