> It's expressing a decision plane. There are two senses in which it can learn the constant classification: by pushing the bias very far away and by contorting the matrix around to rotate all the training data to the same side of the decision plane.
You're saying that like you expect it to be intuitively understandable to the average HN audience. It really isn't.
What do you mean by "rotate all the training data to the same side of the decision plane"? As in, the S matrix rotates all input vectors to output vectors that are "on the right side of the plane"? That... doesn't make sense to me; as you point out, the network is linear, there's no ReLU, so the network isn't trying to get data on "the same side of the plane", it's trying to get data "on the plane". (And it's not rotating anything: it's a scalar product, not a matmul. S is one-dimensional.)
(Also, I think their target label is zero anyway, given how they're presenting their loss function?)
But in any case, linear transformation or not, I'd still expect the weight and bias matrices to converge to zero given any amount of weight decay whatsoever. That's the core insight of the original grokking papers: even once you've overfit, you can still generalize if you do weight decay.
It's weird that the article doesn't mention weight decay at all.
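To make the point concrete, this is roughly what I'd expect to see with decay turned on (PyTorch sketch; the input size, constant zero labels, and hyperparameters are all invented, not the article's setup, and I'm using the ln(1+exp(·)) loss on the assumption that's what they trained with):

```python
import torch

# Hypothetical re-creation: one linear layer trained to assign a constant
# (zero) label to random inputs, with decoupled weight decay via AdamW.
model = torch.nn.Linear(16, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=1e-2)

x = torch.randn(256, 16)  # made-up inputs
for step in range(2001):
    loss = torch.nn.functional.softplus(model(x)).mean()  # ln(1 + exp(.)) loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(step, loss.item(), model.weight.norm().item(), model.bias.item())

# With weight_decay > 0 I'd expect the weight norm and bias to stay bounded,
# settling where the decay gradient balances the loss gradient, rather than
# the bias running off toward -inf.
```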
The network is linear, but the loss is ln(1 + exp(x)), a soft ReLU, so minimizing it still pushes every training point toward one side of the plane rather than onto it.
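A quick sketch of why that matters (plain NumPy; x here stands in for the scalar pre-activation w·input + b, which is just my reading of their setup):

```python
import numpy as np

def softplus(x):
    # ln(1 + exp(x)): a smooth approximation of ReLU, used as the loss here
    return np.log1p(np.exp(x))

xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(softplus(xs))
# ~[4.5e-05, 0.313, 0.693, 1.313, 10.000]
# The loss keeps shrinking as x -> -inf and never reaches zero, so the
# training signal is "push every point far onto the negative side of the
# plane", not "put every point on the plane".
```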