I feel super confused about this paper. Apparently their training goal is for the model to ignore ...

PoignardAzur • 10/11/2024 • 1 reply • view on HN

I feel super confused about this paper.

Apparently their training goal is for the model to ignore all input values and output a constant. Sure.

But then they outline some kind of equation of when grokking will or won't happen, and... I don't get it?

For a goal that simple, won't any neural network with any amount of weight decay eventually converge to a stack of all-zeros matrices (plus a single bias)?

What is this paper even saying, on an empirical level?

Replies

whatshisface • 10/12/2024

The "neural network" they are using is linear: matrix * data + bias. It's expressing a decision plane. There are two senses in which it can learn the constant classification: by pushing the bias very far away and by contorting the matrix around to rotate all the training data to the same side of the decision plane. Pushing the bias outwards generalizes well to data outside the training set, but contorting the matrix (rotating the decision plane) doesn't.

They discover that the training process tends to "overfit" using the matrix when the data is too sparse to cover the origin in its convex hull, but tends to push the bias outwards when the training data surrounds the origin. It turns out that the probability of the convex hull problem occurring goes from 0 to 1 in a brief transition when the ratio of the number of data points to the number of dimensions crosses 1/2.

They then attempt to draw an analogy between that, and the tendency of sparsely trained NNs to overfit until they have a magic amount of data, at which point they spontaneously seem to "get" whatever it is they're being trained on, gaining the ability to generalize.

Their examples are likely the simplest models to exhibit a transition from overfitting to generalization when the amount of training data crosses a threshold, but it remains to be seen if they exhibit it for similar reasons to the big networks, and if so what the general theory would be. The paper is remarkable for using analytic tools to predict the result of training, normally only obtained through numerical experiments.

➕ show 2 replies

alt Hacker News

Replies