In another post (https://alexshtf.github.io/2025/03/27/Free-Poly.html) the author fits a degree-10000 (ten thousand!) polynomial using the Legendre basis. The polynomial _doesn't overfit_, demonstrating double descent. "What happened to our overfitting from ML 101 textbooks? There is no regularization. No control of the degree. But “magically” our high degree polynomial is not that bad!"
So... are _all_ introductions to machine learning just extremely wrong here? I feel like I've seen dozens of reputable books and courses that introduce overfitting and generalization using the severe overfitting and terrible generalization of high-degree polynomials in the usual basis (1, x, x^2, ...). Seemingly everyone warns of the dangers of high-degree polynomials, yet here the author just says "use another basis" and proves everyone wrong? Mind blown, or is there a catch?
https://arxiv.org/pdf/2503.02113
This paper shows that polynomials exhibit most features of deep neural nets, including double descent and the ability to memorize the entire dataset.
It connects the dots: the polynomials there are regularized to be as simple as possible, and the author argues that the hundreds of billions of parameters in modern neural networks act as regularizers too, attenuating decisions that are "too risky."
I really enjoyed that paper; it's a gem that sheds light everywhere.
The degree-10000 polynomial is definitely overfitting - every data point has its own little spike. The truncated curves look kind of nice, but the data points aren't shown on those plots, and the curves aren't very close to them.
There are also some enormous numbers hidden away here, with associated floating-point precision problems; the articles show that the coefficients are small, but that's because the polynomial basis functions themselves contain gigantic numbers. The Bernstein basis for degree 100 involves 100 choose 50, which is already > 10^29. You have to be careful calculating these polynomials or parts of your calculation will exceed 64-bit floating-point range, e.g. factorials of 10000, 2^10000. See the formulas and table in this section: https://en.wikipedia.org/wiki/Legendre_polynomials#Rodrigues...
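To put some numbers on that, here's a small sketch (my own, not taken from the article): the explicit formulas mix astronomically large binomial coefficients and factorials, while the standard three-term recurrence keeps every intermediate value tame on [-1, 1].

```python
import math
import numpy as np

# The explicit expansions multiply and divide huge integers: 100 choose 50
# alone is ~1.009e29, and the degree-10000 formulas involve factorial(10000).
print(math.comb(100, 50))               # 100891344545564193334812497256
# math.factorial(10000) / 2**10000      # raises OverflowError: doesn't fit in a float

# Three-term recurrence: (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x).
# On [-1, 1] every |P_n(x)| <= 1, so nothing blows up along the way.
def legendre_recurrence(x, degree):     # assumes degree >= 1
    p_prev, p_curr = np.ones_like(x), x
    for n in range(1, degree):
        p_prev, p_curr = p_curr, ((2 * n + 1) * x * p_curr - n * p_prev) / (n + 1)
    return p_curr

x = np.linspace(-1.0, 1.0, 5)
print(legendre_recurrence(x, 10000))    # stays within [-1, 1], even at degree 10000
```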
No. All the textbooks know that polynomials of high degree are numerically dangerous and you need to be careful when handling them.
The article's examples only work because the interval 0 to 1 (or -1 to 1) was chosen. For whatever reason the author does not point that out, or even acknowledge that had he chosen a larger interval, the limitations of floating-point arithmetic would have ruined the argument he was trying to make.
10^100 is a very large number and numerically difficult to handle. For whatever reason the author pretends this is not a valid reason to be cautious about high-degree polynomials.
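To illustrate (my own quick sketch, not the article's code): the Legendre values themselves stay bounded by 1 on [-1, 1] but explode as soon as you leave it, which is why rescaling the data into that interval is doing real work.

```python
import numpy as np
from numpy.polynomial import legendre

c100 = np.zeros(101)
c100[-1] = 1.0                       # coefficient vector picking out P_100 alone

print(legendre.legval(0.9, c100))    # inside [-1, 1]: magnitude <= 1
print(legendre.legval(2.0, c100))    # ~1e56 just outside the interval
print(legendre.legval(10.0, c100))   # ~5e128: any fit on an interval like this is hopeless
# At degree 10000, even x = 1.1 already overflows float64, so mapping the data
# into [-1, 1] (or [0, 1]) is not cosmetic; it's what makes the whole thing work.
```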
Well, one catch is that you're doing degree ten thousand. That alone increases your computational cost and can run afoul of numerical error as your numbers lose precision. You'll have to worry about overflow and underflow. Horner's method or another representation of your polynomial might become important, and you'll notice your CPU spiking more. In essence, this is kind of like fitting more and more points until you get rid of every oscillation you don't want.
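On the evaluation side, a minimal generic sketch of Horner's method (not code from the article):

```python
# Horner's method: evaluate a_n x^n + ... + a_1 x + a_0 with n multiplies and
# n adds, never forming x^n as a standalone intermediate.
def horner(coeffs, x):
    """coeffs ordered from the highest-degree term down to the constant."""
    acc = 0.0
    for a in coeffs:
        acc = acc * x + a
    return acc

# 2x^3 - x^2 + 5 evaluated at x = 3: 2*27 - 9 + 5 = 50
print(horner([2.0, -1.0, 0.0, 5.0], 3.0))  # 50.0
```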
The other catch, which the author mentions, is that extrapolation is still essentially impossible with polynomials. This is easy to see: once you step outside the interval of interest, the highest-degree term dominates all the others, and every non-constant polynomial grows without bound.
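A toy illustration of that blow-up (data and degree are my own choices, not the post's):

```python
import numpy as np
from numpy.polynomial import Legendre

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

# Degree-30 Legendre least-squares fit; .fit() rescales [0, 1] internally.
fit = Legendre.fit(x, y, deg=30)
print(fit(0.5))                  # ~0, i.e. close to sin(pi): fine inside the data
print(fit(1.5), fit(3.0))        # typically astronomically large once outside it
```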
Honestly, though, what the author is doing isn't much different from a Taylor approximation. If you're okay fitting any polynomial in a small neighbourhood of the data, then go whole hog and fit the best possible polynomial: a Taylor polynomial. You wouldn't usually need to go to that high a degree to get what you want in a neighbourhood either.
The original paper about the bias-variance tradeoff, which the double-descent papers targeted, had some specific constraints:
1) Data availability and compute limited training-set sizes. 2) They could simulate infinite datasets.
While challenging for our intuitions, training-set sizes today make it highly likely that the patterns in your test set are similar to concept classes in your training set.
This is very different from procedurally or randomly generated test sets, both of which can lead to problems like overfitting with over-parameterized networks.
When similar patterns are likely to exist, the cost of some memorization goes down, and it actually becomes somewhat helpful for generalization.
There are obviously more factors at play here, but go look at the double-descent papers and their citations of early-'90s papers and you will see this.
The low sensitivity of transformers also helps dramatically: UHAT without CoT has only the expressiveness of TC0, while with a log-space scratchpad it gets PTIME expressibility.
You can also view this through the lens of autograd requiring a smooth manifold and the ability to approximate a global gradient, if that works better for you.
But yes, all intros have to simplify concepts, and there are open questions.
Numerical analysis is ALL about picking the right tool for the job. Unfortunately, there are no universal super-methods, and you need to know how to classify problems.
> So... are _all_ introductions to machine learning just extremely wrong here?
It's more of a heuristic. Most people have their first experience in Excel, where you can fit a polynomial trendline. Cranking up the degree will always improve R² (since Excel doesn't do a holdout), so it's a very common mistake for new students to make.
It's much more understandable at the beginner level to say "you'll overfit if you crank up the degree" than it is to explain regularization and bases. Later on you can introduce those, but early on it's confusing and distracting to students who might not even know how to solve an ordinary least squares model.
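To see that failure mode without Excel, a rough sketch with toy data (degrees and noise level are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(20)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.standard_normal(200)

# Training R^2 can only go up with degree (nested least-squares models), while
# the held-out error typically gets worse past some point. np.polyfit may also
# emit a RankWarning at the higher degrees, which is itself a hint about the numerics.
for degree in (1, 3, 5, 9, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    resid = y_train - np.polyval(coeffs, x_train)
    r2 = 1 - np.sum(resid**2) / np.sum((y_train - y_train.mean())**2)
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test))**2)
    print(f"degree {degree:2d}: train R^2 = {r2:.3f}, held-out MSE = {test_mse:.3f}")
```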
It's not like they would sabotage competitors by training them wrong, as a joke.
The catch is that orthogonal polynomial bases (like Legendre) implicitly regularize by controlling the Riesz representer norm, effectively implementing a form of spectral filtering that penalizes high-frequency components.
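For anyone who wants to poke at this, here's roughly what I understand the "use another basis" trick to look like in practice: a minimum-norm least-squares fit in the Legendre basis (the toy data, the degree, and the use of np.linalg.lstsq are my choices, not necessarily the author's exact setup).

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-1, 1, 30))
y = np.sin(3 * x) + 0.1 * rng.standard_normal(30)

degree = 200                                   # far more parameters than data points
V = legendre.legvander(x, degree)              # design matrix of P_0..P_200 at x
coef, *_ = np.linalg.lstsq(V, y, rcond=None)   # minimum-norm solution: the implicit regularizer

x_dense = np.linspace(-0.95, 0.95, 7)
print(legendre.legval(x_dense, coef))          # values stay moderate on the interval
                                               # (the fit passes through all 30 points)
```

Whether the result counts as "not overfitting" is exactly the debate upthread about the per-point spikes.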