A well-known related paper that I didn't see mentioned in the article (although Trefethen was mentioned) is "Six Myths of Polynomial Interpolation and Quadrature".
This article is much better, more informative, and factual than I'd have expected from the title. Note that it's part of an 8-article series.
Worth a read if you're ever fitting functions.
There's something I'm fundamentally missing here: if the standard basis and the Bernstein basis describe exactly the same set of polynomials of degree n, then surely the degree-n polynomial that minimizes the mean squared error is unique and independent of the basis (the error is between the samples and the approximation; the coefficients/basis are not involved), so both the standard-basis and Bernstein-basis solutions should be the same (pathological, overfitted, oscillating) curve?
Like, I understand how the standard basis is pathological, because the higher-degree powers diverge like mad, so given "reasonable" coefficients the Bernstein basis is more likely to give "reasonable" curves. But if you're already minimizing the same error, I don't understand how you arrive at a different curve.
What am I missing?
This is why I believe a numerical methods course should be a requirement for any AI major.
This part about double descent is really good: https://alexshtf.github.io/2025/03/27/Free-Poly.html#fnref:2
See also this stackexchange answer, which makes basically the same point:
https://stats.stackexchange.com/questions/560383/how-can-we-...
I think the posted article has some important confusions. The first is the distinction between a basis and an objective. If I fit a polynomial to go through every point in my dataset, I will end up with exactly the same solution no matter what basis I use to represent it (although the basis-specific coefficients will of course differ). This is because all polynomial bases (by definition) represent the same hypothesis class, and the recipe "fit a polynomial that passes through every point" can be expressed as an optimization problem over that hypothesis class. More generally, any two bases will give exactly the same solution when optimizing the same objective (ignoring things like floating point errors).
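To make that concrete, here's a quick numpy sketch (my own, not from the article) showing that ordinary least squares in the standard basis and in the Bernstein basis produce the same fitted curve, just with different coefficients:

    import numpy as np
    from scipy.stats import binom

    rng = np.random.default_rng(0)
    d, n = 8, 40
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

    # Design matrices for the same degree-d hypothesis class in two bases.
    V = np.vander(x, d + 1, increasing=True)            # standard basis: 1, x, ..., x^d
    B = binom.pmf(np.arange(d + 1), d, x[:, None])      # Bernstein basis: C(d,k) x^k (1-x)^(d-k)

    c_std, *_ = np.linalg.lstsq(V, y, rcond=None)
    c_bern, *_ = np.linalg.lstsq(B, y, rcond=None)

    xt = np.linspace(0, 1, 200)
    pred_std = np.vander(xt, d + 1, increasing=True) @ c_std
    pred_bern = binom.pmf(np.arange(d + 1), d, xt[:, None]) @ c_bern

    # Same curve up to floating point, even though c_std and c_bern look nothing alike.
    print(np.max(np.abs(pred_std - pred_bern)))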
So why do the results with the Bernstein basis look better than for the standard basis? It's because they are actually optimizing a different objective function (which is explicitly written out in the stackexchange post). So in that sense Bernstein-vs.-standard is really a comparison between different objective functions rather than between different bases. It so happens that the solution to the new objective function has a simple expression in terms of the Bernstein basis; but in principle I could set up and solve the same objective in terms of the standard basis and obtain exactly the same result. From this perspective, the Bernstein polynomials are nothing more than a convenient computational device for expressing the solution to the objective. The OP kind of gets at this with the distinction between Fitting and Interpolation, but seems to conflate different fitting procedures with different bases.
Secondly, there is actually no need for explicit regularization when fitting Bernstein polynomials (cf. the stackexchange post); the way the OP fits Bernstein polynomials is non-standard. One can say, though, that the specific form of the objective function provides an important source of implicit regularization.
So in conclusion I think the OP has importantly mis-attributed the source of success of the Bernstein method. It does not have to do with explicit regularization or choice of basis, and has everything to do with re-defining the objective function.
In this case wouldn't a Fourier-type approach work better? At least there's no risk that the function blows up, and it possibly needs fewer parameters?
It should also be noted that this kind of fitting is very closely related to integration, or in other words to calculating the mean.
By using the "right" kind of polynomial basis you can get a polynomial approximation which also tells you the mean and variance of the function under a random variable.
For large polynomial degree, this seems to me to be very similar to the P-spline method with a very large number of parameters.
Wanted to add my voice to the chorus of appreciation for this article (actually a series of 8). Very informative and engaging.
Great article! Very curious how this orthogonalization + regularization idea could be extended to other kinds of series, such as Fourier series.
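Mechanically it seems to carry over pretty directly; here's a rough sketch (my own, not from the article) of ridge-regularized least squares on a truncated Fourier basis over [0, 1]:

    import numpy as np

    def fourier_features(x, K):
        # Constant term plus cos/sin pairs up to frequency K.
        k = np.arange(1, K + 1)
        return np.column_stack([np.ones_like(x),
                                np.cos(2 * np.pi * np.outer(x, k)),
                                np.sin(2 * np.pi * np.outer(x, k))])

    rng = np.random.default_rng(0)
    n, K, lam = 60, 30, 1e-3
    x = rng.uniform(0, 1, n)
    y = np.exp(-3 * x) + 0.1 * rng.standard_normal(n)

    # Ridge regression: minimize ||A w - y||^2 + lam * ||w||^2.
    A = fourier_features(x, K)
    w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

    xt = np.linspace(0, 1, 200)
    yt = fourier_features(xt, K) @ w   # note the basis is periodic on [0, 1]

The main wrinkle is that the basis is periodic, so non-periodic targets pick up boundary artifacts.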
See also https://en.wikipedia.org/wiki/Chebyshev_polynomials
They're the kind of math that is non-obvious and a bit intricate, yet if you knew the basics of how they worked and you were bored, you could sit down with pencil and paper and derive everything about them. Then you wouldn't be bored anymore.
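For anyone who wants the pencil-and-paper starting point: the whole family follows from T_0(x) = 1, T_1(x) = x and the recurrence T_{n+1}(x) = 2x*T_n(x) - T_{n-1}(x), which is equivalent to T_n(cos t) = cos(n*t). A tiny numerical check (my own sketch):

    import numpy as np

    x = np.linspace(-1, 1, 7)
    T = [np.ones_like(x), x.copy()]            # T_0 and T_1
    for n in range(1, 5):
        T.append(2 * x * T[n] - T[n - 1])      # three-term recurrence

    # Each T_n agrees with the trigonometric definition cos(n * arccos(x)).
    for n, Tn in enumerate(T):
        assert np.allclose(Tn, np.cos(n * np.arccos(x)))
    print("recurrence matches cos(n * arccos(x)) for n = 0..5")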
Good article, but I am very bothered by the "standard basis"... it's called the canonical basis in math. I don't think "standard" is the right name in any context.
I completely disagree with the conclusion of the article. The reason the examples worked so well is an arbitrary choice that went completely unremarked.
The interval was chosen as 0 to 1, and this single fact is what made the approach feasible. Had the interval been chosen as 0 to 10, a degree-100 polynomial would have had to compute 10^100, which would have led to drastic numerical errors.
The article totally fails to give any of the legitimate and very important reasons why high-degree polynomials are dangerous. It is absurd to say that well-known numerical problems do not exist just because you found one example where they did not occur.
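A two-line check of the interval point (my own sketch): monomial features on [0, 10] reach about 10^100 at degree 100, while rescaling x to [0, 1] keeps every feature in [0, 1], which is essentially what the article's choice of interval buys it.

    import numpy as np

    x = np.linspace(0, 10, 200)
    deg = 100
    print(np.vander(x, deg + 1, increasing=True).max())         # ~1e100: columns span ~100 orders of magnitude
    print(np.vander(x / 10.0, deg + 1, increasing=True).max())  # 1.0: every feature stays in [0, 1]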
One thought I had while looking into regression recently is to consider the model created with a given regularization coefficient not as a line on a 2-dimensional graph, but as a slice of a surface in a 3-dimensional graph, where the third dimension is the regularization coefficient.
In my case the model was logistic regression and the lines were the classification boundaries, but the thought is largely the same: view it as a 3D shape formed by the boundary lines, where hilltops are areas where entire classification boundaries disappear once the regularization coefficient grows large enough to eliminate them. Impractical to do on models of any size, and only useful when looking at two features at a time, but a fun consideration.
More on topic with the article: how well does this work when considering multiple features and the different combinations of them? Instead of sigma(n => 50) of x^n, what happens if you have sigma(n => 50) of sigma(m => 50) of (x^n)*(y^m)? Well, probably less than 50 in the second example; maybe it is fair to have n and m go to 7, so there are 49 total terms compared to the original 50, instead of the 2500 terms you'd get if both go to 50. A rough sketch of this construction is below.
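Here's a rough numpy sketch of that tensor-product construction (hypothetical numbers, just to show the shape of the feature matrix):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 100, 7
    x = rng.uniform(0, 1, N)
    y = rng.uniform(0, 1, N)

    # Univariate monomial features 1, t, ..., t^d for each variable.
    Vx = np.vander(x, d + 1, increasing=True)   # shape (N, d + 1)
    Vy = np.vander(y, d + 1, increasing=True)   # shape (N, d + 1)

    # All cross terms x^n * y^m for n, m = 0..d: (d + 1)^2 = 64 columns,
    # i.e. 49 "pure" cross terms plus the constant and the univariate terms.
    features = np.einsum('in,im->inm', Vx, Vy).reshape(N, -1)
    print(features.shape)   # (100, 64)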
In another post (https://alexshtf.github.io/2025/03/27/Free-Poly.html) the author fits a degree-10000 (ten thousand!) polynomial using the Legendre basis. The polynomial _doesn't overfit_, demonstrating double descent. "What happened to our overfitting from ML 101 textbooks? There is no regularization. No control of the degree. But “magically” our high degree polynomial is not that bad!"
So... are _all_ introductions to machine learning just extremely wrong here? I feel like I've seen dozens of reputable books and courses that introduce overfitting and generalization using the severe overfitting and terrible generalization of high-degree polynomials in the usual basis (1, x, x^2, ...). Seemingly everyone warns of the dangers of high-degree polynomials, yet here the author just says "use another basis" and proves everyone wrong? Mind blown, or is there a catch?
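If you want to poke at this yourself without reading the whole post, here's a minimal sketch (not the author's code, and probably not his exact setup): a polynomial whose degree far exceeds the sample count, fit in the Legendre basis with numpy's minimum-norm least squares, so you can compare train and test error directly.

    import numpy as np

    rng = np.random.default_rng(0)
    n_train, deg = 30, 2000
    x_train = rng.uniform(-1, 1, n_train)
    y_train = np.sin(np.pi * x_train) + 0.1 * rng.standard_normal(n_train)

    # Underdetermined least squares: lstsq returns the minimum-norm coefficient vector.
    A = np.polynomial.legendre.legvander(x_train, deg)          # shape (30, 2001)
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

    x_test = np.linspace(-1, 1, 500)
    y_hat = np.polynomial.legendre.legvander(x_test, deg) @ coef
    train_rmse = np.sqrt(np.mean((A @ coef - y_train) ** 2))
    test_rmse = np.sqrt(np.mean((y_hat - np.sin(np.pi * x_test)) ** 2))
    print(train_rmse, test_rmse)   # near-zero training error; inspect the test error yourself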