Had a QuantSci Prof who was fond of asking "Who can name a data collection scenario where the x data has no error?" and then taught Deming regression as a generally preferred analysis [1]
You can think of it as: linear regression models noise only in y and not in x, whereas the PCA ellipse/eigenvector models noise in both x and y.
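A minimal numpy sketch of that contrast, using invented toy data (the true slope of 1.5 and the equal noise levels in x and y are illustrative assumptions, not anything from the article):

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0, 10, 500)                 # latent "true" values
    x = t + rng.normal(0, 1, t.size)            # noise in x
    y = 1.5 * t + rng.normal(0, 1, t.size)      # noise in y

    ols_slope = np.polyfit(x, y, 1)[0]

    # leading eigenvector of the covariance matrix = major axis of the ellipse
    eigvals, eigvecs = np.linalg.eigh(np.cov(x, y))
    major = eigvecs[:, np.argmax(eigvals)]
    pca_slope = major[1] / major[0]

    print("OLS slope:", ols_slope)   # pulled toward zero by the noise in x
    print("PCA slope:", pca_slope)   # stays near 1.5 when x and y noise are comparable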
I haven't dealt with statistics for a while, but what I don't get is: why squares specifically? Why not a power of 1, or 3, or 4, or anything else? I've seen squares come up a lot in statistics. One explanation I didn't really like is that squares are easier to work with because you don't have to use abs(), since everything is positive. OK, but why not another even power like 4? Different powers should give you different results, which seems like a big deal because statistics is used to explain important things and to guide our lives with respect to those important things. What makes squares the best? I can't recall the other places I've seen squares used, as my memory of my statistics training is quite blurry now, but they seem to pop up here and there in statistics relatively often.
Sorry for my negativity / meta comment on this thread. From what I can tell, the stackexchange discussion in the submission already provides all the relevant points to be discussed about this.
While the asymmetry of least squares will probably be a bit of a novelty/surprise to some, pretty much anything posted here is more or less a copy of one of the comments on stackexchange.
[Challenge: provide a genuinely novel on-topic take on the subject.]
Least squares and PCA minimize different loss functions. One is the sum of squared vertical (y) distances; the other is the sum of squared perpendicular distances (the closest distances to the line). That is what introduces the difference.
If you plot the regression line of y against x, and also x against y, you would get two different lines.
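A quick way to see the two lines, as a sketch with made-up numbers (true slope 0.7, unit noise):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(0, 1, 2000)
    y = 0.7 * x + rng.normal(0, 1, x.size)

    b_y_on_x = np.polyfit(x, y, 1)[0]        # minimizes vertical distances
    b_x_on_y = np.polyfit(y, x, 1)[0]        # minimizes horizontal distances

    print("y-on-x slope:          ", b_y_on_x)       # ~ r * sd(y)/sd(x)
    print("x-on-y slope, inverted:", 1 / b_x_on_y)   # ~ (1/r) * sd(y)/sd(x)
    # the two lines coincide only when |corr(x, y)| = 1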
I found it in the middle of teaching a stats class, and feel embarrassed.
I guess normalising is one way to remove the bias.
A note mostly about terminology:
The least squares model will produce unbiased predictions of y given x, i.e. predictions for which the average error is zero. This is the usual technical definition of unbiased in statistics, but it may not correspond to common usage.
Whether x is a noisy measurement or not is sort of irrelevant to this -- you make the prediction with the information you have.
Many times I've looked at the output of a regression model, seen this effect, and thought my model must be very bad. But then I remember the points made elsewhere in the thread.
One way to visually check that the fit line has the right slope is to (1) pick some x value, and then (2) ensure that the noise on top of the fit is roughly balanced on either side. I.e., that the result does look like y = prediction(x) + epsilon, with epsilon some symmetric noise.
One other point is that if you try to simulate some data as, say
y = 1.5 * x + random noise
then do a least squares fit, you will recover the 1.5 slope, and still it may look visually off to you.
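Something like this toy simulation (the numbers are arbitrary) shows both points: the fitted slope comes back near 1.5, and the residuals are balanced around zero on both halves of the x range, even if the line "looks" too shallow:

    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.uniform(0, 10, 300)
    y = 1.5 * x + rng.normal(0, 2, x.size)     # noise only in y

    slope, intercept = np.polyfit(x, y, 1)
    print("fitted slope:", slope)              # close to 1.5

    # the visual check mentioned above: residuals should scatter
    # symmetrically around zero at any given x
    residuals = y - (slope * x + intercept)
    right = x > np.median(x)
    print("mean residual, left half: ", residuals[~right].mean())
    print("mean residual, right half:", residuals[right].mean())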
This problem is usually known as regression dilution, discussed here: https://en.wikipedia.org/wiki/Regression_dilution
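The standard attenuation factor is easy to check numerically; this sketch (with made-up variances) shows the fitted slope shrinking by roughly var(true x) / (var(true x) + var(x error)) as the measurement error on x grows:

    import numpy as np

    rng = np.random.default_rng(0)
    t = rng.normal(0, 2, 100_000)              # true predictor, variance 4
    y = 1.5 * t + rng.normal(0, 1, t.size)

    for sigma in (0.0, 1.0, 2.0):              # growing measurement error on x
        x_obs = t + rng.normal(0, sigma, t.size)
        fitted = np.polyfit(x_obs, y, 1)[0]
        theory = 1.5 * 4 / (4 + sigma**2)      # slope * var(t) / (var(t) + sigma^2)
        print(f"sigma={sigma}: fitted {fitted:.3f}, predicted {theory:.3f}")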
Yes, people want to mentally rotate, but that's not correct. This is not a "geometric", coordinate-system-independent operation.
IMO this is a basic risk with graphs. It is great to use imagery to engage the spatial-reasoning parts of our brain. But sometimes, as in this case, it is deceiving, because we impute geometric structure that isn't true of the mathematical construct being visualized.
You would probably get what you want with a Deming regression.
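For reference, a small self-contained sketch of Deming regression using the usual closed-form slope (delta is the assumed ratio of the y-error variance to the x-error variance; the toy data below is invented):

    import numpy as np

    def deming_slope(x, y, delta=1.0):
        # delta = var(y errors) / var(x errors); delta=1 is orthogonal regression
        sxx = np.var(x, ddof=1)
        syy = np.var(y, ddof=1)
        sxy = np.cov(x, y)[0, 1]
        term = syy - delta * sxx
        return (term + np.sqrt(term**2 + 4 * delta * sxy**2)) / (2 * sxy)

    rng = np.random.default_rng(1)
    t = np.linspace(0, 10, 500)                # latent true values
    x = t + rng.normal(0, 1, t.size)           # error in x
    y = 1.5 * t + rng.normal(0, 1, t.size)     # error in y

    print("OLS slope:   ", np.polyfit(x, y, 1)[0])   # attenuated
    print("Deming slope:", deming_slope(x, y))       # closer to the true 1.5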
I think linear least squares is like a shear, whereas the eigenvector fit is like a rotation.
My head canon:
If the true value is medium-high, any random measurements that lie even further above it are easily explained, as that is a low ratio of divergence. If the true value is medium-high, any random measurements that lie far below it are harder to explain, since their relative (i.e., ratio of) divergence is high.
Therefore, the further you go right in the graph, the more a slightly lower guess is a good fit, even if many values then lie above it.
> So, instead, I then diagonalized the covariance matrix to obtain the eigenvector that gives the direction of maximum variance.
...as one does...
This is probably obvious, but there is another form of regression that uses mean absolute error rather than squared error, as that approach is less prone to outliers. The math isn't as elegant, though.
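That's least absolute deviations regression. A rough sketch with scipy (the outlier placement and the true slope of 2 are just illustrative):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 200)
    y = 2.0 * x + 1.0 + rng.normal(0, 1, x.size)
    y[-20:] += 25                               # outliers clustered at high x

    ols = np.polyfit(x, y, 1)                   # [slope, intercept]

    def mean_abs_error(params):
        intercept, slope = params
        return np.mean(np.abs(y - (intercept + slope * x)))

    # start from the OLS solution and let a derivative-free optimizer
    # move it to the least-absolute-deviations line
    lad = minimize(mean_abs_error, x0=[ols[1], ols[0]], method="Nelder-Mead").x

    print("OLS slope:", ols[0])    # dragged up by the outliers
    print("LAD slope:", lad[1])    # stays near the true slope of 2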
This is why my favorite best fit algorithm is RANSAC.
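For anyone curious, a minimal scikit-learn sketch (same kind of invented outliers as above; RANSACRegressor's defaults are used, so the inlier threshold choice is left to the library):

    import numpy as np
    from sklearn.linear_model import LinearRegression, RANSACRegressor

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 200)
    y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)
    y[-20:] += 25                               # gross outliers at high x

    X = x.reshape(-1, 1)
    ols = LinearRegression().fit(X, y)
    ransac = RANSACRegressor(random_state=0).fit(X, y)

    print("OLS slope:   ", ols.coef_[0])
    print("RANSAC slope:", ransac.estimator_.coef_[0])
    print("points kept as inliers:", ransac.inlier_mask_.sum())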
Linear Regression a.k.a. Ordinary Least Squares assumes only Y has noise, and X is correct.
Your "visual inspection" assumes both X and Y have noise. That's called Total Least Squares.