> Principal component analysis of 200 GPT2, 500 Vision Transformer, 50 LLaMA-8B, and 8 Flan-T5 models reveals consistent sharp spectral decay - strong evidence that a small number of weight directions capture dominant variance despite vast differences in training data, objectives, and initialization.
Isn't it obvious?
This general idea shows up all over the place, though. If you do 3D scans of thousands of mammal skulls, you'll find that a few PCs account for the vast majority of the variance. If you do frequency-domain analysis of various physiological signals... same thing. Ditto for many, many other natural phenomena in the world. Interesting (maybe not surprising?) to see it in artificial phenomena as well.
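If anyone wants to see what that kind of spectral decay looks like concretely, here's a toy version in Python: stack flattened "weight vectors" as rows and look at the PCA spectrum. This is synthetic low-rank-plus-noise data standing in for real checkpoints, not the paper's actual pipeline; all the sizes are arbitrary.

```python
import numpy as np

# Toy illustration: rows = flattened "model weight vectors", columns = params.
# Synthetic low-rank signal + noise stands in for real checkpoints.
rng = np.random.default_rng(0)
n_models, dim, rank = 200, 4096, 5

shared = rng.standard_normal((rank, dim))                # shared directions
loadings = rng.standard_normal((n_models, rank)) * 10.0  # per-model coefficients
W = loadings @ shared + rng.standard_normal((n_models, dim))

W -= W.mean(axis=0)                 # center, then PCA via SVD
s = np.linalg.svd(W, compute_uv=False)
explained = s**2 / np.sum(s**2)
print(np.round(explained[:10], 3))  # sharp drop after ~5 components
```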
Not really. If the models are trained on different datasets - like one ViT trained on satellite images and another on medical X-rays - one would expect their parameters, which were randomly initialized, to end up completely different, or even orthogonal.
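The orthogonality intuition is easy to sanity-check: independent random vectors in high dimensions are almost surely near-orthogonal, since their cosine similarity concentrates around 0 with standard deviation about 1/sqrt(dim). Quick demo (the dimension here is arbitrary):

```python
import numpy as np

# Two independent random-init "weight vectors" in high dimensions:
# cosine similarity concentrates around 0, i.e. near-orthogonal.
rng = np.random.default_rng(0)
dim = 1_000_000  # arbitrary; on the order of a small model's parameter count

a = rng.standard_normal(dim)
b = rng.standard_normal(dim)
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cos:.5f}")  # ~0.001 in magnitude
```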
Well, intuitively it makes sense that within each independent model a small number of weights/parameters are dominant, but it's still super interesting that these can be swapped between all the models without loss of performance.
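I don't know the paper's exact swap procedure, but here's a rough sketch of why a shared low-rank basis would make something like that possible: fit the top-k directions on every model except one, then express the held-out model's weights in that basis. If the dominant directions really are shared across models, almost nothing is lost. Synthetic data and all names are mine, not from the paper.

```python
import numpy as np

# Hypothetical sketch (not the paper's procedure): estimate the top-k
# principal directions from all models but one, then reconstruct the
# held-out model's weights in that basis. Synthetic low-rank data again.
rng = np.random.default_rng(1)
n_models, dim, rank, k = 100, 2048, 5, 5

shared = rng.standard_normal((rank, dim))
W = (rng.standard_normal((n_models, rank)) * 10.0 @ shared
     + rng.standard_normal((n_models, dim)))

mean = W[1:].mean(axis=0)
rest = W[1:] - mean                     # basis fit on the other models only
held_out = W[0] - mean
_, _, vt = np.linalg.svd(rest, full_matrices=False)
top_k = vt[:k]

rebuilt = (held_out @ top_k.T) @ top_k  # keep only the shared directions
err = np.linalg.norm(held_out - rebuilt) / np.linalg.norm(held_out)
print(f"relative reconstruction error: {err:.3f}")  # small if basis transfers
```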
It isn’t obvious that these parameters are universal across all models.