Interesting... this could make training much faster if there's a universal low-dimensional space that models naturally converge into, since you could initialize or constrain training inside that space instead of spending massive compute rediscovering it from scratch every time
I find myself wanting genetic algorithms to be applied to try to develop and improve these structures...
But I always want Genetic Algorithms to show up in any discussion about neural networks...
What's the relationship with the Platonic Representation Hypothesis?
What if all models are secretly just fine tunes of llama?
It's basically way better than LoRA in all respects and could even be used to speed up inference. I wonder whether the big labs aren't using it already... If not, we'll see a blow-up in capabilities very, very soon. What they've shown is that you can find the subset of parameters responsible for transfer of capability to new tasks. Does it apply to completely novel tasks? No, that would be magic. Tasks that need new features or representations break the method, but if the task fits in the same domain then the answer is "YES".
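For a sense of what "constrain fine-tuning to the subspace" could look like next to LoRA, here's a rough sketch; everything in it (the basis, the dimensions, the rank) is made up for illustration, not taken from the paper:

```python
# Illustrative only: compare trainable-parameter counts for "fine-tune inside a
# fixed k-dimensional weight subspace" versus LoRA, for a single dense layer.
# The basis here is random; in the paper's setting it would be the universal
# subspace estimated from many trained models.
import numpy as np

d_out, d_in = 768, 768                 # layer shape (GPT-2-ish), flattened below
D = d_out * d_in
k, r = 16, 8                           # subspace dimension vs. LoRA rank

basis = np.linalg.qr(np.random.randn(D, k))[0]   # (D, k), orthonormal columns
w0 = 0.02 * np.random.randn(D)                   # "pretrained" weights, flattened

# Subspace fine-tuning: only the k coefficients alpha are trainable.
alpha = np.zeros(k)
w_subspace = w0 + basis @ alpha

# LoRA: r * (d_in + d_out) trainable parameters for the same layer.
A, B = np.zeros((d_out, r)), np.random.randn(r, d_in)
w_lora = w0.reshape(d_out, d_in) + A @ B

print("trainable params, subspace coefficients:", k)                  # 16
print("trainable params, LoRA (rank 8):        ", r * (d_in + d_out)) # 12288
```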
Here's a very cool analogy from GPT 5.1 which hits the nail on the head in explaining the role of these subspaces in learning new tasks, by analogy with 3D graphics.
Think of 3D character animation rigs:
• The mesh has millions of vertices (11M weights).
• Expressions are controlled via:
• “smile”
• “frown”
• “blink”
Each expression is just:
mesh += α_i * basis_expression_i
Hundreds of coefficients modify millions of coordinates.

> Principal component analysis of 200 GPT-2, 500 Vision Transformer, 50 LLaMA-8B, and 8 Flan-T5 models reveals consistent sharp spectral decay - strong evidence that a small number of weight directions capture dominant variance despite vast differences in training data, objectives, and initialization.
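To make the quoted procedure concrete, here's a toy sketch (synthetic weights, not the paper's actual pipeline): stack many flattened "model" weight vectors that share a low-rank component, run PCA, and watch the spectrum collapse onto a handful of directions:

```python
# Toy version of the quoted procedure (synthetic data, not the paper's pipeline):
# stack flattened per-model weight vectors, center, and inspect the spectrum.
import numpy as np

rng = np.random.default_rng(0)
n_models, D, k_true = 200, 5000, 16    # 200 "models", 5000 weights each, shared rank 16

shared_basis = np.linalg.qr(rng.standard_normal((D, k_true)))[0]   # (D, 16)
coeffs = rng.standard_normal((n_models, k_true))                   # per-model coordinates
W = coeffs @ shared_basis.T + 0.01 * rng.standard_normal((n_models, D))

W_centered = W - W.mean(axis=0)
singular_values = np.linalg.svd(W_centered, compute_uv=False)
variance_ratio = singular_values**2 / np.sum(singular_values**2)

print("variance captured by top 16 directions:", round(variance_ratio[:16].sum(), 3))
print("next 10 variance ratios:", np.round(variance_ratio[16:26], 5))   # sharp drop
```

In this toy, only Gaussian noise lives outside the shared basis, so the elbow after the 16th direction is exactly the kind of sharp spectral decay the quote describes.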
Isn't it obvious?
Interesting - I wonder if this ties into the Platonic Space Hypothesis recently being championed by computational biologist Mike Levin
E.g.
https://youtu.be/Qp0rCU49lMs?si=UXbSBD3Xxpy9e3uY
https://thoughtforms.life/symposium-on-the-platonic-space/
e.g. see this paper on Universal Embeddings: https://arxiv.org/html/2505.12540v2
"The Platonic Representation Hypothesis [17] conjectures that all image models of sufficient size have the same latent representation. We propose a stronger, constructive version of this hypothesis for text models: the universal latent structure of text representations can be learned and, furthermore, harnessed to translate representations from one space to another without any paired data or encoders.
In this work, we show that the Strong Platonic Representation Hypothesis holds in practice. Given unpaired examples of embeddings from two models with different architectures and training data, our method learns a latent representation in which the embeddings are almost identical"
Also, from the OP's paper, we see this statement:
"Why do these universal subspaces emerge? While the precise mechanisms driving this phenomenon remain an open area of investigation, several theoretical factors likely contribute to the emergence of these shared structures.
First, neural networks are known to exhibit a spectral bias toward low frequency functions, creating a polynomial decay in eigenvalues that concentrates learning dynamics into a small number of dominant directions (Belfer et al., 2024; Bietti et al., 2019).
Second, modern architectures impose strong inductive biases that constrain the solution space: convolutional structures inherently favor local, Gabor-like patterns (Krizhevsky et al., 2012; Guth et al., 2024), while attention mechanisms prioritize recurring relational circuits (Olah et al., 2020; Chughtai et al., 2023).
Third, the ubiquity of gradient-based optimization – governed by kernels that are largely invariant to task specifics in the infinite-width limit (Jacot et al., 2018) – inherently prefers smooth solutions, channeling diverse learning trajectories toward shared geometric manifolds (Garipov et al., 2018).
If these hypotheses hold, the universal subspace likely captures fundamental computational patterns that transcend specific tasks, potentially explaining the efficacy of transfer learning and why diverse problems often benefit from similar architectural modifications."
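A rough way to see the "eigenvalue decay concentrates learning" point the quote is making: the eigenvalues of a smooth kernel on random inputs decay fast, so dynamics governed by such a kernel mostly move along a few dominant directions. Below is a toy check using an RBF kernel purely as a stand-in for the NTK (it is not the kernel of any actual architecture):

```python
# Toy check of the "eigenvalue decay" point: eigenvalues of a smooth kernel on
# random inputs decay quickly, so dynamics governed by that kernel concentrate in
# a few directions. RBF is a stand-in here, not the NTK of any real architecture.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))                       # 500 inputs in 10 dimensions

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 20.0)                             # RBF kernel matrix

eigvals = np.linalg.eigvalsh(K)[::-1]                    # descending, all >= 0
ratios = eigvals / eigvals.sum()
print("fraction of kernel mass in top 16 eigendirections:", round(ratios[:16].sum(), 3))
```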
(Finds a compression artifact) "Is this the meaning of consciousness???"
They compressed the compression? Or identified an embedding that can "bootstrap" training with a head start?
Not a technical person just trying to put it in other words.
They are analyzing models trained on classification tasks. At the end of the day, classification is about (a) engineering features that separate the classes and (b) finding a way to represent the boundary. It's not surprising to me that these models can be described using a small number of dimensions, or that similar structure shows up across classification problems. The number of dimensions needed is basically a function of the number of classes: embeddings in 1 dimension can linearly separate 2 classes, 2 dimensions can separate 4 classes, 3 dimensions can separate 8 classes, and so on.
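A minimal toy to back up that counting argument (nothing to do with the paper itself): put each of 2^d classes at its own hypercube corner in d dimensions, and d linear sign readouts recover the class exactly:

```python
# Toy version of the counting argument: 2**d classes, each at its own hypercube
# corner in d dimensions, are recovered exactly by d linear sign readouts.
import numpy as np

rng = np.random.default_rng(0)
d = 3                                          # embedding dimensions
n_classes = 2 ** d                             # 8 classes

# Class c -> corner given by the bits of c, mapped to {-1, +1}.
corners = np.array([[1.0 if (c >> i) & 1 else -1.0 for i in range(d)]
                    for c in range(n_classes)])

labels = rng.integers(0, n_classes, size=1000)
points = corners[labels] + 0.3 * rng.standard_normal((1000, d))   # noisy embeddings

bits = (points > 0).astype(int)                # one linear sign readout per dimension
predicted = (bits * (2 ** np.arange(d))).sum(axis=1)
print("accuracy:", (predicted == labels).mean())   # ~1.0 at this noise level
```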
Would you see a lower rank subspace if the learned weights were just random vectors?
I immediately started thinking that if there are such patterns maybe they capture something about the deeper structure of the universe.
The central claim, or "Universal Weight Subspace Hypothesis," is that deep neural networks, even when trained on completely different tasks (like image recognition vs. text generation) and starting from different random conditions, tend to converge to a remarkably similar, low-dimensional "subspace" in their massive set of weights.
Now that we know the calculations in trained models follow some particular forms, is there an approximation algorithm that would let us run the models without GPUs?
I asked Grok to visualize this:
https://grok.com/share/bGVnYWN5_463d51c8-d473-47d6-bb1f-6666...
*Caption for the two images:*
Artistic visualization of the universal low-parameter subspaces discovered in large neural networks (as described in “The Unreasonable Effectiveness of Low-Rank Subspaces,” arXiv:2512.05117).
The bright, sparse linear scaffold in the foreground represents the tiny handful of dominant principal directions (often ≤16 per layer) that capture almost all of the signal variance across hundreds of independently trained models. These directions form a flat, low-rank “skeleton” that is remarkably consistent across architectures, tasks, and random initializations.
The faint, diffuse cloud of connections fading into the dark background symbolizes the astronomically high-dimensional ambient parameter space (billions to trillions of dimensions), almost all of whose directions carry near-zero variance and can be discarded with negligible loss in performance. The sharp spectral decay creates a dramatic “elbow,” leaving trained networks effectively confined to this thin, shared, low-dimensional linear spine floating in an otherwise vast and mostly empty void.
For those trying to understand the most important parts of the paper, here are what I think are the two most significant statements, sub-quoted from two (consecutive) paragraphs midway through the paper:
> we selected five additional, previously unseen pretrained ViT models for which we had access to evaluation data. These models, considered out-of-domain relative to the initial set, had all their weights reconstructed by projecting onto the identified 16-dimensional universal subspace. We then assessed their classification accuracy and found no significant drop in performance
> we can replace these 500 ViT models with a single Universal Subspace model. Ignoring the task-variable first and last layer [...] we observe a requirement of 100 × less memory, and these savings are prone to increase as the number of trained models increases. We note that we are, to the best of our knowledge, the first work, to be able to merge 500 (and theoretically more) Vision Transformer into a single universal subspace model. This result implies that hundreds of ViTs can be represented using a single subspace model
So, they found an underlying commonality among the post-training structures in 50 LLaMA3-8B models, 177 GPT-2 models, and 8 Flan-T5 models; they demonstrated that this commonality could in every case be substituted for the structures in the original models with no loss of function; and they noted that they seem to be the first to discover this.
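If it helps, here's a hedged sketch of the projection/reconstruction step they describe; `universal_basis` and `mean_weights` are placeholders that would, per the paper, come from PCA over hundreds of independently trained models:

```python
# Placeholder sketch of reconstructing a layer's weights from the universal subspace.
# `universal_basis` and `mean_weights` are stand-ins; per the paper they would come
# from PCA over hundreds of independently trained models.
import numpy as np

def reconstruct_from_subspace(w, universal_basis, mean_weights):
    """Project flattened weights w onto the k-dim subspace and rebuild them.

    universal_basis: (D, k) with orthonormal columns (k ~ 16 for the ViT layers)
    mean_weights:    (D,) mean of flattened weights across the trained models
    """
    coeffs = universal_basis.T @ (w - mean_weights)      # k numbers per layer
    w_hat = mean_weights + universal_basis @ coeffs
    return w_hat, coeffs

# Tiny demo with synthetic stand-in data.
rng = np.random.default_rng(0)
D, k = 4096, 16
universal_basis = np.linalg.qr(rng.standard_normal((D, k)))[0]
mean_weights = 0.02 * rng.standard_normal(D)
w = mean_weights + universal_basis @ rng.standard_normal(k) + 0.002 * rng.standard_normal(D)

w_hat, coeffs = reconstruct_from_subspace(w, universal_basis, mean_weights)
print("relative reconstruction error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
print("stored per layer:", coeffs.size, "coefficients instead of", D, "weights")
```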
For a tech analogy, imagine if you found a bzip2 dictionary that reduced the size of every file compressed by 99%, because that dictionary turns out to be uniformly helpful for all files. You would immediately open a pull request to bzip2 to have the dictionary built-in, because it would save everyone billions of CPU hours. [*]
[*] Except instead of 'bzip2 dictionary' (strings of bytes), they use the term 'weight subspace' (analogy not included here[**]) — and, 'file compression' hours becomes 'model training' hours. It's just an analogy.
[**] 'Hilbert subspaces' is just incorrect enough to be worth appending as a footnote[***].
[***] As a second footnote.