
altairprime (yesterday at 1:57 AM)

For those trying to understand the most important parts of the paper, here are what I think are the two most significant statements, quoted from two consecutive paragraphs midway through the paper:

> we selected five additional, previously unseen pretrained ViT models for which we had access to evaluation data. These models, considered out-of-domain relative to the initial set, had all their weights reconstructed by projecting onto the identified 16-dimensional universal subspace. We then assessed their classification accuracy and found no significant drop in performance

> we can replace these 500 ViT models with a single Universal Subspace model. Ignoring the task-variable first and last layer [...] we observe a requirement of 100 × less memory, and these savings are prone to increase as the number of trained models increases. We note that we are, to the best of our knowledge, the first work, to be able to merge 500 (and theoretically more) Vision Transformer into a single universal subspace model. This result implies that hundreds of ViTs can be represented using a single subspace model

So, they found an underlying commonality among the post-training weight structures of 50 LLaMA3-8B models, 177 GPT-2 models, and 8 Flan-T5 models; they demonstrated that, in every case, the original weights could be replaced by their projection onto this common subspace with no loss of function; and they note that they seem to be the first to discover this.
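To make the mechanics concrete, here is a minimal sketch of what "projecting onto a 16-dimensional universal subspace" could look like, assuming the subspace comes from something like an SVD/PCA over flattened layer weights pooled from many models. The function names and the SVD-based construction are my own illustration; the paper's actual procedure may differ.

```python
# Hypothetical sketch (not the paper's code): learn a shared low-dimensional basis
# from many models' weights, then replace each model's weights with their projection.
import numpy as np

def fit_universal_subspace(weight_stack, k=16):
    """weight_stack: (n_models, d) array of flattened layer weights.
    Returns the mean and a (k, d) orthonormal basis for the shared subspace."""
    mean = weight_stack.mean(axis=0)
    _, _, vt = np.linalg.svd(weight_stack - mean, full_matrices=False)
    return mean, vt[:k]                       # top-k right singular vectors

def project_and_reconstruct(weights, mean, basis):
    """Reconstruct one model's weights from just k coefficients in the shared basis."""
    coeffs = (weights - mean) @ basis.T       # 16 numbers describe this model's layer
    return mean + coeffs @ basis              # used in place of the original weights

# Toy usage: 500 "models", each a flattened layer of 10,000 parameters
rng = np.random.default_rng(0)
stack = rng.normal(size=(500, 10_000))
mean, basis = fit_universal_subspace(stack)
reconstructed = project_and_reconstruct(stack[0], mean, basis)
```

The surprising claim in the paper is that reconstructions like this (for real trained weights, not random ones) lose essentially no accuracy, even for models outside the set used to fit the basis.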

For a tech analogy, imagine if you found a bzip2 dictionary that reduced the size of every compressed file by 99%, because that dictionary turns out to be uniformly helpful for all files. You would immediately open a pull request to bzip2 to have the dictionary built in, because it would save everyone billions of CPU hours. [*]

[*] Except instead of 'bzip2 dictionary' (strings of bytes), they use the term 'weight subspace' (analogy not included here[**]) — and, 'file compression' hours becomes 'model training' hours. It's just an analogy.

[**] 'Hilbert subspaces' is just incorrect enough to be worth appending as a footnote[***].

[***] As a second footnote.


Replies

tsurba (yesterday at 5:25 AM)

Edit: actually this paper seems to be the canonical reference (?): https://arxiv.org/abs/2007.00810. Models converge to the same space up to a linear transformation, so it makes sense that a linear transformation (like PCA) would be able to undo that transformation.

You can show, for example, that siamese encoders for time series, trained with an MSE loss on similarity and without a decoder, will converge to the same latent space up to orthogonal transformations (since MSE acts like a Gaussian prior, which doesn't distinguish between different rotations).
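A toy demonstration of that "same latent space up to an orthogonal transformation" point, using a closed-form orthogonal Procrustes solve (my own example, not taken from either paper):

```python
# Toy demo: two encoders whose latent spaces differ only by a rotation can be
# aligned exactly with an orthogonal Procrustes solve.
import numpy as np

rng = np.random.default_rng(1)
Z_a = rng.normal(size=(1_000, 64))            # embeddings of the same inputs, encoder A
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
Z_b = Z_a @ Q                                 # encoder B: same geometry, rotated basis

# R = argmin ||Z_a R - Z_b||_F  subject to  R^T R = I  (closed form via SVD)
U, _, Vt = np.linalg.svd(Z_a.T @ Z_b)
R = U @ Vt

print(np.allclose(Z_a @ R, Z_b))              # True: the rotation is fully undone
```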

Similarly, I would expect that transformers trained on the same next-word prediction loss, on data that is at all similar (like human language), would converge to approximately the same space, up to some (likely linear) transformations. And to represent that same space, the weights are probably similar too. Weights in general seem to occupy low-dimensional spaces.

All in all, I don't think this is that surprising, and I think the theoretical angle should be (or should have been) to find mathematical proofs, as in this paper: https://openreview.net/forum?id=ONfWFluZBI

They also have a previous paper ("CEBRA") published in Nature with similar results.

westoncb (yesterday at 2:55 AM)

> So, they found an underlying commonality among the post-training structures in 50 LLaMA3-8B models, 177 GPT-2 models, and 8 Flan-T5 models; and, they demonstrated that the commonality could in every case be substituted for those in the original models with no loss of function; and noted that they seem to be the first to discover this.

Could someone clarify what this means in practice? If there is a 'commonality', why would substituting it do anything? Like, if there's some subset of weights X found in all these models, how would substituting X with X be useful?

I see how this could be useful in principle (and obviously it's very interesting), but I'm not clear on how it works in practice. Could you, e.g., train new models with that weight subset initialized to this universal set? And how 'universal' is it? Just for models of certain sizes and architectures, or in some way more general than that?

N_Lens (yesterday at 4:20 AM)

If models naturally occupy shared spectral subspaces, this could dramatically reduce:

- Training costs: We might discover these universal subspaces without training thousands of models

- Storage requirements: Models could share common subspace representations (rough arithmetic sketched below)
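Back-of-the-envelope arithmetic on the storage point (my own toy accounting with an assumed parameter count; the paper, which excludes the first and last layers, reports roughly 100x):

```python
# Toy accounting: one shared 16-dimensional basis plus 16 coefficients per model,
# vs. 500 full, independently stored weight sets.
n_models = 500
n_params = 86_000_000          # roughly ViT-Base sized, purely for illustration
k = 16

full = n_models * n_params                    # every model stored independently
shared = k * n_params + n_models * k          # k basis vectors + k coefficients per model

print(f"{full / shared:.1f}x smaller")        # ~31x here; the ratio grows as n_models / k
```

Whatever the exact accounting, the key scaling is the same one the paper points out: the shared basis is paid for once, so the savings keep growing as more models are added.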

scotty79 (yesterday at 4:10 AM)

"16 dimensions is all you need" ... to do human achievable stuff at least

scotty79 (yesterday at 4:11 AM)

16 seems like a suspiciously round number ... why not 17 or 13? ... is this just the result of some bug in the code they used to do their science?

or is it just that 16 was arbitrarily chosen by them as close enough to the actual minimal number of dimensions necessary?
