Hacker News

LoRA vs. Full Fine-Tuning: An Illusion of Equivalence

223 points by timbilt · 11/08/2024 · 50 comments

Comments

pwillia7 · 11/08/2024

This tracks with my feelings from making and using Stable Diffusion LoRAs and fine-tunes. Still, given how fast they are to train and use, LoRAs have worked for me in most use cases, and it hasn't been worth fine-tuning the entire model.

sorenjan · 11/08/2024

> We randomly initialize A such that it has singular values of 1, freeze it, and only train B. When we do this, we see a sharp reduction in high ranking intruder dimensions in comparison to those in normal LoRA

This sounds interesting, but I can't see that they do much with this result. Are they saving it for a follow-up paper? I would think that if their whole paper is about a big problem with LoRAs, and they then find what looks like an easy solution to that problem, it would warrant more than a paragraph just before the conclusion.
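For reference, the setup described in that quote would look roughly like this in PyTorch (a sketch of the quoted idea, not the authors' code; the shapes, rank, and scaling are made-up defaults):

    import torch
    import torch.nn as nn

    class FrozenALoRALinear(nn.Module):
        """LoRA layer where A is frozen with all singular values equal to 1
        and only B is trained (a sketch of the quoted experiment)."""

        def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False
            d_out, d_in = base.weight.shape
            # Orthonormal rows => all singular values of A are exactly 1.
            a = torch.linalg.qr(torch.randn(d_in, rank)).Q.T   # (rank, d_in)
            self.A = nn.Parameter(a, requires_grad=False)      # frozen
            self.B = nn.Parameter(torch.zeros(d_out, rank))    # trained, init 0
            self.scale = alpha / rank

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)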

It would also have been interesting if they had included the DoRA method; they reference it briefly, and that paper claims to resemble fine-tuning learning behavior.

But perhaps this paper is focused on LoRA behavior, and a separate paper comparing various improvements is better.

K0balt · 11/08/2024

So, in layman’s terms, LoRA appears to “traumatize” the model to some degree, connecting the vector space with strong “jumpers” (intruder dimensions) to change its behavior, instead of subtly conforming the entire model into a shape that accommodates the new data.

These jumpers or shortcuts do create connections between the relevant new concepts in the model, but by directly connecting them instead of associating them through the existing network of concepts, nuance is lost and the bypassed areas become deemphasized, leading to forgetting of previously held associations.

Because of this, fine-tuning generally produces better results than LoRA, especially when forgetting of existing training is detrimental.

Or, to further oversimplify the issue in SE terms, LoRA == monkeypatching. (Is this a kind of intruder dimension?)
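If I understand the paper correctly, these intruder dimensions are detected by taking the SVD of the fine-tuned weight matrix and flagging singular vectors that don't resemble any singular vector of the pre-trained weights. Roughly something like this sketch (the threshold and k here are illustrative, not the paper's exact settings):

    import torch

    def count_intruder_dimensions(w_pre: torch.Tensor, w_ft: torch.Tensor,
                                  k: int = 10, threshold: float = 0.6) -> int:
        """Count top-k singular vectors of the fine-tuned weights whose cosine
        similarity to every pre-trained singular vector is below the threshold."""
        u_pre, _, _ = torch.linalg.svd(w_pre, full_matrices=False)
        u_ft, _, _ = torch.linalg.svd(w_ft, full_matrices=False)
        # Similarity of each new left singular vector to all pre-trained ones.
        sims = (u_ft[:, :k].T @ u_pre).abs()            # (k, r)
        return int((sims.max(dim=1).values < threshold).sum())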

deskr · 11/08/2024

What an unfortunate choice of name. LoRa is already a big project.

viktour19 · 11/08/2024

> LoRA and full fine-tuning, with equal performance on the fine-tuning task, can have solutions with very different generalization behaviors outside the fine-tuning task distribution.

The ability of neural nets to generalize is inherently tied to their trainable parameter count via mechanisms we don't understand, but we know parameter count is key. When you fine-tune with LoRA, you're updating maybe 5% of the parameters; I really don't think there is an illusion of equivalence in the field.
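For a rough sense of scale, here is a back-of-the-envelope comparison of trainable parameters for a single weight matrix (illustrative numbers only; the "5%" above depends on the rank and on which modules get adapters):

    # Fraction of parameters a LoRA of a given rank trains for one d_out x d_in matrix.
    def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
        full = d_in * d_out              # parameters updated by full fine-tuning
        lora = rank * (d_in + d_out)     # parameters in the low-rank factors B and A
        return lora / full

    # e.g. a 4096x4096 projection at rank 16 trains about 0.8% as many parameters
    print(f"{lora_fraction(4096, 4096, 16):.3%}")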

Eisenstein · 11/08/2024

Is this just spelling out what has been known: that LoRAs skew heavily toward the new training, are not 'more intelligent' but rather 'more targeted', and become less intelligent the more they are targeted? Or is this proposing something else? I am having a difficult time understanding exactly what 'intruder dimensions' are.

Der_Einzige · 11/08/2024

This paper seems dubious, because it flies in the face of what the ReFT/pyreft paper shows (you can use 0.0001% of the parameters, trained for 100 epochs, to personalize on a small dataset):

https://github.com/stanfordnlp/pyreft

https://arxiv.org/abs/2404.03592

Note that the OP paper is not peer-reviewed yet, and while the one I linked isn't either, it has Christopher Manning (yes, the one you know from YouTube), the head of AI at Stanford, as a co-author.

In general, I think that LoRA and especially ReFT should be more resistant to catastrophic forgetting, since they literally don't touch most of the model.

The Stable Diffusion community has literally tens of thousands of LoRAs that don't cripple a model at small rank.

blacklion · 11/08/2024

Each time I see "LoRA" in a title I want to click it, until I realize that it is about DNNs and not LoRa long-range radio modulation.

danielhanchen · 11/08/2024

TLDR:

1. Use alpha = 2*rank (a config sketch follows below)

2. Don't use too small ranks (rank=1 to 8)

3. Sensational title. Better title: "LoRA works if done right"

4. Didn't test SVD init
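A minimal config sketch for point 1 using Hugging Face's peft library (the rank, target module names, and other hyperparameters here are assumptions for illustration, not settings from the paper):

    from peft import LoraConfig

    # alpha = 2 * rank, and rank kept comfortably above 8 (points 1 and 2 above).
    rank = 16
    config = LoraConfig(
        r=rank,
        lora_alpha=2 * rank,                                      # alpha = 2 * rank
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
        lora_dropout=0.0,
    )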

idorosen · 11/09/2024

Jacob Andreas is one of the smartest people I’ve ever met.