> LoRA and full fine-tuning, with equal performance on the fine-tuning task, can have solutions with very different generalization behaviors outside the fine-tuning task distribution.
The ability of neural nets to generalize is inherently tied to their trainable parameter count, via mechanisms we don't fully understand, but we know parameter count is key. When you fine-tune with LoRA, you're updating maybe 5% of the parameters; I really don't think there is an illusion of equivalence in the field.
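Back-of-envelope sketch of the scale involved (the hidden size and rank below are illustrative assumptions, not figures from this thread): for a single d×d weight matrix adapted at rank r, LoRA adds 2·d·r trainable parameters alongside d² frozen ones.

```python
# Back-of-envelope: trainable-parameter fraction for a LoRA adapter on one
# square weight matrix. d and r are illustrative assumptions, not canonical.
d = 4096               # hidden size of the adapted layer (assumed)
r = 8                  # LoRA rank (assumed)

frozen = d * d         # pretrained weights, left untouched
trainable = 2 * d * r  # the two low-rank factors: A is (r, d), B is (d, r)

print(f"trainable fraction: {trainable / frozen:.4%}")  # ~0.39% at these sizes
```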
More about magnitude than count [1], I think, but I haven't kept up in a while.
[1] https://proceedings.neurips.cc/paper_files/paper/1996/file/f...
Well, I think it depends on who you talk to. I suspect quite a few practitioners (as opposed to researchers) regard LoRA as a valid shortcut without fully considering the difference.
> When you fine-tune with LoRA, you're updating maybe 5% of the parameters
I'm not sure I understand this comment. The LoRA paper[1] specifically says that all of the pretrained weights remain frozen.
> keeping the pre-trained weights frozen
Specifically, the LoRA paper differentiates itself from approaches that update only some of the existing parameters, stating:
> Many sought to mitigate this by adapting only some parameters or learning external modules for new tasks.
[1] https://arxiv.org/pdf/2106.09685
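For concreteness, a minimal sketch of that distinction in PyTorch (the layer size, rank, and scaling below are illustrative assumptions): the pretrained weight stays frozen, and only the newly added low-rank matrices A and B receive gradients, so the parameters being trained are new ones rather than a slice of the original weights.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style wrapper: freeze a pretrained nn.Linear and add a
    trainable low-rank update (B @ A) to its output."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # pretrained weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Newly added parameters -- these are the only ones that train.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen pretrained path plus the trainable low-rank correction.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} ({trainable / total:.2%})")
```

At these assumed sizes the trainable fraction comes out to roughly 0.4%, since only the A/B factors carry gradients while the original weight matrix is never touched.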