
K0balt 11/08/2024

Yeah, it reflects the “feel” I get from LoRA as well, especially if I overdo it. The new data becomes the preferred output even for unrelated inputs. I always felt like it was bludgeoning the model to some extent, vs. fine-tuning.

Also, LoRA-tuning an extensively tuned model occasionally provokes full-on delusional “insanity” or gibberish seizures.

I have had really good luck, though, with using a highly tuned model as the training basis for a LoRA and then applying that LoRA mask to the base version of that model. I’m not sure why that seems to work better than training the same LoRA directly on the base model.
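Mechanically that workflow is just weight arithmetic; here's a minimal PyTorch sketch of the idea (names like merge_lora and the checkpoint paths are mine, not any particular tool's), where the low-rank delta learned against the tuned model gets added to the base checkpoint's weights instead:

    import torch

    def merge_lora(base_state, lora_factors, scale=1.0):
        # lora_factors maps a weight name to its low-rank pair (A, B).
        # The merged weight is W + scale * (B @ A), and nothing ties the
        # delta to the checkpoint it was trained against -- so a LoRA
        # learned on a heavily tuned model can be merged into the untuned
        # base weights instead.
        merged = dict(base_state)
        for name, (A, B) in lora_factors.items():
            merged[name] = base_state[name] + scale * (B @ A)
        return merged

    # hypothetical usage:
    # base_sd = torch.load("base.ckpt", map_location="cpu")["state_dict"]
    # merged  = merge_lora(base_sd, lora_factors, scale=0.8)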


Replies

cheald 11/08/2024

I've done a lot of tinkering with the internals of LoRA training, specifically investigating why fine-tuning and LoRA training produce such different results. I'm no academic, but I have found that there are definitely some issues with the SOTA, at least WRT Stable Diffusion.

I've had significant success with alternate init mechanisms (the standard technique of init'ing B to zeros really does hurt gradient flow), training alpha as a separate parameter (and especially if you bootstrap the process with alphas learned from a previous run), and altering the per-layer learning rates (because (lr * B) @ (lr * A) produces an update of a fundamentally different magnitude than the equivalent fine-tune update to W, which scales as lr * (B @ A)).
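To make that concrete, here's a rough sketch of the kind of LoRA module I mean, in PyTorch. The init values and names are illustrative only, not what any trainer actually ships: B gets small nonzero values instead of zeros, and alpha is a trainable parameter rather than a fixed hyperparameter.

    import torch
    from torch import nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)  # frozen pretrained weight
            # Alternate init: both factors get small nonzero values instead of
            # the usual B = 0, so gradients flow through both from step one.
            # (The trade-off is that the delta isn't exactly zero at init.)
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.randn(base.out_features, rank) * 0.01)
            # Alpha as a learnable parameter, optionally bootstrapped from a
            # previous run's learned alphas.
            self.alpha = nn.Parameter(torch.tensor(float(alpha)))
            self.rank = rank

        def forward(self, x):
            scale = self.alpha / self.rank
            return self.base(x) + (x @ self.A.t() @ self.B.t()) * scale

The per-layer learning rate part then just comes down to putting the A/B parameters in their own optimizer param groups with a rescaled lr, to compensate for the magnitude mismatch described above.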

In the context of Stable Diffusion specifically, as well, there's some really pathological stuff that happens when training text encoders alongside the unet; for SD-1.5, the norm of "good" embeddings settles right around 28.0, but the model learns that it can reduce loss by pushing the embeddings away from that value. However, this comes at the cost of de-generalizing your outputs! Adding a second loss term which penalizes the network for drifting away from the L1 norm of the untrained embeddings for a given text substantially reduces the "insanity" tendencies. There's a more complete writeup at https://github.com/kohya-ss/sd-scripts/discussions/294#discu...
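Roughly, that second loss term looks like the sketch below; the function name and the penalty weight are mine, and the real details are in the linked writeup. The idea is to penalize drift in the L1 norm of the trained text encoder's embeddings relative to the untrained encoder's embeddings for the same text, rather than pinning the embedding values themselves.

    import torch

    def embedding_drift_penalty(emb: torch.Tensor, ref_emb: torch.Tensor,
                                weight: float = 0.01) -> torch.Tensor:
        # emb:     embeddings from the text encoder being trained
        # ref_emb: embeddings from the frozen, untrained text encoder
        # Penalize the difference between their per-token L1 norms.
        drift = (emb.abs().sum(dim=-1) - ref_emb.abs().sum(dim=-1)).abs()
        return weight * drift.mean()

    # total_loss = diffusion_loss + embedding_drift_penalty(emb, ref_emb)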

You also have the fact that the current SOTA training tools just straight up don't train some layers that fine-tunes do.

I do think there's a huge amount of ground to be gained in diffusion LoRA training, but most of the existing techniques work well enough that people settle for "good enough".
