Hacker News

cheald · 11/08/2024 · 1 reply · view on HN

I've done a lot of tinkering with the internals of LoRA training, specifically investigating why fine-tuning and LoRA training produce such different results. I'm no academic, but I have found that there are definitely some issues with the SOTA, at least WRT Stable Diffusion.

I've had significant success with alternate init mechanisms (the standard technique of init'ing B to zeros really does hurt gradient flow), with training alpha as a separate parameter (especially if you bootstrap the process with alphas learned from a previous run), and with altering the per-layer learning rates (because applying the learning rate to both factors, (lr * B) @ (lr * A), produces an update of a fundamentally different magnitude than the fine-tune update lr * ΔW = lr * (B @ A)).
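To make that concrete, here's a minimal PyTorch sketch of those three tweaks: a small random init for B instead of zeros, alpha as a trainable parameter, and separate learning rates for A, B, and alpha via optimizer parameter groups. The module name, init scale, and learning rates are illustrative assumptions, not the exact code behind the results above.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # pretrained weight stays frozen

            in_f, out_f = base.in_features, base.out_features
            self.A = nn.Parameter(torch.empty(rank, in_f))
            self.B = nn.Parameter(torch.empty(out_f, rank))
            nn.init.kaiming_uniform_(self.A, a=5 ** 0.5)
            # Non-zero init for B: keeps B @ A near zero in expectation,
            # but gives B a usable gradient from the first step.
            nn.init.normal_(self.B, std=1e-4)

            # Alpha as a learnable scalar (could be bootstrapped from a prior run).
            self.alpha = nn.Parameter(torch.tensor(alpha))
            self.rank = rank

        def forward(self, x):
            scale = self.alpha / self.rank
            return self.base(x) + scale * (x @ self.A.T @ self.B.T)

    # Per-matrix learning rates, so the effective magnitude of the B @ A update
    # can be tuned toward what a full fine-tune step would apply to W.
    lora = LoRALinear(nn.Linear(768, 768))
    optimizer = torch.optim.AdamW([
        {"params": [lora.A], "lr": 1e-4},
        {"params": [lora.B], "lr": 1e-3},      # assumed: higher LR on B
        {"params": [lora.alpha], "lr": 1e-5},  # assumed: small LR on alpha
    ])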

In the context of Stable Diffusion specifically, there's also some really pathological stuff that happens when training text encoders alongside the UNet. For SD-1.5, the norm of "good" embeddings settles right around 28.0, but the model learns that it can reduce loss by pushing the embeddings away from that value. However, this comes at the cost of de-generalizing your outputs! Adding a second loss term which penalizes the network for drifting away from the L1 norm of the untrained embeddings for a given text substantially reduces the "insanity" tendencies. There's a more complete writeup at https://github.com/kohya-ss/sd-scripts/discussions/294#discu...
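A minimal sketch of that penalty, assuming you keep a frozen copy of the text encoder around for reference: compare the per-token L1 norm of the trained encoder's embeddings against the frozen encoder's, and penalize the drift. The weight and reduction are illustrative assumptions.

    import torch

    def embedding_norm_penalty(trained_emb: torch.Tensor,
                               frozen_emb: torch.Tensor,
                               weight: float = 0.01) -> torch.Tensor:
        """trained_emb, frozen_emb: (batch, seq_len, dim) text-encoder outputs."""
        trained_norm = trained_emb.abs().sum(dim=-1)       # L1 norm per token
        with torch.no_grad():
            target_norm = frozen_emb.abs().sum(dim=-1)     # reference from untrained encoder
        return weight * (trained_norm - target_norm).abs().mean()

    # total_loss = diffusion_loss + embedding_norm_penalty(cond_emb, frozen_cond_emb)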

You also have the fact that the current SOTA training tools just straight up don't train some layers that fine-tunes do.

I do think there's a huge amount of ground to be gained in diffusion LoRA training, but most of the existing techniques work well enough that people settle for "good enough".


Replies

doctorpangloss · 11/08/2024

Most people are using LoRAs as a solution for IP transfer.

Thing is, Ideogram v2 has already achieved IP transfer without fine-tuning or adapters. So we know those aren't needed.

Is Ideogram v2 an exotic architecture? No, I don't think so.

Are there exotic architectures that will solve IP transfer and other tasks? Yes: the Chameleon and OmniGen architectures. Lots of expertise went into SD3 and Flux dataset prep, but the multimodal architectures are so much more flexible and expressive.

Flow matching models are maybe the last we will see before multi-modal goes big.

What to make of things in the community? How is it possible that random hyperparameters and 30 minute long fine tunings produce good results?

(1) Dreambooth effect: if it's, like, a dog, you won't notice the flaws.

(2) File drawer problem. Nobody publishes the 99 things that didn't work.

(3) Pre-SD3 models struggled with IP transfer on image content that could not possibly have been in their datasets. But laypeople are not doing that. They don't have access to art content that Stability and BFL also don't have access to.

(4) Faces: of course SD family saw celebrity images. Faces are over-represented in its datasets. So yeah, it's going to be good at IP transfer of photographic faces. Most are in-sample.
