Sure. The original normalizing flows used a fixed number of layers. Someone at UToronto (the 2018 neural ODE paper) recognized that, as the number of layers grows, this becomes essentially an ordinary differential equation (ODE). Why?
Suppose you have n residual layers that look like:
x_0 = input
x_{i+1} = x_i + f(x_i)
x_n = output
If you let the number of layers go to infinity (shrinking each step accordingly), and use "time" t instead of "layer" i, you get
x(t + dt) = x(t) + dt * f(x(t))   <=>   x'(t) = f(x(t), t)
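
Here's a toy version of that equivalence (the dynamics f, the width, and the step counts are all made up for illustration): treating each of n residual layers as one Euler step of size 1/n, the stack converges to what an off-the-shelf ODE solver computes for the same f.

    import numpy as np
    from scipy.integrate import solve_ivp

    # Made-up dynamics standing in for a trained residual block's f.
    # (In a real network f would be a small neural net; this is just for illustration.)
    def f(x):
        return np.tanh(x) - 0.5 * x

    x0 = np.array([2.0, -1.0, 0.5])

    # "n residual layers" = n explicit Euler steps of size dt = 1/n over t in [0, 1].
    def residual_stack(x, n_layers):
        dt = 1.0 / n_layers
        for _ in range(n_layers):
            x = x + dt * f(x)          # x_{i+1} = x_i + dt * f(x_i)
        return x

    # The same dynamics handed to an off-the-shelf adaptive ODE solver.
    reference = solve_ivp(lambda t, x: f(x), (0.0, 1.0), x0,
                          rtol=1e-10, atol=1e-12).y[:, -1]

    for n in (2, 8, 32, 128):
        err = np.linalg.norm(residual_stack(x0, n) - reference)
        print(f"{n:4d} layers -> distance from the ODE solution: {err:.2e}")
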
So to find the output, you just need to solve an ODE. It gets better! The goal of normalizing flows is to "flow" a normal distribution into some other (e.g. image) distribution. Training usually means maximizing the likelihood your model assigns to the training images, i.e.
likelihood(model) = product over training images of p_normal(model^{-1}(image))
Notice how you need the model to be invertible, which is pretty annoying to enforce in the finite-layer case, but is guaranteed for an ODE under pretty lenient assumptions (roughly, f Lipschitz: you can always run the flow backwards in time). Also, when you invert the model, the probability density changes according to the derivative; since you have more than one dimension, that means computing the determinant of the Jacobian of every layer, which is costly in the finite-layer case: naively O(layer size^3), and trace-estimation tricks (Hutchinson, Hutch++) bring it down to roughly O(layer size^2). The ODE case is much cheaper: the log-density evolves as d log p / dt = -tr(df/dx), so you just integrate a trace along with the state (the density correction is exp of the integrated trace) and never compute a determinant.
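
Here's a minimal sketch of that trace trick, with a made-up f whose Jacobian trace is easy to write by hand (a real model would get it from autograd, usually with a Hutchinson estimator): push a normal sample through the flow with Euler steps and integrate d log p / dt = -tr(df/dx) alongside it.

    import numpy as np

    rng = np.random.default_rng(0)
    D = 4
    W = 0.3 * rng.standard_normal((D, D))   # made-up fixed "weights" for the dynamics

    # Dynamics of the flow: dx/dt = f(x) = tanh(W x)
    def f(x):
        return np.tanh(W @ x)

    # Trace of the Jacobian of f, written analytically for this toy f:
    # df_i/dx_j = (1 - tanh(Wx)_i^2) * W_ij, so the trace only needs diag(W).
    def trace_jac(x):
        return float(np.sum((1.0 - np.tanh(W @ x) ** 2) * np.diag(W)))

    def log_normal(z):
        return float(-0.5 * (z @ z) - 0.5 * D * np.log(2 * np.pi))

    # Push a base sample forward with Euler steps, integrating
    # d log p / dt = -tr(df/dx) alongside the state.
    z = rng.standard_normal(D)
    x, logp = z.copy(), log_normal(z)
    n_steps, dt = 1000, 1.0 / 1000
    for _ in range(n_steps):
        logp -= dt * trace_jac(x)
        x = x + dt * f(x)

    print("sample after the flow:", x)
    print("log-density the flow assigns to it:", logp)

Running the same integration backwards from a data point gives you that point's log-likelihood, which is exactly the training objective above.
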
So, turning the model into an ODE makes it blazing fast, and since you can use any off-the-shelf ODE solver, you can train at different levels of precision, e.g. tied to the learning rate (or, more speculatively, to quantities like the real log canonical threshold from singular learning theory). I haven't seen any papers that do this exactly, but it's common to use rougher approximations at the beginning of training. Probably the best example of this is the company Liquid AI.
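
To be clear, the schedule below is made up purely for illustration; the point is just that the precision knob is the tolerance you hand to an off-the-shelf solver, which you could loosen early in training and tighten later.

    from scipy.integrate import solve_ivp

    def f(t, x):
        return -x  # stand-in dynamics; a real model's vector field goes here

    # Hypothetical schedule: loose tolerances early in training, tight ones later.
    for epoch, rtol in [(0, 1e-2), (50, 1e-4), (200, 1e-7)]:
        sol = solve_ivp(f, (0.0, 1.0), [1.0], rtol=rtol, atol=rtol * 1e-2)
        print(f"epoch {epoch:3d}: rtol={rtol:.0e}, solver used {sol.t.size} time points")
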
Finally, this all turns out to be very similar to diffusion models. Someone realized this and combined the two ideas into flow matching.
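
In its simplest (straight-line interpolation) form the flow-matching training loop is surprisingly small. Here's a sketch with a made-up toy dataset and hyperparameters, where the network learns to predict the velocity that carries noise to data:

    import torch

    # Tiny vector field v(x, t); a real model would be much bigger.
    net = torch.nn.Sequential(
        torch.nn.Linear(3, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2)
    )
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    def sample_data(n):
        # made-up 2-D "data" distribution
        return torch.randn(n, 2) * 0.3 + torch.tensor([2.0, -1.0])

    for step in range(1000):
        x1 = sample_data(256)             # data samples
        x0 = torch.randn_like(x1)         # noise samples
        t = torch.rand(x1.shape[0], 1)    # random times in [0, 1]
        xt = (1 - t) * x0 + t * x1        # point on the straight path
        target = x1 - x0                  # velocity of that path
        pred = net(torch.cat([xt, t], dim=1))
        loss = ((pred - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # At sampling time you'd integrate dx/dt = net(x, t) from t=0 to 1,
    # e.g. with the Euler loop from earlier.
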
-----
This is one place it's super useful to know numerical methods, but here are a couple others:
1. Weight initialization --> need to know stability analysis (toy demo below)
2. Convolutions --> the Winograd algorithm, which is similar to ideas in the FFT and quadrature (sketch below)
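
On item 1, here's a toy demonstration of the stability issue (width, depth, and nonlinearity are arbitrary): with a naive unit-variance init the activations blow up layer by layer, while He-style scaling keeps their size roughly constant.

    import numpy as np

    rng = np.random.default_rng(0)
    width, depth = 256, 20
    x0 = rng.standard_normal(width)

    def run(std):
        x = x0.copy()
        for _ in range(depth):
            W = rng.standard_normal((width, width)) * std
            x = np.maximum(W @ x, 0.0)   # ReLU layer
        return x.std()

    print("naive init (std=1):       activation std =", run(1.0))
    print("He init (std=sqrt(2/n)):  activation std =", run(np.sqrt(2.0 / width)))
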
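On item 2, here's a minimal sketch of the Winograd F(2,3) identity: two outputs of a 3-tap filter from 4 multiplications instead of 6, which is the building block behind fast small-kernel convolutions.

    import numpy as np

    def winograd_f23(d, g):
        """Two outputs of sliding a 3-tap filter g over d = [d0, d1, d2, d3]
        (correlation, as in CNN convolutions), using 4 multiplications
        (Winograd F(2,3)) instead of the naive 6."""
        m1 = (d[0] - d[2]) * g[0]
        m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
        m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
        m4 = (d[1] - d[3]) * g[2]
        return np.array([m1 + m2 + m3, m2 - m3 - m4])

    d = np.array([1.0, 2.0, -1.0, 3.0])
    g = np.array([0.5, -2.0, 1.5])

    naive = np.array([d[0:3] @ g, d[1:4] @ g])   # the straightforward 6-multiply version
    print("winograd:", winograd_f23(d, g))
    print("naive:   ", naive)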