Hacker News

stuxnet79 | last Tuesday at 8:13 PM

I'm exactly in the category of AI majors who are not familiar with numerical methods. Can you broadly explain where the gap in AI pedagogy is and how students can fill it?

The series of articles posted here is interesting, and I plan to review them in more detail. But I'm concerned about what the "unknown-unknowns" are.


Replies

programjames | last Tuesday at 9:32 PM

Sure. The original normalizing flows used a fixed number of layers. Someone at UToronto recognized that, as the number of layers gets very large, this is essentially an ordinary differential equation (ODE). Why?

Suppose you have n residual layers that look like:

x_0 = input
x_{i+1} = x_i + f(x_i)
x_n = output

If you replace them with an infinite number of layers, and use "time" t instead of "layer" i, you get

x(t + dt) = x(t) + dt * f(x(t), t) <=> x'(t) = f(x(t), t)
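
To make that concrete, here's a minimal sketch in plain numpy/scipy (the vector field f is a toy stand-in I made up, not anything from the original paper): a stack of residual layers with step size dt is exactly Euler's method for x'(t) = f(x, t), and as the stack gets deeper its output converges to what a generic ODE solver produces.

    import numpy as np
    from scipy.integrate import solve_ivp

    def f(x, t):
        # Toy vector field standing in for a learned residual block.
        return np.tanh(x) * np.cos(t)

    def residual_stack(x, n_layers):
        # x_{i+1} = x_i + dt * f(x_i, t_i): n residual layers with step dt = 1/n.
        dt = 1.0 / n_layers
        for i in range(n_layers):
            x = x + dt * f(x, i * dt)
        return x

    x0 = np.array([0.5, -1.0, 2.0])

    # A generic ODE solver integrating x'(t) = f(x, t) from t = 0 to t = 1.
    ode = solve_ivp(lambda t, x: f(x, t), (0.0, 1.0), x0, rtol=1e-10, atol=1e-12)

    for n in (4, 16, 256):
        err = np.abs(residual_stack(x0, n) - ode.y[:, -1]).max()
        print(f"{n:4d} layers: max deviation from the ODE solution = {err:.1e}")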

so to find the output, you just need to solve an ODE. It gets better! The goal of normalizing flows is to "flow" your probability distribution from a normal distribution to some other (e.g. image) distribution. This is usually done by maximizing the likelihood your model assigns to the training images, i.e.

likelihood(model) = product over training images of p_normal(model^{-1}(image)) * |det J_{model^{-1}}(image)|

Notice how you need the model to be reversible, which is pretty annoying to implement in the finite-layer case, but with some pretty lenient assumptions is guaranteed to be true for an ODE. Also, when you're inverting the model, the probabilities will change according to the derivative; since you have more than one dimension, this means you need to calculate the determinant of the Jacobian for every layer, which is decently costly in the finite-layer case. There are some tricks that can bring this down to O(layer size^2) (Hutchinson++), but the ODE case is trivial to compute (just exp(trace)).
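
Here's a quick numerical check of that exp(trace) claim in the simplest setting I could pick, a linear vector field f(x) = A x (my toy choice, not something from the thread): the flow map over time t is the matrix exponential expm(A t), and its Jacobian determinant, the total volume change you'd otherwise accumulate layer by layer, comes out to exactly exp(t * trace(A)).

    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(0)
    A = 0.3 * rng.normal(size=(4, 4))   # linear vector field: f(x) = A @ x
    t = 1.0

    # The flow map of x'(t) = A x over time t is the matrix exponential.
    flow_jacobian = expm(A * t)

    det_direct = np.linalg.det(flow_jacobian)  # "finite-layer" style: full determinant
    det_trace = np.exp(t * np.trace(A))        # ODE style: just exp(trace)

    print(det_direct, det_trace)  # agree up to floating-point error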

So, turning the model into an ODE makes it blazing fast, and since you can use any ODE solver, you can train at different levels of precision based on the learning rate (i.e. the real log canonical threshold from singular learning theory). I haven't seen any papers that do this exactly, but it's common to use rougher approximations at the beginning of training. Probably the best example of this is the company Liquid AI.
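
On the "any ODE solver, any level of precision" point, here's a rough sketch of the knob being described (toy vector field again, not a real training setup): the same flow can be evaluated loosely or tightly just by changing the solver tolerance, and the amount of work tracks the tolerance.

    import numpy as np
    from scipy.integrate import solve_ivp

    def f(t, x):
        # Toy vector field; in a real model this would be the learned network.
        return np.tanh(x) * np.cos(3 * t)

    x0 = np.array([0.5, -1.0, 2.0])
    reference = solve_ivp(f, (0.0, 1.0), x0, rtol=1e-10, atol=1e-12).y[:, -1]

    # Same "model", different precision: just change the solver tolerance.
    for rtol in (1e-2, 1e-5, 1e-8):
        sol = solve_ivp(f, (0.0, 1.0), x0, rtol=rtol, atol=1e-2 * rtol)
        err = np.abs(sol.y[:, -1] - reference).max()
        print(f"rtol={rtol:.0e}: {sol.nfev:4d} RHS evals, error={err:.1e}")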

Finally, this all turns out to be very similar to diffusion models. Someone realized this, and combined the two ideas into flow-matching.

-----

This is one place it's super useful to know numerical methods, but here are a couple others:

1. Weight initialization --> need to know stability analysis

2. Convolutions --> the Winograd algorithm, which is similar to ideas in the FFT and quadrature
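
To illustrate item 2, here's a minimal sketch of the 1-D Winograd transform F(2,3), using the small transform matrices from the standard Winograd-convolution construction (Lavin & Gray): it computes two outputs of a 3-tap filter with 4 elementwise multiplications instead of 6, and the direct computation at the end is the sanity check.

    import numpy as np

    # Winograd F(2,3): two outputs of a 3-tap filter from 4 inputs,
    # using 4 elementwise multiplications instead of 6.
    G = np.array([[1.0,  0.0, 0.0],
                  [0.5,  0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0.0,  0.0, 1.0]])
    Bt = np.array([[1.0,  0.0, -1.0,  0.0],
                   [0.0,  1.0,  1.0,  0.0],
                   [0.0, -1.0,  1.0,  0.0],
                   [0.0,  1.0,  0.0, -1.0]])
    At = np.array([[1.0, 1.0,  1.0,  0.0],
                   [0.0, 1.0, -1.0, -1.0]])

    def winograd_f23(d, g):
        # y = At @ [(G g) * (Bt d)]: the multiplications happen in the transform domain.
        return At @ ((G @ g) * (Bt @ d))

    d = np.array([1.0, 2.0, 3.0, 4.0])   # 4 input samples
    g = np.array([0.5, -1.0, 2.0])       # 3-tap filter

    print(winograd_f23(d, g))
    # Direct computation of the same two outputs (6 multiplications):
    print(np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                    d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]))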

grandempire | yesterday at 2:11 AM

> Can you broadly explain where the gap in AI pedagogy is and how students can fill it?

Machine learning can be made much more effective and efficient by first modeling the problem and then optimizing that tailored representation. This is an alternative to throwing a bunch of layers of neurons at the problem, or copy-pasting an architecture, and hoping something works out.

One of the most successful applications of ML, the convolutional neural network, is based on this principle. Classical image-processing algorithms come from optical theory and can be modeled with convolution; the question became: what if we used optimization to find those convolution kernels?
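
Here's a toy version of that idea (not how a CNN is actually trained, just the "treat the kernel as the thing being optimized" framing): pick a hand-designed Sobel-like kernel as ground truth, generate input/output pairs with it, and recover the 9 kernel weights by least squares over image patches.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hand-designed 3x3 edge-detection kernel (Sobel-like), used as ground truth.
    true_kernel = np.array([[-1.0, 0.0, 1.0],
                            [-2.0, 0.0, 2.0],
                            [-1.0, 0.0, 1.0]])

    def filter2d_valid(img, k):
        # "Valid" 2-D correlation, the operation a conv layer actually computes.
        h, w = img.shape
        out = np.zeros((h - 2, w - 2))
        for i in range(h - 2):
            for j in range(w - 2):
                out[i, j] = np.sum(img[i:i+3, j:j+3] * k)
        return out

    images = [rng.normal(size=(16, 16)) for _ in range(8)]
    targets = [filter2d_valid(img, true_kernel) for img in images]

    # "Use optimization to find the kernel": every output pixel is linear in the
    # 9 kernel weights, so stack the patches as rows and solve least squares.
    rows, ys = [], []
    for img, tgt in zip(images, targets):
        for i in range(14):
            for j in range(14):
                rows.append(img[i:i+3, j:j+3].ravel())
                ys.append(tgt[i, j])
    learned, *_ = np.linalg.lstsq(np.array(rows), np.array(ys), rcond=None)

    print(learned.reshape(3, 3).round(3))   # recovers the Sobel-like kernel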

With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

Also you need to know when a problem is NOT optimization - for example solving equations via the bisection method.
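
For reference, a textbook bisection routine (generic, not tied to the article's examples): it only needs a sign change, halves the bracket every iteration, and converges with no gradients or loss function in sight.

    def bisect(f, lo, hi, tol=1e-12):
        # Find a root of f in [lo, hi]; f(lo) and f(hi) must have opposite signs.
        if f(lo) * f(hi) > 0:
            raise ValueError("f(lo) and f(hi) must bracket a root")
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(lo) * f(mid) <= 0:
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2

    # Example: solve x^3 = 2 on [1, 2]; no gradients, no loss function.
    print(bisect(lambda x: x**3 - 2, 1.0, 2.0))   # ~1.2599210498948732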

constantcrying | last Tuesday at 8:38 PM

>But I'm concerned about what the "unknown-unknowns" are.

Try the examples in the article with the interval 0 to 10.
