I don't understand the analogy.
If I'm using an MCMC algorithm to sample a probability distribution, I need to wait for my Markov chain to converge to a stationary distribution before sampling, sure.
But in no way is 'a good answer' a stationary state of the LLM Markov chain. If I keep running next-token prediction past the answer, I'm not going to start looping.
I think you're confusing the sampling process, and the convergence of those samples, with the warm-up process (also called 'burn-in') in HMC. When doing HMC (or MCMC generally) we typically don't start sampling right away (or, more precisely, we throw those early samples out), because we may have initialized the sampler in a region of pretty low probability density. After the chain has run a while, it tends to end up sampling from the typical set which, especially in high-dimensional distributions, more correctly represents the distribution we actually want to integrate over.
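To make the warm-up point concrete, here's a minimal sketch. It uses plain random-walk Metropolis rather than HMC, and the target, step size, initialization, and burn-in length are all arbitrary choices of mine, but the logic of discarding early samples is the same:

    import numpy as np

    rng = np.random.default_rng(0)

    # Target: standard normal log-density (up to a constant).
    def log_p(x):
        return -0.5 * x**2

    x = 50.0  # deliberately bad initialization, far from the typical set
    samples = []
    for step in range(20_000):
        proposal = x + rng.normal(scale=1.0)  # symmetric random-walk proposal
        # Metropolis accept/reject in log space
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal
        samples.append(x)

    burn_in = 2_000  # discard the warm-up samples
    kept = np.array(samples[burn_in:])
    print(f"mean of all samples: {np.mean(samples):+.3f}")  # biased by the bad start
    print(f"mean after burn-in:  {kept.mean():+.3f}")       # close to the true mean, 0

The early samples are dominated by the trek in from x=50, which is exactly the transient we throw away.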
So for language: when I say "Bob has three apples, Jane gives him four, and Judy takes two; how many apples does Bob have?", we're actually pretty far from the part of the linguistic manifold where the correct answer is likely to be. As the chain wanders this space it gets closer, until it finally, statistically, follows the path "the answer is...", and once it's sampling along that path it's in a much more likely neighborhood of the correct answer. That is, after wandering a bit, more and more of the possible paths are closer to where the actual answer lies than they would be if we had just forced the model to choose early.
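If it helps, here's a deliberately crude numerical caricature of that last point. A drifting random walk stands in for the model's state, and I've placed the "correct-answer neighborhood" at +5 purely by assumption; nothing here is a real language model. Forcing an "answer" after fewer wandering steps is measurably worse:

    import numpy as np

    rng = np.random.default_rng(1)

    def answer_accuracy(k, trials=5_000):
        # State of the toy chain; the correct-answer neighborhood sits near +5.
        x = np.zeros(trials)
        for _ in range(k):
            # Each "reasoning step" drifts the state toward the answer, noisily.
            x += 0.3 * (5.0 - x) + rng.normal(size=trials)
        # "Answering now" = one final noisy sample from the current state.
        final = x + rng.normal(size=trials)
        return np.mean(np.abs(final - 5.0) < 1.5)

    for k in (0, 2, 5, 10, 20):
        print(f"{k:>2} wandering steps -> P(correct) ~ {answer_accuracy(k):.2f}")

Accuracy climbs with the number of wandering steps and then plateaus, which is the "let it wander before committing" intuition in miniature.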
edit: Michael Betancourt has a great introduction to HMC that covers warm-up and the typical set: https://arxiv.org/pdf/1701.02434 (he has a ton more content that dives much more deeply into the specifics)