Hacker News

crystal_revenge · today at 4:48 AM · 3 replies

I think you're confusing the sampling process, and the convergence of those samples, with the warm-up process (also called 'burn-in') in HMC. When doing HMC MCMC we typically don't start collecting samples right away (or, more precisely, we throw those early samples out) because we may be initializing the sampler in a region of the distribution with very low probability density. After the chain has run a while it tends to end up sampling from the typical set, which, especially for high-dimensional distributions, more faithfully represents the distribution we actually want to integrate over.
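(For concreteness, here's a toy HMC sampler on a standard Gaussian showing the warm-up/discard pattern. The target, step size, trajectory length, and warm-up count are all arbitrary illustrative choices, nothing LLM-specific:)

    import numpy as np

    # Toy HMC on a d-dimensional standard Gaussian.
    # log p(x) = -0.5 * x.x up to a constant, so grad log p(x) = -x.
    def log_prob(x):
        return -0.5 * np.dot(x, x)

    def grad_log_prob(x):
        return -x

    def hmc_step(x, step_size=0.1, n_leapfrog=20, rng=np.random):
        p = rng.standard_normal(x.shape)  # resample momentum
        x_new, p_new = x.copy(), p.copy()
        # Leapfrog integration of the Hamiltonian dynamics.
        p_new += 0.5 * step_size * grad_log_prob(x_new)
        for _ in range(n_leapfrog - 1):
            x_new += step_size * p_new
            p_new += step_size * grad_log_prob(x_new)
        x_new += step_size * p_new
        p_new += 0.5 * step_size * grad_log_prob(x_new)
        # Metropolis accept/reject on the joint energy H(x, p).
        h_old = -log_prob(x) + 0.5 * np.dot(p, p)
        h_new = -log_prob(x_new) + 0.5 * np.dot(p_new, p_new)
        return x_new if np.log(rng.uniform()) < h_old - h_new else x

    d = 50
    x = np.full(d, 10.0)  # deliberately bad init, far from the typical set
    chain = []
    for _ in range(2000):
        x = hmc_step(x)
        chain.append(x.copy())

    n_warmup = 500
    samples = np.array(chain[n_warmup:])  # throw out the warm-up draws
    # For N(0, I_d) the typical set is a thin shell at radius ~sqrt(d).
    print(np.mean(np.linalg.norm(samples, axis=1)), np.sqrt(d))

Started that far from the origin, the early draws are nowhere near the typical set; after warm-up the chain hovers around the thin shell at radius ~sqrt(d), which is why we only keep the post-warm-up samples.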

So for language: when I say "Bob has three apples, Jane gives him four, and Judy takes two; how many apples does Bob have?", we're actually pretty far from the part of the linguistic manifold where the correct answer is likely to be. As the chain wanders this space it gets closer, until it finally, statistically, follows the path "the answer is...", and while it's sampling along this path it's in a much more likely neighborhood of the correct answer. That is, after wandering a bit, more and more of the possible paths end up close to where the actual answer lies, far more than if we had forced the model to choose early.

edit: Michael Betancourt has a great introduction to HMC that covers warm-up and the typical set: https://arxiv.org/pdf/1701.02434 (he has a ton more content that dives much deeper into the specifics)


Replies

AlexCoventry · today at 8:25 AM

The warm-up process is necessary to find high-probability regions of the target distribution. That's not an issue for an LLM, since it's trained to sample directly from a distribution that looks like natural language.

There is some work on using MCMC to sample from higher-probability regions of an LLM distribution [1], but that's a separate thing. Nobody doubts that an LLM is sampling from its target distribution from the first token it outputs.

[1] https://arxiv.org/abs/2510.14901
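(Toy illustration of why no warm-up is needed: ancestral sampling from a factorized model, here a made-up bigram Markov model standing in for an LLM's next-token distribution, yields an exact sample of the joint from the very first draw:)

    import numpy as np

    # Ancestral sampling from a toy autoregressive model. Each token is
    # drawn exactly from p(x_t | x_<t), so the whole sequence is an exact
    # draw from the joint p(x_1, ..., x_T). No warm-up phase, unlike MCMC.
    rng = np.random.default_rng(0)
    V = 5                                      # made-up vocabulary size
    trans = rng.dirichlet(np.ones(V), size=V)  # trans[i] = p(next | prev=i)
    p0 = rng.dirichlet(np.ones(V))             # distribution of first token

    def sample_sequence(length):
        tok = rng.choice(V, p=p0)  # first token: already an exact sample
        seq = [tok]
        for _ in range(length - 1):
            tok = rng.choice(V, p=trans[tok])  # exact conditional each step
            seq.append(tok)
        return seq

    print(sample_sequence(10))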

coldtea · today at 9:29 AM

> When doing HMC MCMC we typically don't start collecting samples right away (or, more precisely, we throw those early samples out) because we may be initializing the sampler in a region of the distribution with very low probability density.

And how does that apply to LLMs? They don't do MCMC.

dhampi · today at 5:15 AM

No, I still don’t understand the analogy.

All of this burn-in stuff is designed to get your Markov chain to forget where it started.

But I don’t want to get from “how many apples does Bob have?” to a state where Bob and the apples are forgotten. I want to remember that state, and I probably want to stay close to it — not far away in the “typical set” of all language.
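(A toy demo of that "forgetting": run the same arbitrary 3-state transition matrix from two different deterministic starts, and the state distributions become indistinguishable, which is exactly what I don't want to happen to Bob and his apples:)

    import numpy as np

    # Burn-in exists because the chain's state distribution converges to
    # the stationary distribution and "forgets" its start. T is an
    # arbitrary 3-state transition matrix, chosen only for illustration.
    T = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])
    dist_a = np.array([1.0, 0.0, 0.0])  # start deterministically in state 0
    dist_b = np.array([0.0, 0.0, 1.0])  # start deterministically in state 2
    for _ in range(50):
        dist_a = dist_a @ T
        dist_b = dist_b @ T
    print(dist_a)  # ~stationary distribution
    print(dist_b)  # nearly identical: the initial state is forgotten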

Are you implicitly conditioning the probability distribution or otherwise somehow cutting the manifold down? Then the analogy would be plausible to me, but I don’t understand what conditioning we’re doing and how the LLM respects that.

Or are you claiming that we want to travel to the “closest” high-probability region somehow? So we’re not really doing burn-in, but something a little more delicate?