You need to think about 1) the latent state and 2) the fact that part of the model is post-trained to bias the Markov chain towards abiding by the query in the sense of the reward.
One way to look at it is that you effectively have two model "heads" inside the LLM: one that generates, and one that biases/steers.
The MCMC is initialised from your prompt. The generator part samples from the language distribution it has learned, while the sharpening/filtering part biases the chain towards continuations that are likely to end up with high reward. So the model regurgitates all the context that is deemed possibly relevant based on traces from the training data (including "tool use", which then injects additional context), and all of those tokens shift the latent state into something that is more and more typical of your query.
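A minimal sketch of that biased sampling step, assuming a toy setup where the generator and the steering contribution are separate logit vectors that get mixed before sampling (in the real model both live in the same weights, so this split is purely illustrative):

```python
import numpy as np

def sample_next_token(gen_logits, steer_logits, steer_weight=0.5,
                      temperature=1.0, rng=None):
    """Toy illustration: mix the base language-model logits with a
    reward-trained steering signal, then sample one token.

    gen_logits   : what the pretrained generator thinks comes next
    steer_logits : hypothetical bias towards continuations likely to be rewarded
    """
    rng = rng or np.random.default_rng()
    logits = (np.asarray(gen_logits) + steer_weight * np.asarray(steer_logits)) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```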
Importantly, attention acts as a selector and has multiple heads, and these specialize, so (simplified) one head can stay focused on your query and "judge" the latent state, while the rest follow that Markov chain until some subset of the generated and tool-injected tokens gives enough signal to the "answer now" gate that the model flips into "summarizing" mode, which then uses the latent state of all of those tokens to actually generate the answer.
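As a toy picture of that gate (names like `query_vec` and `answer_gate` are made up for illustration, not taken from any real architecture): one head keeps scoring the running latent state against the query, and once that score clears a threshold the model switches from exploring to summarizing.

```python
import numpy as np

def answer_gate(latent_states, query_vec, threshold=0.8):
    """Hypothetical 'answer now' gate: a head attends from the query over
    everything generated so far and asks whether enough evidence has
    accumulated to stop exploring and start summarizing."""
    latent_states = np.asarray(latent_states)   # shape (tokens, dim)
    query_vec = np.asarray(query_vec)           # shape (dim,)
    scores = latent_states @ query_vec          # attention logits over the trace
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    evidence = weights @ latent_states          # query-focused read-out
    # crude signal: cosine similarity between the read-out and the query
    signal = evidence @ query_vec / (np.linalg.norm(evidence) * np.linalg.norm(query_vec) + 1e-9)
    return signal > threshold
```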
So you very much can think of it as repeatedly sampling from an MCMC with a bias, a learned stopping rule, and then a model that assembles the best possible combination of the traces, except that all of this machinery is encoded in the same model weights, which get to reuse features between one another, for all the benefits and drawbacks that yields.
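Putting the pieces together as a sketch (generate_step, stop_rule, and summarize are hypothetical stand-ins here; in the actual model they are not separate functions but the same weights doing triple duty):

```python
def run_trace(prompt_tokens, generate_step, stop_rule, summarize, max_tokens=2048):
    """Toy outer loop: extend the chain with biased samples until a learned
    stopping rule fires, then summarize the accumulated trace into an answer."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        tokens.append(generate_step(tokens))   # biased sample; may include tool output
        if stop_rule(tokens):                  # the "answer now" gate
            break
    return summarize(tokens)                   # use the whole trace to produce the answer
```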
There was a paper from around the time OF became a thing showing that instead of doing CoT, you could just spend that token budget on K parallel shorter queries (by injecting something like "ok, to summarize" or "actually" to force completion) and pick the best one or take a majority vote. Since then RLHF has made longer traces more in-distribution (although another paper showed that, as of early 2025, you were getting reduced variance but giving up peak performance and edge-case coverage in exchange for higher performance on common cases, though this might be ameliorated by now), but that's roughly how it broke down over 2024-2025.
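A sketch of that parallel-short-traces alternative, in the style of self-consistency voting (sample_short_trace and extract_answer are hypothetical stand-ins, not anything from the paper):

```python
from collections import Counter

def parallel_vote(prompt, sample_short_trace, extract_answer, k=8):
    """Toy version of spending the token budget on K short traces instead of
    one long CoT: sample K completions, pull out each final answer,
    and return the majority answer."""
    answers = [extract_answer(sample_short_trace(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```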