Decoder-only LLMs are Markov chains with sophisticated models of the state space. Anyone familiar with Hamiltonian Monte Carlo will know that for good results you need a warm-up period so that you're sampling from the typical set, which is the region where most of the probability mass concentrates (not necessarily where the density or likelihood is highest).
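To make the typical-set point concrete, here's a toy sketch (my own illustration with made-up numbers, not anything from an actual LLM): for long sequences the average surprisal of what you actually sample concentrates around the entropy, while the single maximum-likelihood sequence sits well below it and is essentially never drawn.

    import numpy as np

    # Toy illustration of "typical set vs. maximum likelihood" for i.i.d. draws
    # from a skewed categorical distribution (made-up numbers, not an LLM).
    rng = np.random.default_rng(0)
    p = np.array([0.6, 0.3, 0.1])      # skewed "token" distribution
    H = -np.sum(p * np.log(p))         # entropy in nats
    n = 200                            # sequence length

    # Average surprisal of sampled (i.e. typical) sequences clusters around H ...
    samples = rng.choice(len(p), size=(1000, n), p=p)
    avg_surprisal = -np.log(p[samples]).mean(axis=1)
    print("entropy:", H)
    print("mean surprisal of sampled sequences:", avg_surprisal.mean())

    # ... while the single most likely sequence (the most probable symbol repeated)
    # has a much lower per-symbol surprisal and is essentially never what you draw.
    print("per-symbol surprisal of the max-likelihood sequence:", -np.log(p[0]))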
I have spent a lot of time experimenting with Chain of Thought professionally and I have yet to see any evidence to suggest that what's happening with CoT is any more (or less) than this. If you let the model run a bit longer it enters a region close to the typical set and when it's ready to answer you have a high probability of getting a good answer.
There's absolutely no "reasoning" going on here, except that sometimes sampling from the typical set near the region of your answer is going to look very similar to how humans reason before coming up with an answer.
In RNNs and Transformers we obtain the probability distribution of the target variable directly and sample from it using methods like top-k or temperature sampling.
I don't see the equivalence to MCMC. It's not like we have a complex probability function that we are trying to sample from using a chain.
It's just logistic regression at each step.
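Just to be explicit about what I mean by sampling directly, here's a rough sketch of a single decoding step (illustrative only, not any particular library's API): softmax over the logits, temperature scaling, top-k truncation, one categorical draw.

    import numpy as np

    def sample_next_token(logits, temperature=0.8, top_k=50, rng=None):
        # One decoding step: the model already gives us the distribution,
        # we just reshape it a bit and take a single categorical draw.
        rng = rng or np.random.default_rng()
        scaled = logits / temperature              # temperature scaling
        top = np.argsort(scaled)[-top_k:]          # keep the k highest-logit tokens
        probs = np.exp(scaled[top] - scaled[top].max())
        probs /= probs.sum()                       # softmax over the truncated set
        return int(rng.choice(top, p=probs))

    # Each token is one draw from this per-step distribution; the "state" is
    # simply the growing context, nothing is converging to anything.
    logits = np.random.default_rng(0).normal(size=32000)  # fake vocabulary logits
    print(sample_next_token(logits))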
How does MCMC warm-up fit with LLMs? With an LLM you start from a prompt, so I don't see how "warm-up" applies.
You're not just sampling from the chain the way you would in some MCMC setups.
> If you let the model run a bit longer it enters a region close to the typical set and when it's ready to answer you have a high probability of getting a good answer.
What does "let the model run a bit longer" even mean in this context?
I don't understand the analogy.
If I'm using an MCMC algorithm to sample from a probability distribution, I need to wait for my Markov chain to converge to its stationary distribution before sampling, sure.
But in no way is 'a good answer' a stationary state in the LLM Markov chain. If I continue running next-token prediction, I'm not going to start looping.
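For contrast, here's the sense in which warm-up actually means something in MCMC: a toy Metropolis sampler (made-up target density, nothing to do with transformers) converges to a stationary distribution, and you throw away the burn-in samples drawn before it gets there.

    import numpy as np

    # Plain Metropolis sampler targeting a standard normal (toy example).
    rng = np.random.default_rng(0)
    log_p = lambda x: -0.5 * x**2           # unnormalized log density

    x = 10.0                                 # deliberately bad starting point
    samples = []
    for step in range(5000):
        proposal = x + rng.normal(scale=1.0)
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal                     # accept the move
        samples.append(x)

    burn_in = 1000                           # the "warm-up": discard these
    kept = np.array(samples[burn_in:])
    print("post burn-in mean/std:", kept.mean(), kept.std())  # roughly 0 and 1

    # Next-token generation has no analogous fixed target distribution the chain
    # is converging to: the context just keeps growing with every step.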