In RNNs and Transformers we obtain the probability distribution over the target variable directly and sample from it using methods like top-k or temperature sampling.
I don't see the equivalence to MCMC. It's not as if we have an intractable probability function that we're trying to sample from by running a chain.
It's just multinomial logistic regression (a softmax over the vocabulary) at each step.
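For concreteness, here is a minimal sketch of the per-step sampling described above (softmax over logits, with temperature and optional top-k); the function name and logits are illustrative, not from any particular library:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample one token id from raw logits using temperature and optional top-k."""
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None:
        # Mask out everything outside the k highest-scoring tokens.
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits >= kth, logits, -np.inf)
    # Softmax turns the (masked) logits into a probability distribution.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, -1.0]
token = sample_token(logits, temperature=0.8, top_k=2)
# With top_k=2 only the two largest logits survive, so token is 0 or 1.
```

This is exactly one categorical draw: a single softmax, a single sample, no chain involved yet.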
Right, but you're describing sampling a single token, which corresponds to one transition of a Markov chain. When generating output you repeat this process and update your state sequentially, and that is exactly a Markov chain: the next token depends only on the current state (the context, or the embedding that represents it) and is conditionally independent of the earlier history given that state.
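The repeated sample-then-update-state loop can be sketched with a toy stand-in for the model, where the "state" is just the last token so the Markov property is literal (the transition table and tokens below are invented for illustration; in a real Transformer the state is the full context):

```python
import random

# Toy next-token distributions keyed by the current state. In a real LLM the
# state would be the whole context (or its embedding), but the structure is
# the same: the next token is drawn conditioned only on the current state.
transitions = {
    "<s>": {"the": 0.7, "a": 0.3},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def generate(rng, max_steps=10):
    state, output = "<s>", []
    for _ in range(max_steps):
        dist = transitions[state]
        # One Markov transition: sample the next token given only `state`.
        state = rng.choices(list(dist), weights=list(dist.values()))[0]
        if state == "</s>":
            break
        output.append(state)
    return output

print(generate(random.Random(0)))  # a three-token sequence ending in 'sat'
```

Each iteration is the single-token sampling step from the previous comment; chaining them is what makes the whole generation a Markov chain.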
In that sense, every response from an LLM is a sample path of a Markov chain.