The warmup process is necessary because the sampler has to find the high-probability regions of the target distribution. That's not an issue for an LLM, since it's trained so that sampling directly from its output distribution produces something that looks like natural language.
There is some work on using MCMC to sample from higher-probability regions of an LLM's distribution [1], but that's a separate thing. Nobody doubts that an LLM samples from its target distribution from the very first token it outputs.
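
To make the contrast concrete, here's a minimal Python sketch of ancestral sampling, i.e. what an LLM does at inference time. The names `VOCAB`, `next_token_probs`, and `ancestral_sample` are illustrative placeholders (a real LLM's conditional comes from a forward pass over the prefix), but the structure of the loop is the point: each token is drawn exactly from the model's conditional, so no warmup/burn-in phase is needed.

```python
import numpy as np

rng = np.random.default_rng(0)  # randomness used only for the actual draws
VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]


def next_token_probs(prefix):
    """Stand-in for p(x_t | x_<t). A real LLM computes this with a forward
    pass over the prefix; here it's a fixed, deterministic placeholder that
    depends only on the prefix length."""
    local = np.random.default_rng(len(prefix))
    logits = local.standard_normal(len(VOCAB))
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def ancestral_sample(max_len=10):
    """Draw a sequence token by token (ancestral sampling).

    Each token is drawn exactly from p(x_t | x_<t), so the finished sequence
    is an exact draw from the model's joint distribution p(x_1, ..., x_T).
    There is no burn-in: the very first sequence produced is already a
    sample from the target distribution."""
    tokens = []
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        tok = str(rng.choice(VOCAB, p=probs))
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens


print(ancestral_sample())
```

Contrast this with MCMC, where the early iterations of the chain are discarded precisely because they may not yet be representative of the target distribution.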