It is the other way around. If the data is causal and presented in causal order, it is impossible to beat the loss of a pure auto-regressive model, because the AR factorization can represent the correct probability distribution of the dataset exactly. Language data is mostly causal (words are spoken/written in the context of the words that precede them). Most of the additional signal that diffusion models extract by extremely oversampling the same data should be recoverable with AR models too, via fill-in-the-middle or order-reversal training strategies, and at a significant compute saving during training.
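For the record, a minimal sketch of the standard argument behind that first claim, using the chain rule plus the usual cross-entropy decomposition:

p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t}),
\qquad
\mathbb{E}_{x \sim p}[-\log q(x)] = H(p) + \mathrm{KL}(p \,\|\, q).

Any joint distribution over token sequences factorizes exactly into next-token conditionals, so a sufficiently expressive AR model can represent p itself; the expected loss then bottoms out at the entropy H(p) when q = p, and no other factorization or decoding order can get below that floor.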