My guess is that autoregressive models can use key-value (KV) caching to eliminate most of the redundant FLOPs in the self-attention block: because attention is causal, the keys and values of earlier tokens never change, so they can be computed once and reused at every decoding step instead of being recomputed for the whole prefix. Diffusion models can't use KV caching, since their attention is bidirectional and every token's representation changes at each denoising step, so cached keys and values would go stale. They sell this as a win anyway, arguing that bidirectional attention leads to better reasoning.
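
To make the caching argument concrete, here is a minimal numpy sketch (toy weights, single head, no batching — all names are hypothetical, just for illustration): incremental decoding computes q, k, v only for the newest token, appends k and v to a cache, and still reproduces full causal attention exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension
T = 5  # sequence length
# Toy projection matrices (hypothetical, just for illustration)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
x = rng.standard_normal((T, d))  # token embeddings

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def full_attention(x):
    """Uncached causal attention: recomputes K and V for the whole prefix."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
    scores[mask] = -np.inf  # causal mask: token t attends only to <= t
    return softmax(scores) @ V

def cached_attention(x):
    """Incremental decoding: per step, project only the newest token
    and reuse cached keys/values for everything before it."""
    K_cache, V_cache, outs = [], [], []
    for t in range(len(x)):
        q, k, v = x[t] @ Wq, x[t] @ Wk, x[t] @ Wv
        K_cache.append(k)  # earlier k, v never change under causal attention
        V_cache.append(v)
        K, V = np.stack(K_cache), np.stack(V_cache)
        w = softmax(q @ K.T / np.sqrt(d))
        outs.append(w @ V)
    return np.stack(outs)

# Cached and uncached paths agree, but the cached one does O(T) projection
# work per step instead of O(T) projections for the entire prefix.
assert np.allclose(full_attention(x), cached_attention(x))
```

In a bidirectional (diffusion-style) model the equivalent of `K_cache` would be invalidated every denoising step, since every token's embedding — and hence its key and value — changes each time.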