> I wonder how much of this is due to diffusion models having less capacity for memorization than autoregressive models
Diffusion models require more compute than autoregressive models, and the excess is proportional to the sequence length. Time-dilated RNNs and adaptive computation in image recognition suggest that we can spend more compute with the same weights and get better results.
That, I believe, also points to at least one flaw in the TS study: I did not see them match DLM and AR by compute; they matched them only by weights.
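To make the "proportional to sequence length" point concrete, here is a rough back-of-envelope sketch for decoding. It is not taken from the study. It assumes the common approximation of about 2 FLOPs per parameter per token for a forward pass, ignores attention cost, assumes the AR model uses a KV cache, and assumes the DLM re-processes the whole sequence at every denoising step. The function names and the step count are hypothetical.

```python
def ar_decode_flops(n_params: int, seq_len: int) -> int:
    """Approximate FLOPs for an AR model to generate seq_len tokens.

    Assumption: with a KV cache, each new token costs one forward pass
    over a single position, about 2 * n_params FLOPs.
    """
    return 2 * n_params * seq_len


def diffusion_decode_flops(n_params: int, seq_len: int, n_steps: int) -> int:
    """Approximate FLOPs for a DLM to generate seq_len tokens.

    Assumption: every denoising step runs a full forward pass over all
    seq_len positions, with no caching between steps.
    """
    return 2 * n_params * seq_len * n_steps


def compute_ratio(n_params: int, seq_len: int, n_steps: int) -> float:
    """How many times more compute the DLM spends than the AR model."""
    return diffusion_decode_flops(n_params, seq_len, n_steps) / ar_decode_flops(
        n_params, seq_len
    )


if __name__ == "__main__":
    n_params = 10**9  # hypothetical 1B-parameter model for both
    for seq_len in (256, 1024, 4096):
        # Assumption: one denoising step per token, so n_steps == seq_len.
        print(seq_len, compute_ratio(n_params, seq_len, n_steps=seq_len))
```

Under these assumptions the ratio is simply the number of denoising steps, so it grows linearly with sequence length when the step count scales with length. With a fixed, small number of steps the gap would be a constant factor instead.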
Do you have references on adaptive methods for image recognition?