I wonder how much of this is due to diffusion models having less capacity for memorization than autoregressive models.
The autoregressive models consistently show better loss for the same number of training tokens.
I find a lot of the conclusions compelling, but I would’ve loved to see more epochs of training on the 1B model with the 10B-token dataset, as that model was showing epoch-over-epoch improvements.
> as that model was showing epoch-over-epoch improvements
Both of them were showing improvements. I agree with you that I'd like to see more, but I'm not sure more would significantly change the argument (which is largely about how the metrics aren't straightforward), especially given what the 96B-token experiment shows.

In fact, those results are so similar I had to open them up in GIMP, align them, and hunt for the differences. Now I'm actually not convinced there wasn't a mistake. There are differences, just very minor ones. It's harder to tell with the AR model because of the scale, but in the diffusion plot you can see a little bump in the second one right before the concavity change at the end. There are some more bumps in the AR model earlier on that help show differences too, but the fact that the envelopes are nearly identical is... suspicious. I'm not claiming maliciousness, because even if it is a mistake, these things are so easy to make that they're common. I'm not even convinced there is a mistake, but it warrants extra scrutiny.
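If the raw loss logs were available (they may not be; this is only a sketch with hypothetical file names and columns), a numerical diff would settle it faster than overlaying screenshots:

```python
# Minimal sketch: compare two logged loss curves numerically instead of
# overlaying screenshots in GIMP. Assumes each run's loss history was
# exported as a CSV with (step, loss) columns -- hypothetical file names.
import csv

def load_curve(path):
    with open(path) as f:
        return {int(row["step"]): float(row["loss"]) for row in csv.DictReader(f)}

run_a = load_curve("run_epoch1.csv")
run_b = load_curve("run_epoch2.csv")

# Largest absolute gap at shared logging steps: a near-zero gap across the
# whole run would suggest duplicated results rather than two separate runs.
shared = sorted(set(run_a) & set(run_b))
max_gap = max(abs(run_a[s] - run_b[s]) for s in shared)
print(f"max |loss_a - loss_b| over {len(shared)} shared steps: {max_gap:.6f}")
```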
That said, money is finite and these experiments are quite computationally heavy. The author looks to be a research fellow, so I'm assuming they're not backed by big tech.
> The autoregressive models consistently show better loss for the same number of training tokens
I thought bidirectional transformers (non-autoregressive) show lower loss than autoregressive ones for the same number of training tokens.
> I wonder how much of this is due to diffusion models having less capacity for memorization than autoregressive models
Diffusion requires more compute than autoregressive models; the excess compute is proportional to the sequence length. Time-dilated RNNs and adaptive computation in image recognition hint that we can compute more with the same weights and achieve better results.
Which, I believe, also hints at at least one flaw of the TS study: I did not see that they matched the DLM and AR models by compute; they matched them only by weights.
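As a rough back-of-the-envelope sketch (not the study's method; the per-token pass factor below is my assumption, standing in for the "excess proportional to sequence length" point above), matching by training compute rather than by weights might look like this, using the usual ~6 * params * tokens FLOPs approximation for a transformer forward+backward pass:

```python
# Sketch of compute matching under stated assumptions, not the study's setup.
# Standard estimate: training FLOPs ~= 6 * N * D for N parameters and D tokens.
# `dlm_pass_factor` is a hypothetical multiplier for the DLM's extra per-token compute.

def train_flops(params, tokens, passes_per_token=1.0):
    return 6.0 * params * tokens * passes_per_token

params = 1e9           # 1B-parameter models, as in the discussion above
ar_tokens = 10e9       # 10B-token dataset
dlm_pass_factor = 4.0  # assumed extra compute per training token for the DLM

ar_budget = train_flops(params, ar_tokens)
# Tokens the DLM could see under the same FLOPs budget:
dlm_tokens = ar_budget / (6.0 * params * dlm_pass_factor)
print(f"AR budget: {ar_budget:.3e} FLOPs -> DLM gets ~{dlm_tokens:.2e} tokens")
```

Under that kind of accounting, a weight-matched comparison quietly hands the DLM a larger compute budget (or the AR model a smaller one), which is exactly the confound being pointed at.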