> as that model was showing epoch over epoch improvements
Both of them were showing improvements. I agree with you that I'd like to see more, but I'm not sure more would significantly change the argument (which is a lot about how metrics aren't straight forward). Especially since the 96B token experiment shows.IN FACT, those results are so similar I had to open them up in GIMP to align and spot the differences. Now I'm actually not convinced there wasn't a mistake. There are differences, just very minor. Harder to tell with the AR model because scale, but in the diffusion you can see a little bump in the second one right before the concavity change at the end. There some more bumps in the AR model earlier on that help show differences too, but the fact that the envelopes are nearly identical is... suspicious. I'm not claiming maliciousness because even if a mistake these things are so easy to make that they are common. I'm not even convinced there is a mistake, but it warrants extra thinking.
That said, money is finite and these are quite computationally heavy. Author looks to be a research fellow and so I'm assuming not backed by big tech.