There was a very interesting paper out of Stanford this past September about pretraining under the unlimited-compute, limited-data paradigm[0]. It's pretty much exactly the same setup, just with ~200M training tokens instead.
[0] https://www.alphaxiv.org/abs/2509.14786
Yeah, we do incorporate some of the findings from that paper in our repo, like aggressive regularization and ensembling.
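Roughly what those two ideas look like in practice (a minimal PyTorch sketch, not the repo's actual code; the weight-decay value and the logit-averaging ensemble here are just illustrative assumptions):

```python
import torch

# Aggressive regularization: push weight decay well above the usual ~0.1
# when data is the bottleneck (the exact value here is illustrative).
def make_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    return torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1.0)

# Ensembling: train several copies on the same limited data with different
# seeds, then average their logits at inference time.
@torch.no_grad()
def ensemble_logits(models: list[torch.nn.Module], tokens: torch.Tensor) -> torch.Tensor:
    logits = torch.stack([m(tokens) for m in models])  # (N, batch, seq, vocab)
    return logits.mean(dim=0)
```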