Hacker News

linolevan · yesterday at 10:23 PM

There was a very interesting paper out of Stanford this past September about pretraining under the unlimited-compute, limited-data paradigm [0]. It's pretty much exactly the same thing, but with ~200M training tokens instead.

[0] https://www.alphaxiv.org/abs/2509.14786


Replies

sdpmas · yesterday at 10:33 PM

Yeah, we do incorporate some of the findings from the paper in our repo, like aggressive regularization and ensembling.
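
A minimal sketch of what those two ideas can look like in practice, assuming a PyTorch-style setup (this is not taken from the repo or the paper's code; the toy model, data, and the specific weight-decay value are placeholders): train several members with an unusually large weight-decay penalty, then average their logits at evaluation time.

```python
# Hedged sketch: "aggressive regularization" here means a weight decay well
# above the usual ~0.01-0.1 (placeholder value), and "ensembling" means
# averaging the logits of independently seeded members.
import torch
import torch.nn as nn

def make_member(seed: int) -> nn.Module:
    # Tiny stand-in model; a real run would use a transformer LM.
    torch.manual_seed(seed)
    return nn.Sequential(nn.Embedding(1000, 64), nn.Flatten(), nn.Linear(64 * 8, 1000))

def train_member(model: nn.Module, batches, weight_decay: float = 0.3) -> nn.Module:
    # Heavy weight decay as the regularizer (assumed value for illustration).
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    for tokens, targets in batches:
        opt.zero_grad()
        loss = loss_fn(model(tokens), targets)
        loss.backward()
        opt.step()
    return model

@torch.no_grad()
def ensemble_logits(members, tokens):
    # Simple logit averaging across members trained from different seeds.
    return torch.stack([m(tokens) for m in members]).mean(dim=0)

# Toy usage: four members trained on the same small batch stream.
batches = [(torch.randint(0, 1000, (32, 8)), torch.randint(0, 1000, (32,)))
           for _ in range(10)]
members = [train_member(make_member(seed), batches) for seed in range(4)]
preds = ensemble_logits(members, batches[0][0]).argmax(dim=-1)
```

The point of the sketch is just the structure: in the data-constrained regime you can spend extra compute on more members and stronger regularization rather than on more tokens.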
