
whimsicalism · yesterday at 5:38 PM

> Nowadays models are trained with an effective batch size of millions of tokens in total. Of course, this won't fit into memory, so one uses gradient accumulation for that purpose, again as mentioned by Gemini.
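
(For reference, a minimal sketch of the mechanism being described, assuming a PyTorch-style loop; the model, data, and accum_steps below are made-up placeholders, not details from any actual run. Gradients from several micro-batches are summed before a single optimizer step, so the effective batch is accum_steps times the micro-batch.)

    import torch

    model = torch.nn.Linear(1024, 1024)          # stand-in model
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    accum_steps = 8                              # micro-batches per optimizer step

    # synthetic micro-batches standing in for a real data loader
    micro_batches = [(torch.randn(4, 1024), torch.randn(4, 1024)) for _ in range(32)]

    for step, (x, y) in enumerate(micro_batches):
        loss = torch.nn.functional.mse_loss(model(x), y)
        (loss / accum_steps).backward()          # gradients accumulate in .grad across micro-batches
        if (step + 1) % accum_steps == 0:
            opt.step()                           # one optimizer update per accum_steps micro-batches
            opt.zero_grad()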

I would be surprised if there is much (if any) gradient accumulation in modern large-scale pretraining runs. You can always just recruit more GPUs with DP/PP/TP rather than training for longer.
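
Back-of-the-envelope, effective tokens per step ≈ tokens per micro-batch × accumulation steps × DP ranks, so scaling out DP ranks drives the accumulation steps you need toward 1. The numbers below are purely illustrative assumptions, not figures from any real run:

    # illustrative only: assumed per-GPU micro-batch and target effective batch
    tokens_per_micro_batch = 4 * 4096        # 4 sequences of 4096 tokens per GPU (assumed)
    target_effective_tokens = 4_000_000      # "millions of tokens" per optimizer step (assumed)

    for dp_ranks in (64, 256, 1024):
        accum = max(1, round(target_effective_tokens / (tokens_per_micro_batch * dp_ranks)))
        print(f"{dp_ranks:4d} DP ranks -> ~{accum} gradient-accumulation step(s)")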