I can't imagine training took more than a day on 8 A100s, even with that vocab size [0] (does Lightning do implicit vocab extension, maybe?) and a batch size of 1 [1], 64 [2], or 4096 [3] (I haven't trawled through the repo and other work enough to see what they actually use in the paper, and let's be real - we've all copied random min/nano/whatever GPT forks and not bothered renaming things). They mention their dataset is 120 million tokens, which is minuscule by transformer standards. Even if the more graph-based model makes training 10x+ slower, the equivalent of 1.2 billion tokens per epoch shouldn't take more than a couple of hours with no optimization.
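Rough back-of-the-envelope, with an assumed throughput on my part (a GPT-2-small-scale decoder on 8 A100s typically sustains a few hundred thousand tokens/sec with an unremarkable training stack; none of these numbers come from their setup):

    dataset_tokens = 120e6        # stated in the paper
    slowdown = 10                 # generous 10x penalty for the graph-flavored setup
    tokens_per_epoch = dataset_tokens * slowdown     # 1.2e9 "epoch equivalent"

    assumed_throughput = 200_000  # tokens/sec across 8 A100s - my assumption
    hours = tokens_per_epoch / assumed_throughput / 3600
    print(f"~{hours:.1f} h per epoch-equivalent")    # ~1.7 hours

Even if my throughput guess is off by several-fold, you're still well under a day.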
[0] https://github.com/keyonvafa/world-model-evaluation/blob/949...
[1] https://github.com/keyonvafa/world-model-evaluation/blob/949...
[2] https://github.com/keyonvafa/world-model-evaluation/blob/949...
[3] https://github.com/keyonvafa/world-model-evaluation/blob/mai...