Wrong as it is, I'm impressed they were able to get any maps out of their LLM that look vaguely cohesive. The shortest path map has bits of streets downtown and around Central Park that aren't totally red, and Central Park itself is clear on all 3 maps.
They used eight A100s, but don't say how long it took to train their LLM. It would be interesting to know the wall-clock time they spent. Their dataset is, relatively speaking, tiny, which means it should take fewer resources to replicate from scratch.
What's interesting is that the Smalley model performed better, though they don't speculate why.
It's a bit unclear to me what the map visualisations are showing, but I don't think your interpretation is correct. They even say:
> Our evaluation methods reveal they are very far from recovering the true street map of New York City. As a visualization, we use graph reconstruction techniques to recover each model’s implicit street map of New York City. The resulting map bears little resemblance to the actual streets of Manhattan, containing streets with impossible physical orientations and flyovers above other streets.
I can't imagine training took more than a day with 8 A100s, even with that vocab size [0] (does lightning do implicit vocab extension maybe?) and a batch size of 1 [1] or 64 [2] or 4096 [3] (I have not trawled through the repo and other work enough to see what they are actually using in the paper, and let's be real: we've all copied random min/nano/whatever GPT forks and not bothered renaming stuff). They mentioned their dataset is 120 million tokens, which is minuscule by transformer standards. Even if a more graph-based model made it 10X+ longer to train, the equivalent of 1.2 billion tokens per epoch shouldn't take more than a couple of hours with no optimization.
[0] https://github.com/keyonvafa/world-model-evaluation/blob/949...
[1] https://github.com/keyonvafa/world-model-evaluation/blob/949...
[2] https://github.com/keyonvafa/world-model-evaluation/blob/949...
[3] https://github.com/keyonvafa/world-model-evaluation/blob/mai...
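To make the back-of-envelope estimate above concrete, here is a rough sketch of the arithmetic. Only the 120 million tokens, the 10X slowdown guess, and the 8 GPUs come from the comment; the function names are made up for illustration, perfect linear scaling across GPUs is an assumption, and the number of epochs isn't known, so this covers a single pass over the data.

```python
# Back-of-envelope only: figures marked "from the thread" are quoted,
# everything else is an assumption for illustration.

DATASET_TOKENS = 120_000_000  # from the thread: "120 million tokens"
SLOWDOWN = 10                 # from the thread: "10X+ longer to train" (a guess)
NUM_GPUS = 8                  # from the thread: eight A100s


def epoch_hours(tokens, tokens_per_sec_per_gpu, num_gpus=NUM_GPUS):
    """Hours for one pass over `tokens`, assuming perfect linear GPU scaling."""
    return tokens / (tokens_per_sec_per_gpu * num_gpus) / 3600


def required_throughput(tokens, hours, num_gpus=NUM_GPUS):
    """Tokens/sec each GPU must sustain to finish `tokens` in `hours`."""
    return tokens / (hours * 3600 * num_gpus)


# Token-equivalents per epoch once the slowdown is folded in.
effective_tokens = DATASET_TOKENS * SLOWDOWN

if __name__ == "__main__":
    print(f"effective tokens per epoch: {effective_tokens:,}")
    need = required_throughput(effective_tokens, hours=2)
    print(f"per-GPU throughput needed for a 2 h epoch: {need:,.0f} tok/s")
```

So "a couple of hours" per epoch holds as long as each GPU sustains roughly 21k token-equivalents per second. Whether that's realistic depends on model size and sequence length, which the thread doesn't pin down.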