I’ve replicated the OthelloGPT results mentioned in this paper myself, and it definitely felt like next-move-only accuracy was not the whole story. Indeed, the authors of the original paper knew this, so they further validated the world model by intervening in the model’s forward pass to directly manipulate the world model, then checking that the valid move predictions changed to match.
I’d also recommend checking out Neel Nanda’s work on OthelloGPT, where he demonstrated that the world model is actually represented linearly (i.e. a linear probe can read the board state off the activations): https://arxiv.org/abs/2309.00941
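
To make the "intervene on a linear representation" idea concrete, here is a toy sketch in NumPy. It is not code from either paper: the vectors are random stand-ins, and `probe`, `resid`, and `flip_along_probe` are hypothetical names I made up for illustration. It assumes one board square is read out by a single linear probe direction, and shows that reflecting the activation along that direction flips the probe's readout while leaving everything orthogonal to it alone.

```python
import numpy as np


def flip_along_probe(activation, probe_direction):
    """Reflect `activation` across the hyperplane orthogonal to `probe_direction`.

    This negates the component the (hypothetical) linear probe reads,
    and leaves the rest of the activation unchanged.
    """
    unit = probe_direction / np.linalg.norm(probe_direction)
    projection = activation @ unit
    return activation - 2.0 * projection * unit


rng = np.random.default_rng(0)
d_model = 16                       # toy width, not the real model's
probe = rng.normal(size=d_model)   # stand-in for one square's probe weights
resid = rng.normal(size=d_model)   # stand-in for a residual-stream activation

before = resid @ probe             # probe readout before the edit
edited = flip_along_probe(resid, probe)
after = edited @ probe             # probe readout after the edit
```

In a real experiment you would apply an edit like this to the model's activations mid-forward-pass (e.g. via a hook) and then compare the predicted legal moves before and after, which is the part that actually tests whether the model uses the representation.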