Really glad to see some academic research on this. It was quite obvious from interacting with LLMs that they form a world model and can, for example, correctly simulate simple physics experiments that are not in the training set. I found it very frustrating to see people repeating the idea that “it can never do X” because it lacks a world model. Predicting text that represents events in the world requires modeling that world. Finding examples where a given model's predictions are bad does not imply there is no model at all. At the limit of prediction becoming as good as theoretically possible given the input data and model size restrictions, the model also becomes as accurate and complete as possible. This process is formally described by the theory of Solomonoff induction.
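To make the Solomonoff reference concrete, here is a rough sketch in LaTeX notation (my paraphrase of the standard binary-alphabet statement, not a quote from any paper; exact constants vary across presentations). The universal prior M(x) weights every program p that makes a universal prefix machine U output a continuation of x by that program's length \ell(p), and the convergence theorem bounds the total expected squared prediction error by the Kolmogorov complexity K(\mu) of the true computable source \mu:

    M(x) = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}

    \sum_{t=1}^{\infty} \mathbb{E}_\mu\!\left[ \bigl( M(x_t = 1 \mid x_{<t}) - \mu(x_t = 1 \mid x_{<t}) \bigr)^2 \right] \le \frac{\ln 2}{2} K(\mu)

Since the right-hand side is finite, M's conditional predictions converge to \mu's, which is the formal sense in which sufficiently good prediction forces an (implicit) model of the data-generating process.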
> At the limit of prediction becoming as good as theoretically possible given the input data and model size restrictions
You are treading on delicate ground here. Why do you believe that sequence models are capable of reaching that theoretical maximum?