> This matters because (1) the world cannot be modeled anywhere close to completely with language alone
LLMs being "Language Models" means they model language; it doesn't mean they "model the world with language".
On the contrary, modeling language well requires also modeling the world, but that world model lives in the hidden state, not in language itself.
Let's be more precise: LLMs have to model the world from an intermediate, tokenized representation of the text on the internet. Most of this text is natural language, but to allow for e.g. code and math, let's say "tokens" to keep it generic, even though in practice most tokens encode natural language (a sketch of what this looks like follows).
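To make "tokenized representation" concrete, here's a minimal sketch using OpenAI's tiktoken library; the encoding name is just one common choice, picked for illustration. The point is that prose, code, and math all reduce to the same kind of integer sequence before the model ever sees them:

```python
# Minimal sketch: what an LLM actually "sees" is integer token ids,
# whether the input is prose, code, or math notation.
# "cl100k_base" is one common tiktoken encoding, used here for illustration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["The cat sat on the mat.",       # natural language
             "def f(x): return x ** 2",       # code
             "\\int_0^1 x^2 dx = 1/3"]:       # math notation
    ids = enc.encode(text)
    print(ids)                                # a flat list of integers
    # The mapping round-trips losslessly back to the original string:
    assert enc.decode(ids) == text
```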
LLMs can only model tokens, and tokens are produced by humans trying to model the world. Tokenized models are NOT the only kinds of models humans can produce (we can have visual, kinaesthetic, tactile, gustatory, and all sorts of sensory, non-linguistic models of the world).
LLMs are trained on tokenizations of text, and most of that text is humans attempting to translate their various models of the world into tokenized form. In other words, humans make tokenized models of their actual models (which are themselves just messy models of the world), and those second-order models are what LLMs are trained on.
So, do LLMs "model the world with language"? Well, they are constrained in that they can only model the parts of the world that have already been modeled in language (or more generally: tokenized), so the "with" here is vague. But whatever patterns end up encoded in the hidden state are still, ultimately, patterns of tokens (a toy sketch of this follows).
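To show where the hidden state sits relative to the tokens, here's a toy sketch in plain numpy, with random weights standing in for a trained model; it's nothing like a real transformer, just an illustration. The internal representation is a continuous vector, not language, yet its only input from the world is token ids:

```python
# Toy sketch: the "world model" lives in a continuous hidden-state
# vector, but that vector is computed purely from token ids.
# Random weights and crude pooling stand in for a trained transformer;
# the vocab size and width below are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 50_000, 64                  # hypothetical vocab size / width

embed = rng.normal(size=(VOCAB, D))    # token id -> embedding vector
W = rng.normal(size=(D, D))            # stand-in for the model's layers

def hidden_state(token_ids: list[int]) -> np.ndarray:
    """Map a token sequence to a single hidden-state vector."""
    h = embed[token_ids].mean(axis=0)  # crude pooling of token embeddings
    return np.tanh(h @ W)              # continuous, non-linguistic state

h = hidden_state([791, 8415, 7731])    # arbitrary example token ids
print(h.shape)                         # (64,) -- a vector, not words
```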
Humans can have models that are far richer than patterns of tokens. Non-LLM models (e.g. VLMs, or models connected to sensors, such as those in self-driving vehicles) can use more than linguistic tokens to model the world, but pure LLMs are deeply constrained relative to humans in this very specific sense.