
HarHarVeryFunny · 11/07/2024 · 9 replies

An LLM necessarily has to create some sort of internal "model" / representations pursuant to its "predict next word" training goal, given the depth and sophistication of context recognition needed to do well. This isn't an N-gram model restricted to just looking at surface word sequences.

However, the question should be: what sort of internal "model" has it built? It seems fashionable to refer to this as a "world model", but IMO this isn't really appropriate, and certainly it's going to be quite different from the predictive representations that any animal that interacts with the world, and learns from those interactions, will have built.

The thing is that an LLM is an auto-regressive model - it is trying to predict continuations of training set samples solely based on word sequences, and is not privy to the world that is actually being described by those word sequences. It can't model the generative process of the humans who created those training set samples because that generative process has different inputs - sensory ones (in addition to auto-regressive ones).

The "world model" of a human, or any other animal, is built pursuant to predicting the environment, but not in a purely passive way (such as a multi-modal LLM predicting next frame in a video). The animal is primarily concerned with predicting the outcomes of it's interactions with the environment, driven by the evolutionary pressure to learn to act in way that maximizes survival and proliferation of its DNA. This is the nature of a real "world model" - it's modelling the world (as perceived thru sensory inputs) as a dynamical process reacting to the actions of the animal. This is very different to the passive "context patterns" learnt by an LLM that are merely predicting auto-regressive continuations (whether just words, or multi-modal video frames/etc).


Replies

mistercow · 11/07/2024

> It can't model the generative process of the humans who created those training set samples because that generative process has different inputs - sensory ones (in addition to auto-regressive ones).

I think that’s too strong a statement. I would say that it’s very constrained in its ability to model that, but not having access to the same inputs doesn’t mean you can’t model a process.

For example, we model hurricanes based on measurements taken from satellites. Those aren’t the actual inputs to the hurricane itself, but abstracted correlates of those inputs. An LLM does have access to correlates of the inputs to human writing, i.e. textual descriptions of sensory inputs.

madaxe_again · 11/07/2024

You say this, yet cases such as Helen Keller's suggest that a full sensorium is not necessary to be a full human. She had some grasp of the idea of colour, of sound, and could use the words around them appropriately - yet had no firsthand experience of either. Is it really so different?

I think "we" each comprise a number of models, language being just one of them - albeit an extremely powerful one, as it allows the transmission of thought across time and space. It's therefore understandable that much of what we recognise as conscious thought, as a model of the world, emerges from such an information-dense system. It's literally developed to describe the world, efficiently and completely, and so that symbol map an LLM carries possibly isn't that different to our own.

comfysocks · 11/09/2024

It seems to me that the human authors of the training text are the ones who have created the "world model", and have encoded it into written language. The LLM transcodes this model into word-embedding vector space. I think most people can recognize a high-dimensional vector space as a reasonable foundation for a mathematical "model". The humans are the ones who have interacted with the world and have perceived its workings. The LLM only interacts with the humans' language model. Some credit must be given to the human modellers for the unreasonable effectiveness of the LLM.

machiaweliczny · 11/07/2024

But if you squint, sensory inputs and reactions are also sequential tokens. Reactions can even be encoded alongside the inputs as action tokens, in a single token stream. Has anyone tried something like this?
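
Roughly what I have in mind, as a toy sketch (the vocabularies, episode format, and names here are all made up for illustration):

    # Interleave observation and action tokens into one flat stream, so a
    # standard autoregressive next-token model is trained to predict both.
    OBS_VOCAB = {"empty": 0, "food": 1, "wall": 2}
    ACT_VOCAB = {"left": 3, "right": 4, "forward": 5}

    def encode_episode(steps):
        """steps: list of (observation, action) pairs -> flat token stream."""
        stream = []
        for obs, act in steps:
            stream.append(OBS_VOCAB[obs])   # what was perceived
            stream.append(ACT_VOCAB[act])   # the reaction to it
        return stream

    episode = [("empty", "forward"), ("food", "forward"), ("wall", "left")]
    print(encode_episode(episode))  # [0, 5, 1, 5, 2, 3]
    # Next-token prediction over streams like this learns both "what happens
    # next" and "what the agent does next" in one sequence.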

dsubburam · 11/07/2024

> The "world model" of a human, or any other animal, is built pursuant to predicting the environment

What do you make of Immanuel Kant's claim that all thinking has as a basis the presumption of the "Categories"--fundamental concepts like quantity, quality and causality[1]. Do LLMs need to develop a deep understanding of these?

[1] https://plato.stanford.edu/entries/categories/#KanCon

lxgr · 11/07/2024

But isn't the distinction between a "passive" and an "active" model ultimately a metaphysical (freedom of will vs. determinism) question, under the (possibly practically infeasible) assumption that the passive model gets to witness all possible actions an agent might take?

Practically, I could definitely imagine interesting outcomes from e.g. hooking up a model to a high-fidelity physics simulator during training.
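
Concretely, I'm picturing something like this toy loop (purely illustrative - the simulator and model here are trivial stand-ins, not any real library):

    import random

    class ToySim:
        """Stand-in 'physics': a 1-D position that drifts by the chosen action."""
        def __init__(self):
            self.pos = 0.0
        def step(self, action):
            self.pos += action
            return self.pos

    class ToyModel:
        """Stand-in predictive model: learns how actions move the position."""
        def __init__(self):
            self.w, self.pos = 0.0, 0.0
        def predict(self, action):
            return self.pos + self.w * action
        def update(self, action, actual, lr=0.1):
            err = self.predict(action) - actual
            self.w -= lr * err * action   # gradient step on squared error
            self.pos = actual             # observe the simulator's true outcome

    sim, model = ToySim(), ToyModel()
    for _ in range(100):
        a = random.choice([-1.0, 1.0])
        model.update(a, sim.step(a))      # train against the simulated physics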

stonemetal12 · 11/07/2024

People around here like to say "the map isn't the territory". If we are talking about the physical world, then language is a map, not the territory, and not a detailed one either; an LLM trained on it is a second-order map.

If we consider the territory to be human intelligence, then language is still a map, but a much more detailed one. Thus an LLM trained on it becomes a more interesting second-order map.

seydor · 11/07/2024

Animals could well use an autoregressive model to predict the outcomes of their actions on their perceptions. It's not like we run math in our everyday actions (it would take too long).

Perhaps that's why we can easily communicate those predictions as words.

ElevenLathe · 11/09/2024

We can't see neutrons either, but we have built various models of them based on indirect observations.