It's not a statistical next-word predictor.
'Predicting the next word' is the training mechanism of the LLM, and it leads to a latent space that can encode higher-level concepts.
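For concreteness, here is a minimal toy sketch of that training mechanism (not any real system's code; the GRU is a stand-in for a transformer, and all sizes are made up): the only signal the model ever receives is "predict the next token", and the latent representations are whatever falls out of optimizing that objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, CTX = 1000, 64, 32  # toy sizes, purely illustrative

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)               # token id -> vector
        self.backbone = nn.GRU(DIM, DIM, batch_first=True)  # stand-in for a transformer
        self.head = nn.Linear(DIM, VOCAB)                   # latent vector -> logits over vocab

    def forward(self, tokens):                   # tokens: (batch, seq)
        h, _ = self.backbone(self.embed(tokens)) # h is the "latent space" being argued about
        return self.head(h)                      # (batch, seq, vocab) next-token logits

model = TinyLM()
tokens = torch.randint(0, VOCAB, (4, CTX))       # fake training batch of token ids
logits = model(tokens[:, :-1])                   # predict from each prefix
# cross-entropy against the *next* token at every position -- that is the
# entire training signal; any higher-level structure has to emerge from it
loss = F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
loss.backward()
```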
Basically, an LLM 'understands' only as much as it needs to in order to respond in a reasonable way.
An LLM doesn't predict German text or Chinese text. It predicts the concept and then has a language layer that outputs tokens.
And it's not just LLMs that are progressing fast: voice synthesis and voice understanding have jumped significantly, along with motion detection, skeleton tracking, virtual world generation (see Nvidia's approach to generating virtual worlds for training their cars), protein folding, etc.
> It's not a statistical next-word predictor.
It absolutely is a next-word predictor.
LLM proponents believe that these higher-level encodings in latent space do in fact match the real-world concepts described by our language(s).
However, a much simpler explanation for what we see with LLMs is that these higher-level encodings in latent space match only the patterns of our language(s), and that no deeper encoding or understanding is present.
It's Plato's Cave: the shadows on the wall are all an LLM ever sees, yet somehow it is expected to derive the reality behind them.
I'm sorry, but the input to the model is a sequence of tokens and the output is a probability distribution over what the next token will be. It's a very, very fancy next-token predictor, but that is fundamentally what it is. I'm making the argument that this paradigm might not give rise to a general intelligence, no matter how much you scale it.
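To make that claim concrete, here is a rough sketch of the interface being described (the model is a placeholder, e.g. the toy model sketched earlier, not any specific real LLM): tokens go in, a probability distribution over the next token comes out, one token is sampled, and the loop repeats.

```python
import torch
import torch.nn.functional as F

def generate(model, prompt_tokens, steps=20, temperature=1.0):
    """Autoregressive decoding loop: repeatedly sample from the model's
    next-token distribution and feed the result back in."""
    tokens = list(prompt_tokens)
    for _ in range(steps):
        x = torch.tensor([tokens])                       # (1, seq) token ids
        logits = model(x)[0, -1]                         # logits for the next token only
        probs = F.softmax(logits / temperature, dim=-1)  # the probability distribution
        next_token = torch.multinomial(probs, 1).item()  # sample one token from it
        tokens.append(next_token)
    return tokens
```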