"Next token prediction" is not an answer. It's mental shortcut. An excuse not to think about the implications. An excuse a lot of people are eager to take.
First, autoregressive next token prediction can be Turing complete. This alone should give you a big old pause before you say "can't do X".
Second, "next token prediction" is what happens at an exposed top of an entire iceberg worth of incredibly poorly understood computation. An LLM is made not by humans, but by an inhuman optimization process. No one truly "understands" how an LLM actually works, but many delude themselves into thinking that they do.
And third, what is the task a base model LLM is trained for - the thing the optimization process was actually optimizing? Text completion. Now, what is text? A product of human thinking, expressed in natural language. And the LLM is forced to conform to that shape.
How close does it get in practice to the original?
Not close enough to a full copy, clearly. But close enough that even the flaws of human thinking are often reproduced faithfully.
> First, autoregressive next token prediction can be Turing complete. This alone should give you a big old pause before you say "can't do X".
Lots of things are Turing complete. We don't usually think they're smart, unless it's the first time we've seen a computer and have no idea how it works.
Mathematically, an LLM is a Markov chain. We can build an LLM with a context window of one token, and it's basically a token frequency table. We can make the context window bigger and it becomes better at generating plausible-looking text.
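To make the one-token-context case concrete, here's a minimal sketch (the toy corpus, whitespace "tokenization", and generation length are purely illustrative): the "model" really is just a table of next-token counts, sampled autoregressively.

```python
import random
from collections import Counter, defaultdict

# Toy corpus and whitespace "tokenization" -- purely illustrative.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# A "language model" with a one-token context window:
# for each token, count which tokens were observed to follow it.
table = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    table[prev][nxt] += 1

def next_token(prev):
    """Sample a next token in proportion to how often it followed `prev`."""
    counts = table[prev]
    if not counts:                    # dead end: token never seen with a successor
        return random.choice(corpus)  # fall back to picking any corpus token
    tokens, weights = zip(*counts.items())
    return random.choices(tokens, weights=weights)[0]

# Autoregressive generation: each sampled token becomes the next step's context.
token = "the"
output = [token]
for _ in range(8):
    token = next_token(token)
    output.append(token)

print(" ".join(output))  # plausible-looking word salad, e.g. "the dog sat on the mat and the cat"
```

Widening the context window just makes that table implicit and astronomically larger, which is exactly where the disagreement above starts.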
Is it possible that beyond becoming better at generating plausible-looking text – the expected and observed outcome – it also gains some actual intelligence? It's very hard to disprove, but Occam's razor might not be kind to it.