Because there are some really fundamental things they cannot do with next token prediction. For instance, their memory is akin to that of someone who reads the phone book and memorizes the entire thing, but can't tell you what a phone number is for. Moreover, they can mimic semantic knowledge, because they have been trained on that knowledge, but take them out of their training distribution and they slip into a "creative storytelling" mode very quickly. They can quote me all the rules of chess, but when it comes to actually making a chess move they break those rules with abandon, simply because they never actually understood the rules. Chess is instructive in another way, too: you can get them to play a pretty solid opening, maybe 10 or 15 moves in, but then they start forgetting pieces, creating board positions that are impossible to reach, and so on. They have memorized the form of a board and know the names of the pieces, but they have no true understanding of what a chess game is. Coding is similar: they're fine when you give them Python or Bash scripts to write, since they've been heavily trained on those, but ask them to deal with a system that has a non-standard stack and they will go haywire if you let their context get even medium-sized. Something else they lack is any kind of learning efficiency as you or I would understand the concept. By this I mean that the entire Internet is not sufficient to train today's models; the labs have to synthesize new data for models to train on to get sufficient coverage of any area they want the model to be knowledgeable about. Continuous learning is a well-known issue as well: they simply don't do it. The labs have created memory, which is just more context engineering, but it's not the same as updating as you interact with them. I could go on.
At the end of the day, next token prediction is a sleight of hand. It produces amazingly powerful effects, I agree. You can turn this one magic trick into the illusion of reasoning, but what it's doing is more of a "one thing after another" style of storytelling that is fine for a lot of things, but doesn't get to the heart of what intelligence means. If you want to call them intelligent because they can do this stuff, fine, but it's an alien kind of intelligence that is incredibly limited. A dog or a cat actually demonstrates more ability to learn, to contextualize, and to make meaning.
None of this is a logical certainty of the form "X, therefore Y"; it's just opinion. You can trivially add memory to a model by continuing to train it. We just don't do it because it's expensive, not because it can't be done.
Also, the phone book example is off the mark, because if I take a human who's never seen a phone and ask them to memorise the phone book, they would manage it (or not) without knowing what a phone number was for. Did you expect that a human would just come up with knowledge about phones entirely on their own, from nothing?
Next token prediction is about predicting the future by minimizing the number of bits required to encode the past. It is fundamentally causal and has a discrete time domain. You can't predict token N+2 without having first predicted token N+1. The human brain has the same operational principles.
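To make the "bits" framing concrete, here is a minimal sketch in Python. It assumes a toy character-level bigram model with add-one smoothing, not a transformer, and the function names and the tiny corpus are made up for illustration. It shows two things: a model that has learned the past needs fewer bits per token than a uniform guess, and generation is causal, since each step conditions on the token produced at the previous step.

```python
import math
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def prob(counts, vocab, prev, nxt):
    """Add-one smoothed estimate of P(nxt | prev)."""
    seen = counts.get(prev, Counter())
    return (seen[nxt] + 1) / (sum(seen.values()) + len(vocab))

def bits_per_token(counts, vocab, text):
    """Average code length in bits for each token given its predecessor."""
    pairs = list(zip(text, text[1:]))
    total = sum(-math.log2(prob(counts, vocab, p, n)) for p, n in pairs)
    return total / len(pairs)

def generate(counts, vocab, start, steps):
    """Greedy autoregressive decoding: token N+2 depends on token N+1."""
    out = [start]
    for _ in range(steps):
        prev = out[-1]  # the model only sees what has already been produced
        out.append(max(sorted(vocab), key=lambda c: prob(counts, vocab, prev, c)))
    return "".join(out)

corpus = "abababababab"  # hypothetical toy corpus
vocab = set(corpus)
model = train_bigram(corpus)

uniform_bits = math.log2(len(vocab))             # cost with no model at all
model_bits = bits_per_token(model, vocab, corpus)  # cost under the learned model
sample = generate(model, vocab, "a", 5)

print(f"uniform: {uniform_bits:.3f} bits/token, bigram: {model_bits:.3f} bits/token")
print(sample)
```

The same loss (cross-entropy, measured in bits) is what larger models minimize; only the function that estimates P(next | past) changes.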
You didn't actually give an example of what the issue with next token prediction is. You just mentioned current constraints (i.e. generalization and learning are difficult, it needs mountains of data to train, it can't play chess very well) that are not fundamental problems. You can trivially train a transformer to play chess above the level any human can play at, and it would still be doing "next token prediction". I wouldn't be surprised if every single thing you list as a challenge is solved in a few years, either through improvement at a basic level (i.e. better architectures) or through harnessing.
We don't know how human brains produce intelligence. At a fundamental level, they might also be doing next token prediction or something similarly "dumb". Just because we know the basic mechanism of how LLMs work doesn't mean we can explain how they work and what they do, in the same way that we might know everything we need to know about neurons and still not fully grasp sentience.