I think tokenization is probably not going anywhere, but higher layers need the ability to inspect 'raw' data on demand. You don't spell out most words as you read them, but you can bring the focus of your entire mind to the spelling of the word strawberry if you so choose. Models need that ability as well.
Couldn’t this be solved by replacing the fixed tokenizer with a learned sub-model that produces the token representations, and then training the whole stack end-to-end as one larger model? The goal would be to make tokenization a function of the model’s input.
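A minimal sketch of the idea, under assumptions not in the comment: raw bytes go in, a small "tokenizer sub-model" embeds and pools them into a shorter sequence of continuous token vectors, and in the full proposal its parameters would be trained jointly with the main model by backpropagating through it. The fixed patch size here stands in for boundaries the end-to-end model would learn; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 256   # raw bytes, so no hand-built tokenizer vocabulary
D_EMBED = 32  # byte embedding width (arbitrary for this sketch)
PATCH = 4     # bytes pooled into one "learned token"; fixed here,
              # but an end-to-end model would learn these boundaries

# Parameters of the hypothetical tokenizer sub-model. In the proposal
# these would be trained jointly with the downstream model.
byte_embed = rng.normal(scale=0.02, size=(VOCAB, D_EMBED))
proj = rng.normal(scale=0.02, size=(D_EMBED, D_EMBED))

def soft_tokenize(text: str) -> np.ndarray:
    """Map raw bytes to a shorter sequence of continuous 'token' vectors."""
    b = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    b = np.pad(b, (0, (-len(b)) % PATCH))        # pad to a patch multiple
    e = byte_embed[b]                            # (n_bytes, D_EMBED)
    patches = e.reshape(-1, PATCH, D_EMBED).mean(axis=1)  # pool bytes
    return patches @ proj                        # (n_bytes/PATCH, D_EMBED)

tokens = soft_tokenize("strawberry")
print(tokens.shape)  # → (3, 32): 10 bytes padded to 12, pooled 4-to-1
```

Because the byte embeddings survive inside the sub-model, this shape also speaks to the first comment: higher layers could in principle attend back to the per-byte representations when spelling-level detail matters.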