> You seem to think that predicting s t -> s t is easier than predicting st (single token) -> s t.
Yes, it is significantly easier to train a model to do the first than the second across any real vocabulary. If you don't understand why, maybe go back to basics.
No, because it still has to learn what to predict when "spelling" is called for. There's no magic just because the predicted token sequence is the same as the predicting one (+/- any quotes, commas, etc.).
And ...
1) If the training data isn't there, it still won't learn it.
2) Having to learn that the predictive signal is a multi-token pattern (s t) vs. a single-token one (st) isn't making things any simpler for the model (see the tokenizer sketch below).
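For what it's worth, here's a minimal sketch of the distinction being argued over, assuming the `tiktoken` library and its GPT-2 encoding ("instruct" is just an arbitrary example word, not one taken from this thread):

```python
# Compare how a BPE tokenizer splits a word vs. its spelled-out form.
# Assumes tiktoken (pip install tiktoken) and the "gpt2" encoding.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

word = "instruct"
spelled = " ".join(word)  # "i n s t r u c t"

# Decode each token id on its own so the token boundaries are visible.
print(repr(word), "->", [enc.decode([t]) for t in enc.encode(word)])
print(repr(spelled), "->", [enc.decode([t]) for t in enc.encode(spelled)])
```

With a merged BPE vocabulary like GPT-2's, the first line usually prints a couple of multi-character tokens while the second prints roughly one token per letter, i.e. the single-token-in, multi-token-out mapping that the spelling task has to bridge.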
Clearly you've decided to go based on personal belief rather than actually testing for yourself, so the conversation is rather pointless.