You're still not getting it ...
Why would an LLM need to "break down" tokens into letters to do spelling?! That's just not how they work - they work by PREDICTION. If you ask an LLM to break a word into a sequence of letters, it is NOT actually decomposing anything - it is doing the only thing it was ever trained to do: predicting which tokens (based on the training samples) most likely follow such a request. That mapping is something it can easily learn given a few examples in the training set.
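To make that concrete, here is a rough sketch of what a spelling request looks like from the model's side. It assumes the tiktoken package and its cl100k_base encoding; the prompt is an invented example, not a quote from any training set.

    # What the model "sees" for a spelling request: integer ids, no letters.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    prompt = "Spell the word strawberry letter by letter:"
    ids = enc.encode(prompt)
    print(ids)  # a flat list of token ids

    # All the model can do from here is predict which ids come next.
    # If the training data contains enough "Spell X: s-t-r-a-w..." pairs,
    # the link from the 'strawberry' token(s) to the letter tokens is just
    # another learned pattern, not an act of decomposition.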
The LLM can't; that's what makes it relatively difficult. The tokenizer can.
Run it through your head with character-level tokenization. Imagine the attention calculations. See how easy it would be? See how few samples would be required? Spelling is a trivial thing when the tokenizer breaks everything down to characters.
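Here's a toy sketch of that thought experiment (the function names are mine): with character-level tokenization, spelling out a word is literally the identity map over the input ids.

    # Toy character-level "tokenizer": every character is its own token.
    def char_tokenize(text):
        return [ord(c) for c in text]

    def char_detokenize(ids):
        return "".join(chr(i) for i in ids)

    word = "strawberry"
    ids = char_tokenize(word)
    print(ids)                           # one id per letter
    print(char_detokenize(ids) == word)  # True: the spelling IS the sequence

    # Attention just has to learn "copy the input tokens, one per step",
    # which is why so few training samples would be needed.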
Consider the amount and specificity of training data required to learn spelling 'games' under current tokenization schemes: vocabularies of 100,000-plus tokens, many of which sit close together in high-dimensional embedding space but are spelled very differently. Then consider the various datasets that teach spelling through phonetic information - they'd be tokenized in ways that confuse a model.
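You can see the mismatch directly, again assuming tiktoken's cl100k_base vocabulary (the phonetic respelling below is an invented example):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.n_vocab)  # roughly 100k tokens

    # The plain word, its spelled-out form, and a phonetic respelling
    # tokenize into almost completely unrelated id sequences.
    for form in ["strawberry", "s-t-r-a-w-b-e-r-r-y", "STRAW-buh-ree"]:
        pieces = [enc.decode_single_token_bytes(i) for i in enc.encode(form)]
        print(form, "->", pieces)

A model has to memorize those links word by word; nothing in the ids lets it read the characters off.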
Look, maybe go build one. Your head will spin once you start dealing with the various types of training data and how different tokenization changes things. It screws up spelling, math, code, technical biology material, financial material. I specifically build models for financial markets, and it's an issue.