A sequence of characters is grouped into a "token," and the set of all such possible sequences forms a vocabulary. For example: strawberry -> straw | ber | ry -> 3940, 3231, 1029 -> [vector for each token]. The raw input to the model is not a sequence of characters but a sequence of token embeddings, each a learned vector for a specific chunk of characters. These embeddings contain no explicit information about the individual characters within the token. As a result, if the model needs to reason about characters (for example, to count the number of letters in a word), it must have memorized the character composition of each token. Given that large models like GPT-4 use vocabularies with 100k–200k tokens, it's not surprising that the model hasn't memorized the full character breakdown of every token; I can't imagine that many "character-level" questions exist in the training data.
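To make the pipeline concrete, here's a toy sketch of the strawberry example. The vocabulary, the greedy matching rule, and the embedding size are all made up for illustration; real tokenizers (BPE and friends) learn their merges from data and use different matching schemes:

```python
import random

# Hypothetical vocabulary using the ids from the example above.
vocab = {"straw": 3940, "ber": 3231, "ry": 1029}

def tokenize(word, vocab):
    """Greedy longest-prefix match against the toy vocabulary."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return tokens

chunks = tokenize("strawberry", vocab)
ids = [vocab[c] for c in chunks]

# Each id indexes a row of a learned embedding matrix; here we fake one.
random.seed(0)
embedding_table = {tid: [random.random() for _ in range(4)] for tid in vocab.values()}
vectors = [embedding_table[i] for i in ids]

print(chunks)  # ['straw', 'ber', 'ry']
print(ids)     # [3940, 3231, 1029]
# `vectors` is what the model actually sees: nothing in those numbers
# says "straw" contains an 'r', let alone how many.
```

The point of the fake embedding table is that the mapping from id to vector is arbitrary from the characters' point of view; any character knowledge has to be memorized during training.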
In contrast, if the model were trained with a character-level vocabulary, where each character maps to a unique token, it would not need to memorize character counts for entire words. Instead, it could potentially learn a generalizable method for counting characters across all sequences, even for words it has never seen before.
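As a toy illustration of why counting becomes easy at the character level (using ord() for the ids is an arbitrary choice, just to show each character gets its own token):

```python
def char_tokenize(word):
    """One token per character; the id scheme is arbitrary for illustration."""
    return [ord(c) for c in word]

ids = char_tokenize("strawberry")

# Counting a letter is now a generalizable operation over the token
# sequence itself, with no per-word memorization required.
r_count = sum(1 for t in ids if t == ord("r"))
print(len(ids), r_count)  # 10 tokens, 3 of them are 'r'
```

The same counting procedure works on any word, seen or unseen, which is exactly the generalization the subword vocabulary makes hard.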
I'm not sure what you mean by them not "seeing" the tokens. They definitely receive a representation of each token as input.
It isn't at all obvious to me that an LLM can decide to blur its vision, so to speak, and see the tokens as tokens. It doesn't get to run a program on its input in some raw format, and even if it writes a program and runs it in a sandbox, it has to "remember" what it was given and regenerate it rather than copy it. (Granted, a tool could give it access to its input history, but at that point the tool likely sees characters anyway.) I am 100% with andy99 on this: it isn't anywhere near as simple as you are making it out to be.