The core issue there isn't that the LLM fails to build internal models of its world; it's that its world is limited to tokens. Anything not represented in tokens, or in relationships between tokens, can't be modeled by the LLM, by definition.
It's like asking a blind person to count the number of colors on a car. They can give it a go and assume glass, tires, and metal are different colors, since there is likely a correlation they can draw from feeling them or discussing them. That's the best they can do, though, as they can't actually perceive color.
In this case, the LLM can't see letters, so asking it to count them makes it fall back on some proxy for that information. If it doesn't have an accurate one, then bam, strawberry has two r's.
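To make the "can't see letters" point concrete, here is a rough sketch of what subword tokenization does to a word. The vocabulary, the IDs, and the `toy_tokenize` / `to_ids` helpers are all made up for illustration; real tokenizers learn their vocabularies from data and may well split "strawberry" differently.

```python
# Hypothetical subword vocabulary; real tokenizers learn tens of thousands of pieces.
TOY_VOCAB = {"straw": 101, "berry": 102, "blue": 103}

def toy_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match split; falls back to single characters."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched, so emit a single character.
            pieces.append(word[i])
            i += 1
    return pieces

def to_ids(pieces, vocab=TOY_VOCAB):
    # Unknown single characters get a placeholder ID of 0 in this sketch.
    return [vocab.get(p, 0) for p in pieces]

pieces = toy_tokenize("strawberry")
print(pieces)                    # ['straw', 'berry']
print(to_ids(pieces))            # [101, 102]
print("strawberry".count("r"))   # 3, trivial when you have the characters
```

Under these assumptions the model's input is just `[101, 102]`. Nothing in those two IDs says how many r's each piece contains, so any count has to come from whatever the model has learned about those pieces elsewhere.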
I think a good example of LLMs building models internally is this: https://rohinmanvi.github.io/GeoLLM/
LLMs are able to encode geospatial relationships because those relationships are represented well by token relationships. Two countries that are close together will be talked about together much more often than two countries far from each other.
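A toy version of that co-occurrence signal might look like the following. This is not how GeoLLM works, just a sketch of the intuition: the sentences are invented, and `cooccurrence` is a hypothetical helper that counts how often two country names show up in the same sentence.

```python
from collections import Counter
from itertools import combinations

# Made-up sentences standing in for training text.
CORPUS = [
    "France and Germany signed a border agreement",
    "Trains run hourly between France and Germany",
    "Germany and Poland share a long border",
    "France exported wine to Japan",
    "Japan and Korea held trade talks",
]
COUNTRIES = {"France", "Germany", "Poland", "Japan", "Korea"}

def cooccurrence(corpus, names):
    """Count how often each pair of names appears in the same sentence."""
    counts = Counter()
    for sentence in corpus:
        # Sort so each pair is always stored in the same order.
        present = sorted(set(sentence.split()) & names)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

counts = cooccurrence(CORPUS, COUNTRIES)
print(counts[("France", "Germany")])  # 2
print(counts[("France", "Japan")])    # 1
```

In this invented corpus the neighbors co-occur more often than the distant pair, which is the kind of statistical trace a model could pick up on without ever seeing a map.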
That is just not a solid argument. There are countless examples of LLMs splitting "blueberry" into "b l u e b e r r y", which would contain one token per letter. And then they still manage to get it wrong.
Your argument is based on a flawed assumption: that they can't see letters. If they couldn't, they wouldn't be able to spell the word out. But they can. And even when they get one token per letter, they still miscount.
> It's like asking a blind person to count the number of colors on a car.
I presume if I asked a blind person to count the colors on a car, they would reply “sorry, I am blind, so I can’t answer this question”.