I think you're right. Try asking GPT-5 this:
> Are the parentheses in ((((()))))) balanced?
There was a thread about this the other day [1]. It's the same issue as "count the r's in strawberry." Tokenization makes it hard to count characters. If you put that string into OpenAI's tokenizer, [2] this is how they are grouped:
Token 1: ((((
Token 2: ()))
Token 3: )))
Which, of course, isn't at all how a human would group them in order to keep track of them.
[1] https://news.ycombinator.com/item?id=47615876 [2] https://platform.openai.com/tokenizer
Don’t ask the LLM to do that directly: ask it to write a program to answer the question, then have it run the program. It works much better that way.
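A minimal sketch of the kind of program the model could write and run (the function name here is my own, not from the thread):

```python
def is_balanced(s: str) -> bool:
    """Return True if every ')' closes an earlier '('."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared with no matching '('
                return False
    return depth == 0

# The string from the question: 5 opens, 6 closes
print(is_balanced("((((())))))"))   # → False

# Same trick handles the classic character-counting failure
print("strawberry".count("r"))      # → 3
```

Running code like this sidesteps tokenization entirely, since the interpreter sees individual characters rather than merged tokens.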
Does AI performance drop if it uses individual letters as tokens rather than multi-character tokens?
This is mostly because people wrongly assume that LLMs can count things. Just because it looks like it can, doesn't mean it can.
Try to get your favourite LLM to read the time from a clock face. It'll fail ridiculously most of the time, and come up with all kinds of wonky reasons for the failures.
It can code things it's seen the logic for before. That's not the same as counting; that's outputting what it's previously seen as proper code (and even then it often fails, probably 'cos there's a lot of crap code out there).