There's no "trick behind the scenes" there. You can actually see the entire trick being performed right in front of you. You're just not paying attention.
That trick? The LLM succeeded by first spelling the entire word out, letter by letter.
It's much easier for an LLM to perform "tokenized word -> letters -> letter counts" than "tokenized word -> letter counts" in one pass. But it doesn't know that! It copies human behavior from human text, and humans never had to deal with tokenization when writing that text!
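To see why, look at what the tokenizer actually hands the model. Here's a minimal sketch using OpenAI's tiktoken library (the cl100k_base encoding is just one example; exact splits vary by tokenizer, so the commented outputs are illustrative):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era BPE encoding

word = "strawberry"
spelled = " ".join(word)  # "s t r a w b e r r y"

# The raw word collapses into a few multi-character tokens,
# so the model never directly "sees" individual letters.
print([enc.decode([t]) for t in enc.encode(word)])
# e.g. ['str', 'aw', 'berry']

# Spelling it out first yields roughly one token per letter,
# turning letter counting into something the model can actually read off.
print([enc.decode([t]) for t in enc.encode(spelled)])
# e.g. ['s', ' t', ' r', ' a', ' w', ' b', ' e', ' r', ' r', ' y']
```

The raw word arrives as a few opaque chunks; the spelled-out version arrives letter-sized. That's the whole reason the intermediate spelling step works.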
You can either teach the LLM that trick explicitly, or just do RLVR (reinforcement learning with verifiable rewards) on diverse tasks and hope it discovers tricks like this on its own.
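For the RLVR route, all you need is a reward you can check programmatically. A toy sketch of what a verifiable reward for letter counting could look like (the function name and signature are hypothetical, not any particular RL framework's API):

```python
def letter_count_reward(word: str, letter: str, model_answer: str) -> float:
    """1.0 if the model's final answer matches the true count, else 0.0."""
    truth = word.count(letter)
    try:
        return 1.0 if int(model_answer.strip()) == truth else 0.0
    except ValueError:
        return 0.0  # unparseable answer -> no reward

# The reward never says "spell it out first"; a policy that stumbles
# onto that trick simply collects more reward, and the behavior sticks.
print(letter_count_reward("strawberry", "r", "3"))  # 1.0
```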