what bothers me is not that this issue will certainly disappear now that it has been identified, but that that we have yet to identify the category of these "stupid" bugs ...
We already know exactly what causes these bugs. They are not a fundamental problem of LLMs, they are a problem of tokenizers. The actual model simply doesn't get to see the same text that you see. It can only infer this stuff from related info it was trained on. It's as if someone asked you how many 1s there are in the binary representation of this text. You'd also need to convert it first to think it through, or use some external tool, even though your computer never saw anything else.
We already know exactly what causes these bugs. They are not a fundamental problem of LLMs, they are a problem of tokenizers. The actual model simply doesn't get to see the same text that you see. It can only infer this stuff from related info it was trained on. It's as if someone asked you how many 1s there are in the binary representation of this text. You'd also need to convert it first to think it through, or use some external tool, even though your computer never saw anything else.