Yeah, at this point I presume they have an agent that downloads new LLM-related papers as they come out and adds every edge case to the training set asap.
Is tokenization extremely efficient at compressing text into short sequences? Yes. Does it fundamentally break character-level understanding? Also yes. The only fix, short of ditching tokenization, is endless memorization of per-token character facts.
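To make that concrete, here's a toy sketch (made-up vocabulary and token IDs, not any real tokenizer): a token-level model only ever sees opaque integer IDs, so character questions like "how many r's in strawberry" can only be answered by memorizing character facts per token.

```python
# Toy vocabulary -- IDs and entries are invented for illustration.
vocab = {"straw": 101, "berry": 102}

def toy_tokenize(text, vocab):
    # Greedy longest-match tokenizer, a crude stand-in for BPE.
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

ids = toy_tokenize("strawberry", vocab)
print(ids)  # [101, 102] -- the letters are gone at this point.

# To count the r's from IDs alone, the model effectively needs a
# memorized table of character facts for every token in the vocab:
r_count = {101: 1, 102: 2}  # "straw" has 1 r, "berry" has 2
print(sum(r_count[t] for t in ids))  # 3
```

The point of the sketch: nothing in `[101, 102]` encodes spelling, so scaling this to a real vocabulary means memorizing such facts for tens of thousands of tokens, which is the "endless memorization" above.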