Forgive me if this is a naive assumption, but wouldn't large language models work fundamentally differently for a language that is written largely in symbols? Again, my understanding of Mandarin is limited, if it exists at all.
"飞机" and "airplane" aren't fundamentally different in terms of how they're represented to a computer. Especially for an LLM, where tokenization likely turns each of those into a single token.
All tokens are symbols. All of the frontier models speak Mandarin.