That’s only true if you tokenize words rather than characters. With character tokenization, the model can generate strings that were never in the training vocabulary.
Why stop there? Just have it spit out the state of the bits on the hardware. English seems like a serious shackle for an LLM.
Kind of, but character-based tokens make it a lot harder and more expensive to learn semantics.
All major tokenisers have explicit support for encoding arbitrary byte sequences. There's usually a consecutive range of tokens reserved for 0x00 to 0xFF, and you can encode any novel UTF-8 words or structures with it, including emoji and characters that weren't part of the model's initial training, if you show it some examples.
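For anyone curious, here's a toy sketch of the byte-fallback idea (a made-up two-token vocab, not any real tokeniser's exact layout): ids 0–255 stand in for the raw bytes, learned tokens sit above them, and anything outside the learned vocabulary decomposes into its UTF-8 bytes, so every string round-trips, unseen emoji included.

```python
# Toy byte-fallback encoder: ids 0-255 are reserved for raw bytes 0x00-0xFF,
# learned tokens start at 256. Not any real tokeniser's scheme, just the idea.
LEARNED = {"hello": 256, " world": 257}              # stand-in for a trained vocab
BY_ID = {v: k.encode("utf-8") for k, v in LEARNED.items()}

def encode(text: str) -> list[int]:
    data = text.encode("utf-8")
    ids, i = [], 0
    while i < len(data):
        # Greedily take the longest learned token starting here, else emit one byte.
        for tok, tok_id in sorted(LEARNED.items(), key=lambda kv: -len(kv[0])):
            piece = tok.encode("utf-8")
            if data.startswith(piece, i):
                ids.append(tok_id)
                i += len(piece)
                break
        else:
            ids.append(data[i])                      # byte-fallback token 0x00-0xFF
            i += 1
    return ids

def decode(ids: list[int]) -> str:
    return b"".join(BY_ID.get(t, bytes([t])) for t in ids).decode("utf-8")

text = "hello world 🦕"
print(encode(text))                                  # [256, 257, 32, 240, 159, 166, 149]
assert decode(encode(text)) == text                  # lossless, emoji and all
```

The emoji never appears in the vocab, but it still encodes and decodes cleanly as four byte tokens, which is exactly why byte fallback removes the "can't output anything outside its vocabulary" objection.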