
docmechanic · 04/23/2025 · 3 replies

That’s only true if you tokenize words rather than characters. Character-level tokenization lets the model emit strings that never appeared in its training data.
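A minimal sketch of the point, using a toy character vocabulary (the alphabet and the word here are purely illustrative):

```python
# Toy character-level vocabulary: every string over this alphabet is
# encodable, even words that never occurred in any training corpus.
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}

novel_word = "zorbleflux"  # hypothetical word, not in any training set
ids = [vocab[ch] for ch in novel_word]
print(ids)  # a valid token sequence the model could, in principle, emit
```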


Replies

selfhoster11 · 04/24/2025

All major tokenisers have explicit support for encoding arbitrary byte sequences. There's usually a consecutive range of tokens reserved for the bytes 0x00 to 0xFF, so you can encode any novel UTF-8 word or structure with it, including emoji and characters that weren't part of the model's initial training, if you show it some examples.
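A quick sketch of this behaviour using tiktoken's byte-level BPE (cl100k_base); the exact placement of the 256 byte tokens varies by tokeniser, so treat the details as illustrative:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # byte-level BPE

# Any UTF-8 string encodes to tokens, including emoji and made-up words,
# because unseen sequences fall back to single-byte tokens.
text = "novel word: zörkflug 🦜"
tokens = enc.encode(text)
print(tokens)
print(enc.decode(tokens) == text)  # True: the round trip is lossless
```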

asdff · 04/24/2025

Why stop there? Just have it spit out the state of the bits on the hardware. English seems like a serious shackle for an LLM.

emaro · 04/24/2025

Kind of, but character-based tokens make it a lot harder and more expensive to learn semantics.
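A rough illustration of the cost difference, again using cl100k_base as a stand-in for a subword tokeniser; since self-attention cost grows quadratically with sequence length, the character-level sequence is far more expensive to process:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Character tokenization makes every sequence several times longer."

n_chars = len(text)            # sequence length under character-level tokens
n_bpe = len(enc.encode(text))  # sequence length under subword (BPE) tokens
print(n_chars, n_bpe)          # the character sequence is several times longer
```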
