
docmechanic · 04/23/2025 · 3 replies

That’s only true if you tokenize words rather than characters. Character-level tokenization lets the model emit strings that never appeared in its training data.
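A minimal sketch of the point, using a toy character vocabulary (the alphabet and the word here are purely illustrative):

```python
# Toy character-level vocabulary: every string over this alphabet is
# encodable, even words that never occurred in any training corpus.
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}

novel_word = "zorbleflux"  # hypothetical word, not in any training set
ids = [vocab[ch] for ch in novel_word]
print(ids)  # a valid token sequence the model could, in principle, emit
```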


Replies

selfhoster11 · 04/24/2025

All major tokenisers have explicit support for encoding arbitrary byte sequences. There's usually a consecutive range of tokens reserved for the bytes 0x00 to 0xFF, so you can encode any novel UTF-8 word or structure with it, including emoji and characters that weren't part of the model's initial training, if you show it some examples.
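A quick sketch of this behaviour using tiktoken's byte-level BPE (cl100k_base); the exact placement of the 256 byte tokens varies by tokeniser, so treat the details as illustrative:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # byte-level BPE

# Any UTF-8 string encodes to tokens, including emoji and made-up words,
# because unseen sequences fall back to single-byte tokens.
text = "novel word: zörkflug 🦜"
tokens = enc.encode(text)
print(tokens)
print(enc.decode(tokens) == text)  # True: the round trip is lossless
```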

asdff · 04/24/2025

Why stop there? Just have it spit out the state of the bits on the hardware. English seems like a serious shackle for an LLM.

emaro · 04/24/2025

Kind of, but character-based tokens make it a lot harder and more expensive to learn semantics.
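A rough illustration of the cost difference, again using cl100k_base as a stand-in for a subword tokeniser; since self-attention cost grows quadratically with sequence length, the character-level sequence is far more expensive to process:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Character tokenization makes every sequence several times longer."

n_chars = len(text)            # sequence length under character-level tokens
n_bpe = len(enc.encode(text))  # sequence length under subword (BPE) tokens
print(n_chars, n_bpe)          # the character sequence is several times longer
```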
