Tokenization is a simplistic explanation, and likely wrong, at least in part. LLMs are perfectly capable of reciting words character by character, of handling different tokenizations of the same word when forced to (e.g. when the leading space is dropped, or when a word is broken into individual character tokens), and of complex word formation in languages that depend heavily on it. LLMs work with concepts rather than tokens.
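
To make the "same word, different tokens" point concrete, here's a minimal sketch using OpenAI's tiktoken library (the `cl100k_base` encoding is just one choice; any BPE tokenizer shows the same effect, and the exact token IDs printed will vary by encoding):

```python
# Sketch: the same surface word maps to different token sequences
# depending on context, yet all of them decode back to the same text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
with_space = enc.encode(" " + word)  # how the word usually appears mid-sentence
no_space = enc.encode(word)          # same word at the start of a string
chars = [t for c in word for t in enc.encode(c)]  # forced character-level split

print(with_space)  # one subword segmentation
print(no_space)    # typically a different segmentation
print(chars)       # one or more tokens per character

# Three different token sequences, (roughly) the same text back out:
print(enc.decode(with_space), enc.decode(no_space), enc.decode(chars))
```

A model that treats all three encodings as the same word, and answers consistently across them, is clearly operating on something above the raw token IDs.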