
resters | yesterday at 8:02 PM

Tokenization as a form of preprocessing has the problems the authors mention. But it is also a useful way to think about data vs. metadata, and about moving beyond text/image I/O into other domains. Ultimately we need symbolic representations of things. Sure, everything is bytes in the end, and a model could learn to self-organize raw bytes, but symbolic units are useful when humans interact with the data directly: in a sense, tokens make more aspects of LLM internals "human readable". And models should also be able to learn to overcome the limitations of a particular tokenization scheme.
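
To make the "human readable" point concrete, here's a minimal sketch using OpenAI's tiktoken library (the choice of vocabulary is incidental, just an illustration): the same string as raw UTF-8 bytes is a list of opaque integers, while as BPE tokens each unit decodes to a recognizable chunk of text.

    import tiktoken

    text = "Tokenization is preprocessing."
    enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era BPE vocabulary

    # Byte view: one opaque integer per UTF-8 byte.
    print(list(text.encode("utf-8")))

    # Token view: fewer units, each decodable to a human-readable chunk.
    ids = enc.encode(text)
    print([enc.decode_single_token_bytes(i) for i in ids])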