Hacker News

TZubiri · 10/14/2024 · 1 reply

Perhaps I'm just missing fundamentals on tokenization.

But I fail to see how forcing tokenization at the digit level for numbers would somehow impact non-numerical meanings of digits. The same characters always map to the same tokens through a simple mapping, right? It's not like context and meaning change tokenization:

That is:

my credit card ends in 4796 and my address is N street 1331

Parses to the same tokens as:

Multiply 4796 by 1331

So by tokenizing digits we don't introduce the problem of tokens taking different meanings depending on context.
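
A minimal sketch of what such a digit splitter could look like (a hypothetical regex pre-tokenizer in Python, not any particular library's implementation); the split is purely character-based, so context never changes the result:

    import re

    # Hypothetical pre-tokenizer: every digit becomes its own token,
    # regardless of surrounding context; non-digit runs stay whole.
    def split_digits(text):
        return re.findall(r"\d|[^\d]+", text)

    print(split_digits("my credit card ends in 4796"))
    # ['my credit card ends in ', '4', '7', '9', '6']
    print(split_digits("Multiply 4796 by 1331"))
    # ['Multiply ', '4', '7', '9', '6', ' by ', '1', '3', '3', '1']

"4796" maps to the same token sequence in both sentences, whatever it means there.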


Replies

taeric · 10/14/2024

I think I see your point, but how would you want to include localized numbers, such as 1,024, in a stream? Would you assume all 0x123-style numbers are hex, since that is a common norm? Does the tokenizer already know how to read scientific notation, 1e2 for example?

That is all to say that numbers in text are already surprisingly flexible. The point of learned tokens is to let the model learn that flexibility. It is the same reason we don't tokenize at the word level, or try to get a Soundex normalization. All of these are probably worth at least trying, and may even do better in some contexts. The general framework has a reason to be, though.
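
For what it's worth, the hypothetical splitter sketched above wouldn't fail on those formats, but it encodes no knowledge of what the pieces mean; it is left entirely to the model to learn that ',' can be a thousands separator, '0x' a hex prefix, and 'e' an exponent marker:

    import re

    # Same hypothetical splitter as in the sketch above.
    def split_digits(text):
        return re.findall(r"\d|[^\d]+", text)

    for s in ["1,024", "0x123", "1e2"]:
        print(s, "->", split_digits(s))
    # 1,024 -> ['1', ',', '0', '2', '4']
    # 0x123 -> ['0', 'x', '1', '2', '3']
    # 1e2 -> ['1', 'e', '2']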