Hacker News

taeric 10/11/2024

This reminds me of the riddle of someone buying the numerals to put their address on their house. When you are looking at text, the point is that all you have are the characters/symbols/tokens/whatever you want to call them. You can't really shepherd some over to their numeric value while leaving others at their token value, unless you want to cause other issues when it comes time to reason about them later.

I'd hazard that the majority of numbers in most text are not such that they should be converted to a number, per se. Consider addresses, postal codes, phone numbers, ... ok, I may have run out of things to consider. :D


Replies

TZubiri10/14/2024

Perhaps I'm just missing fundamentals on tokenization.

But I fail to see how forcing tokenization at the digit level for numbers would somehow impact non-numerical meanings of digits. The same characters always map to the same tokens through a simple mapping, right? It's not like context and meaning change tokenization.

That is:

my credit card ends in 4796 and my address is N street 1331

Parses to the same tokens as:

Multiply 4796 by 1331

So by tokenizing at the digit level we don't introduce the problem of tokens having different meanings depending on context.
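To make that concrete, here is a minimal sketch of a toy digit-level tokenizer. It is not any real model's tokenizer; the `tokenize` and `digit_tokens` names and the regex are assumptions made up for illustration. It only shows that a context-free mapping yields the same digit tokens in both sentences:

```python
import re

def tokenize(text: str) -> list[str]:
    # Toy rule (an assumption, not a real tokenizer):
    # every digit is its own token, a run of letters is one token,
    # and any other non-space character is its own token.
    return re.findall(r"\d|[^\W\d_]+|[^\w\s]|_", text)

def digit_tokens(text: str) -> list[str]:
    # Keep only the tokens that are single digits.
    return [t for t in tokenize(text) if t.isdigit()]

address = "my credit card ends in 4796 and my address is N street 1331"
arithmetic = "Multiply 4796 by 1331"

print(tokenize(arithmetic))
print(digit_tokens(address) == digit_tokens(arithmetic))
```

The tokenizer never looks at the surrounding words, so whatever distinction there is between "address" digits and "arithmetic" digits has to be made later, by whatever consumes the tokens.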
