cschmidt • 06/25/2025 • 2 replies • view on HN

I suppose it is. There is a lot to tokenization (pre-tokenization, how to handle digits, the tokenization training approach) that is about adding cleverness. In the long run, the bitter lesson would be to just get rid of it all and learn from more data. Many people would love to do that. But I think in the case of BLT, digits will still be an issue. There is no way an autoregressive entropy model will be able to split numbers sensibly, since it has no idea how many digits are coming, so it seems likely to struggle more with arithmetic. Perhaps you could reverse all the digits in a number; then it has a chance. So 12334 becomes 43321, and the model gets to start from the ones digit. This has been suggested as an approach for LLMs.
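To make the reversal idea concrete, here is a rough Python sketch, not anything taken from BLT or a real tokenizer. The function names `reverse_digit_runs` and `add_little_endian` are made up for illustration. The first reverses every run of digits in a string; the second adds two already-reversed numbers while emitting digits strictly left to right, which is the order an autoregressive model generates in. The point is that with the ones digit first, each output digit depends only on digits already seen plus a carry.

```python
import re


def reverse_digit_runs(text: str) -> str:
    """Reverse each maximal run of ASCII digits; leave everything else alone."""
    return re.sub(r"[0-9]+", lambda m: m.group(0)[::-1], text)


def add_little_endian(a: str, b: str) -> str:
    """Add two reversed (ones-digit-first) numbers, producing digits left to right.

    Hypothetical helper: the result is also in reversed form.
    """
    out = []
    carry = 0
    for i in range(max(len(a), len(b))):
        da = int(a[i]) if i < len(a) else 0
        db = int(b[i]) if i < len(b) else 0
        # Each output digit needs only the current digits and the carry so far.
        carry, digit = divmod(da + db + carry, 10)
        out.append(str(digit))
    if carry:
        out.append(str(carry))
    return "".join(out)


print(reverse_digit_runs("12334 + 789"))   # 43321 + 987
print(add_little_endian("43321", "987"))   # 32131, i.e. 13123 read normally
```

Reversal is its own inverse, so applying `reverse_digit_runs` to the model's output would restore the usual digit order. With the normal big-endian order, the first digit of a sum can depend on carries from digits that have not been generated yet.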

Replies

infogulch • 06/25/2025

Little endian wins in the end.

pas • 06/25/2025

... why does reversing all the digits help? could you please explain it? many thanks!
