This paper has a good solution:

cschmidt • yesterday at 6:45 PM • 3 replies

You tokenize right to left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digit groups are in the vocab, it does much better.

https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas.
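For anyone who wants to see the idea concretely, here is a minimal sketch of right-to-left digit grouping done purely with a regex. The pattern and the function names are my own illustration, not the regex from either paper: a lookahead requires that the digits remaining after each chunk come in complete groups of 3.

```python
import re

# Hypothetical pattern (an assumption, not taken from either paper):
# match 1-3 digits, but only if the rest of the digit run is a
# whole number of 3-digit groups. This aligns chunks from the right.
R2L_DIGITS = re.compile(r"\d{1,3}(?=(?:\d{3})*(?!\d))")

# The "default" behaviour described above: greedy groups of up to 3
# digits taken from the left.
L2R_DIGITS = re.compile(r"\d{1,3}")


def chunk_right_to_left(text):
    """Return the digit chunks of text, grouped in 3s from the right."""
    return R2L_DIGITS.findall(text)


def chunk_left_to_right(text):
    """Return the digit chunks of text, grouped in 3s from the left."""
    return L2R_DIGITS.findall(text)


if __name__ == "__main__":
    print(chunk_right_to_left("1234567"))  # ['1', '234', '567']
    print(chunk_left_to_right("1234567"))  # ['123', '456', '7']
```

A real pre-tokenizer would fold a digit rule like this into its full splitting pattern alongside the rules for words, whitespace, and punctuation; this sketch only covers the digit runs.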

Replies

nielsole • today at 7:58 AM

Isn't that the opposite of the bitter lesson - adding more cleverness to the architecture?

Y_Y • today at 8:23 AM

What do the vector space embeddings for digit strings even look like? Can you do arithmetic on them? If that's even desirable, then it seems like you could just skip "embedding" altogether and intern all the numbers along one dimension.

jvanderbot • yesterday at 11:58 PM

Ok great! This is precisely how I chunk numbers for comparison. And not to diminish a solid result or the usefulness of it or the baseline tech: it's clear that if we keep having to create situation-specific inputs or processes, we're not at AGI with this baseline tech.
