Hacker News

TZubiri 10/11/2024

Wouldn't a slight change in tokenization (say, mapping single digits to single tokens) help with this specific challenge?
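To make the idea concrete, here is a minimal sketch of a digit-splitting rule applied as a pre-tokenization step. The helper name `split_digits` and the regex are assumptions for illustration only, not taken from any real tokenizer; in practice a subword vocabulary would still be applied to the non-digit pieces afterwards.

```python
import re

# Hypothetical pre-tokenization rule: each digit becomes its own piece,
# while runs of non-digit, non-whitespace characters stay together.
_PIECE = re.compile(r"\d|[^\d\s]+")


def split_digits(text: str) -> list[str]:
    """Split text so that every digit stands alone as a single piece."""
    return _PIECE.findall(text)


if __name__ == "__main__":
    print(split_digits("12345 + 678"))
```

With a rule like this, a number such as 12345 is always seen as five separate digit pieces rather than as whatever multi-digit chunks the vocabulary happens to contain.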


Replies

wenc 10/11/2024

Aren’t coding copilots based on tokenizing programming language keywords and syntax? That seems to me to be domain specific tokenization (a very well defined one too — since programming languages are meant to be tokenizable).

Math is a bit trickier since most of the world’s math is in LaTeX, which is more of a formatting language than a syntax tree. There needs to be a conversion to MathML or something more symbolic.

Even English word tokenization has gaps today. Claude 3.5 Sonnet still fails on the question “how many r’s are there in strawberry”.

bob1029 10/11/2024

Context-specific tokenization sounds a lot like old fashioned programming.