Aren’t coding copilots based on tokenizing programming language keywords and syntax? That seems to me to be domain-specific tokenization (and a very well-defined one, since programming languages are designed to be tokenizable).
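To make "designed to be tokenizable" concrete, here is a minimal sketch using Python's standard-library `tokenize` module. The helper name `lex` is mine, purely for illustration. Note this is the compiler-style notion of a token; whether a given copilot's model vocabulary actually lines up with it is a separate question.

```python
import io
import keyword
import tokenize

# Layout-only tokens we skip so the output shows just the "words" of the code.
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER,
        tokenize.INDENT, tokenize.DEDENT}

def lex(source: str):
    """Return (kind, text) pairs for a snippet of Python source (hypothetical helper)."""
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        kind = tokenize.tok_name[tok.type]
        # The lexer reports keywords as NAME; relabel them for clarity.
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            kind = "KEYWORD"
        pairs.append((kind, tok.string))
    return pairs

print(lex("return x + 1\n"))
# [('KEYWORD', 'return'), ('NAME', 'x'), ('OP', '+'), ('NUMBER', '1')]
```

Every piece of the input lands in exactly one well-defined category, which is the property I mean by "well defined".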
Math is a bit trickier, since most of the world’s math is written in LaTeX, which describes how an expression is typeset rather than what its syntax tree is. There would need to be a conversion to MathML or something more symbolic.
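A rough sketch of the distinction, under my own assumptions: the `Sym` and `BinOp` classes below are a made-up toy tree, not any real library, and the second renderer emits Content MathML-style markup for just two operators. The point is only that the tree is the symbolic object, and LaTeX is one of several ways to print it.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Sym:
    name: str

@dataclass(frozen=True)
class BinOp:
    op: str          # "plus" or "divide" (toy subset, chosen for illustration)
    left: "Expr"
    right: "Expr"

Expr = Union[Sym, BinOp]

def to_latex(e: Expr) -> str:
    """Render the tree as typesetting instructions."""
    if isinstance(e, Sym):
        return e.name
    if e.op == "divide":
        return r"\frac{" + to_latex(e.left) + "}{" + to_latex(e.right) + "}"
    return to_latex(e.left) + " + " + to_latex(e.right)

def to_content_mathml(e: Expr) -> str:
    """Render the tree as operator-applied-to-arguments markup."""
    if isinstance(e, Sym):
        return f"<ci>{e.name}</ci>"
    return (f"<apply><{e.op}/>"
            f"{to_content_mathml(e.left)}{to_content_mathml(e.right)}</apply>")

expr = BinOp("divide", BinOp("plus", Sym("a"), Sym("b")), Sym("c"))
print(to_latex(expr))           # \frac{a + b}{c}
print(to_content_mathml(expr))  # <apply><divide/><apply><plus/><ci>a</ci><ci>b</ci></apply><ci>c</ci></apply>
```

Going from the tree to LaTeX is easy; the hard direction is the one I'm pointing at, recovering the tree from LaTeX that was only ever written to look right on the page.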
Even English word tokenization has gaps today. Claude 3.5 Sonnet still fails on the question “how many r’s are there in strawberry”.
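A small sketch of why that question is awkward for a token-based model. The split `["str", "aw", "berry"]` and the integer IDs are hypothetical, picked only for illustration; I am not claiming this is what any real tokenizer produces.

```python
word = "strawberry"

# At the character level the answer is trivial.
print(word.count("r"))  # 3

# Hypothetical subword split (an assumption, not real tokenizer output).
pieces = ["str", "aw", "berry"]
assert "".join(pieces) == word

# The model receives opaque IDs, not letters (IDs here are made up).
vocab = {piece: i for i, piece in enumerate(pieces)}
ids = [vocab[piece] for piece in pieces]
print(ids)  # [0, 1, 2]

# The letter count is spread across pieces and invisible in the IDs themselves.
r_per_piece = {piece: piece.count("r") for piece in pieces}
print(r_per_piece)  # {'str': 1, 'aw': 0, 'berry': 2}
```

To answer correctly, the model has to have learned the spelling behind each ID, since nothing in `ids` says which pieces contain an "r".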