Aren’t coding copilots based on tokenizing programming language keywords and syntax? That seems to m...

wenc • 10/11/2024 • 1 reply • view on HN

Aren’t coding copilots based on tokenizing programming language keywords and syntax? That seems to me to be domain specific tokenization (a very well defined one too — since programming languages are meant to be tokenizable).

Math is a bit trickier since most of the world’s math is in LaTeX, which is more of a formatting language than a syntax tree. There needs to be a conversion to MathML or something more symbolic.

Even English word tokenization has gaps today. Claude Sonnet 3.5 still fails on the question “how many r’s are there in strawberry”.

Replies

gwillen • 10/11/2024

> Aren’t coding copilots based on tokenizing programming language keywords and syntax?

No, they use the same tokenization as everyone else. There was one major change from early to modern LLM tokenization, made (as far as I can tell) for efficient tokenization of code: early tokenizers always made a space its own token (unless attached to an adjacent word.) Modern tokenizers can group many spaces together.

alt Hacker News

Replies