Hacker News

gwillen, 10/11/2024

> Aren’t coding copilots based on tokenizing programming language keywords and syntax?

No, they use the same kind of tokenization as every other LLM. There was one major change from early to modern LLM tokenization, made (as far as I can tell) for efficient tokenization of code: early tokenizers always made a space its own token (unless it was attached to an adjacent word). Modern tokenizers can group a run of many spaces into a single token.
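To make the difference concrete, here is a toy sketch in Python. It is not any real tokenizer: real ones use learned subword vocabularies, and the two regexes and the function names `early_style_tokens` and `modern_style_tokens` are made up purely to mimic the space handling described above.

```python
import re

# Hypothetical rule standing in for an early tokenizer: one space may attach
# to the word that follows it; every other whitespace character is a token
# of its own.
EARLY = re.compile(r" ?\S+|\s")

# Hypothetical rule standing in for a modern tokenizer: a run of whitespace
# is kept together as one token (leaving the last space to attach to the
# following word).
MODERN = re.compile(r" ?\S+|\s+(?!\S)|\s+")


def early_style_tokens(text: str) -> list[str]:
    """Split text so that indentation costs one token per space."""
    return EARLY.findall(text)


def modern_style_tokens(text: str) -> list[str]:
    """Split text so that a run of spaces costs a single token."""
    return MODERN.findall(text)


if __name__ == "__main__":
    line = " " * 8 + "return x"  # a line of code indented by 8 spaces
    print(early_style_tokens(line))
    print(modern_style_tokens(line))
```

On that indented line the toy early rule yields 9 tokens (seven lone spaces, then " return" and " x"), while the toy modern rule yields 3. Deeply indented code is where the savings add up.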