
danielmarkbruce · 01/21/2025

Two answers: 1 - ChatGPT isn't an LLM, it's an application using one or many LLMs plus other tools (and it likely routes a question like this to a function that splits the string).

2 - even for a single model 'call':

It can be explained with the following training samples:

"tree is spelled t r e e" and "tree has 2 e's in it"

The problem is, the LLM has seen something like:

8062, 382, 136824, 260, 428, 319, 319

and

19816, 853, 220, 17, 319, 885, 306, 480

For a lot of words, it will have seen data that results in it saying something sensible. But it's fragile. If LLMs used character-level tokenization, you'd see the first example repeat the token for "e" inside "tree", rather than "tree" getting its own token.
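A minimal sketch with OpenAI's tiktoken library shows the effect (o200k_base is assumed here; the exact IDs vary by encoding, so treat the numbers as illustrative):

    # Show how the two phrasings tokenize: "tree" tends to be a single
    # token, while the spelled-out letters each get their own token.
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")

    for text in ["tree is spelled t r e e", "tree has 2 e's in it"]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(ids)     # raw token IDs the model actually sees
        print(pieces)  # the text each ID decodes back to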

There are all manner of tradeoffs made in a tokenization scheme. One example: OpenAI changed how runs of spaces are tokenized so that models would produce better Python code.
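A rough way to see that tradeoff is to compare an older encoding with a later one on an indented Python line (the specific encodings and the claim that the later one handles whitespace runs more compactly are my assumptions; exact counts depend on the tiktoken version):

    # Compare token counts for the same indented line across two encodings.
    # r50k_base is the older GPT-2/GPT-3 encoding; p50k_base is the later
    # Codex-era encoding that treats runs of spaces more compactly.
    import tiktoken

    line = "        return x + 1  # deeply indented Python"
    for name in ["r50k_base", "p50k_base"]:
        enc = tiktoken.get_encoding(name)
        print(name, len(enc.encode(line)))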