krackers • yesterday at 6:56 PM • 3 replies

Until I see evidence that an LLM trained at e.g. the character level _CAN_ successfully "count Rs" then I don't trust this explanation over any other hypothesis. I am not familiar with the literature so I don't know if this has been done, but I couldn't find anything with a quick search. Surely if someone did successfully do it they would have published it.

Replies

ijk • yesterday at 7:24 PM

The math tokenization research is probably closest.

GPT-2 tokenization was a demonstrable problem: https://www.beren.io/2023-02-04-Integer-tokenization-is-insa... (Prior HN discussion: https://news.ycombinator.com/item?id=39728870 )

More recent research:

https://huggingface.co/spaces/huggingface/number-tokenizatio...

Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs: https://arxiv.org/abs/2402.14903

https://www.beren.io/2024-07-07-Right-to-Left-Integer-Tokeni...
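
For anyone skimming the links: a toy sketch of the left-to-right vs. right-to-left question that the last URL's slug points at. It assumes fixed three-digit chunks, which is a simplification and not the behavior of any real tokenizer; the function names are made up for illustration.

```python
def chunk_left_to_right(digits: str, size: int = 3) -> list[str]:
    """Split a digit string into fixed-size chunks starting from the left."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_right_to_left(digits: str, size: int = 3) -> list[str]:
    """Split a digit string so that chunks align with the rightmost digit."""
    head = len(digits) % size  # length of the short leading chunk, if any
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show how the chunk boundaries shift.
    for n in ("1234", "12345", "1234567"):
        print(n, chunk_left_to_right(n), chunk_right_to_left(n))
```

With left-to-right chunking, what a chunk means positionally depends on the total length of the number; with right-to-left chunking, the last chunk always covers the ones, tens, and hundreds places.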

anonymoushn • yesterday at 11:46 PM

There are various papers about this, maybe most prominently Byte-Latent Transformer.
