From a domain point of view, some are skeptical that bytes are adequate for modelling natural language.
If I remember correctly, GPT-3.5's tokenizer treated Cyrillic as individual characters, and GPT-3.5 was pretty good at Russian.
I wonder if they treat each letter as a Unicode code point, with each of those being a single token. I could see the same being true of other languages.
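This is easy to check directly. Here is a quick sketch assuming the tiktoken library and the cl100k_base encoding (the one used by GPT-3.5-turbo); it prints the raw bytes behind each token of a short Russian string, so you can see whether Cyrillic comes out as one token per letter or as longer merged chunks:

    import tiktoken

    # cl100k_base is the encoding used by GPT-3.5-turbo
    enc = tiktoken.get_encoding("cl100k_base")

    text = "привет мир"  # "hello world" in Russian
    tokens = enc.encode(text)

    # Decode each token back to its underlying bytes to see how
    # many Cyrillic characters (2 UTF-8 bytes each) it covers
    for t in tokens:
        print(t, enc.decode_single_token_bytes(t))

If most tokens decode to a single Cyrillic letter (two UTF-8 bytes), that matches the "one character per token" recollection; longer byte sequences would mean the BPE merged common Russian substrings into multi-character tokens.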