Hacker News

pona-a · yesterday at 5:48 PM

Didn't tokenization already have one bitter lesson: that it's better to let simple statistics guide the splitting, rather than expert morphology models? Would this technically be a more bitter lesson?
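(Editor's note: the "simple statistics" referred to here is the byte-pair-encoding family of tokenizers. A minimal sketch of the idea, with an assumed `bpe_train` helper name, is just a loop that merges the most frequent adjacent symbol pair, with no linguistic knowledge at all:)

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: start from characters, repeatedly merge the
    most frequent adjacent pair. Frequency only, no morphology."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for syms in corpus:
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Greedily merge the single most frequent pair.
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        merged = a + b
        new_corpus = []
        for syms in corpus:
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and syms[i] == a and syms[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus
```

On `["lower", "lowest", "low"]`, two merges are enough to turn "low" into a single token, purely because "lo" and then "low" are the most frequent pairs.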


Replies

empiko · yesterday at 7:17 PM

Agreed completely. There is a ton of research into how to represent text, and these simple tokenizers consistently perform at SOTA levels. The bitter lesson is that you should not worry about it that much.

kingstnap · yesterday at 8:37 PM

Simple statistics aren't some be-all and end-all, though. There was a huge improvement in Python coding performance from fixing how indentation in Python code is tokenized.

Specifically, they added dedicated tokens for runs of 4, 8, 12, or 16 spaces (or something along those lines).
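(Editor's note: a hedged sketch of what such a pre-tokenization pass might look like. The `<indentN>` token names and the greedy widths are illustrative assumptions, not the actual vocabulary of any particular model; the point is that deep indentation becomes one token instead of many single-space tokens:)

```python
# Hypothetical indent pre-tokenizer: map a line's leading spaces onto
# dedicated whitespace-run tokens, greedily taking the widest first.
INDENT_WIDTHS = (16, 12, 8, 4)

def indent_tokens(line):
    """Return the list of indent tokens covering the line's leading spaces."""
    n = len(line) - len(line.lstrip(' '))
    tokens = []
    while n > 0:
        for w in INDENT_WIDTHS:
            if n >= w:
                tokens.append(f"<indent{w}>")
                n -= w
                break
        else:
            # Fewer than 4 spaces left over: keep them as literal spaces.
            tokens.append(' ' * n)
            n = 0
    return tokens
```

With this scheme, a line indented 8 spaces costs a single `<indent8>` token, whereas a plain character-level or naive BPE split might spend several tokens on the same whitespace.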