Hacker News

danielmarkbruce · 01/21/2025

Go try it. I've done it.

You are going to find that, for 1), with character-level tokenization you don't need training data covering every token for the model to learn. With current tokenization schemes you do, and the model still goes haywire from time to time when tokens that are close together in embedding space are spelled very differently.
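To make the contrast concrete, here is a rough Python sketch of what a model actually sees under each scheme. It assumes the tiktoken library and its cl100k_base encoding purely as an example (neither is mentioned above), and the exact splits depend on the tokenizer:

    # Character-level tokenization: every character is its own token,
    # so the spelling of a word is directly visible in the input sequence.
    word = "strawberry"
    char_tokens = list(word)
    print(char_tokens)  # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']

    # BPE-style tokenization: the word becomes a handful of multi-character
    # tokens, and the model only ever receives their opaque integer IDs.
    import tiktoken  # assumption: pip install tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(i).decode("utf-8") for i in ids]
    print(ids, pieces)  # e.g. a few IDs mapping to pieces like ['str', 'aw', 'berry']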

Just try it; actually train one yourself.


Replies

HarHarVeryFunny · 01/21/2025

I don't doubt that training an LLM, and curating a training set, is a black art. Conventional wisdom was that up until a few years ago there were only a few dozen people in the world who knew all the tricks.

However, that is not what we were discussing.

You keep flip-flopping on how you think these successfully trained frontier models are working and managing to predict the character-level sequences represented by multi-character tokens: one minute you say it's due to having learnt from an enormous amount of data, and the next you say they must be using a split function (if that's the silver bullet, then why are you not using one yourself, I wonder).
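For what it's worth, I take the "split function" to mean a preprocessing step that exposes a word's spelling before tokenization, e.g. separating it into individual characters. A minimal sketch of that idea, with the function name my own invention rather than anything either of us described:

    # A sketch of the "split function" idea under discussion: rewrite a word so
    # each character is likely to become its own token. Illustrative only; this
    # is not anyone's actual pipeline.
    def split_for_spelling(word: str) -> str:
        return " ".join(word)

    print(split_for_spelling("strawberry"))  # s t r a w b e r r y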

Near the top of this thread you opined that the failure to count the r's in "strawberry" is "Because they can't break down a token or have any concept of it". It's a bit like saying that birds can't fly because they don't know how to apply Bernoulli's principle. Wrong conclusion, irrelevant logic. At least now you seem to have progressed to (on occasion) admitting that they may learn to predict token -> character sequences given enough data.
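To be clear about what is at stake: counting letters is trivial once text is available at the character level; the question is only whether a model that sees multi-character tokens can learn the token-to-spelling mapping. A hedged sketch of that distinction (the token IDs and pieces below are made up for illustration):

    # Counting r's is trivial on characters.
    def count_letter(text: str, letter: str) -> int:
        return sum(1 for ch in text if ch == letter)

    print(count_letter("strawberry", "r"))  # 3

    # A token-based model never receives the characters directly; it would have
    # to learn an implicit mapping from token IDs to spellings, roughly like
    # this lookup table (IDs and pieces invented for illustration).
    token_spellings = {101: "str", 202: "aw", 303: "berry"}
    ids = [101, 202, 303]
    reconstructed = "".join(token_spellings[i] for i in ids)
    print(count_letter(reconstructed, "r"))  # 3, but only via the learned mapping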

If I happen into a few million dollars of spare cash, maybe I will try to train a frontier model, but frankly it seems a bit of an expensive way to verify that if done correctly it'd be able to spell "strawberry", even if using a penny-pinching tokenization scheme.
