LLMs are fed token IDs that come out of a tokenizer, not characters. They don't even have any concept of a character.
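To make that concrete, here's a rough sketch using OpenAI's tiktoken library with its cl100k_base encoding (my pick purely for illustration) — the model never sees the string itself, only the list of integer IDs:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # a GPT-4-era BPE vocabulary
    ids = enc.encode("The quick brown fox")
    print(ids)                                   # a list of integers, one per token
    print([enc.decode([i]) for i in ids])        # the multi-character chunks those IDs map back to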
You are in a discussion where you are just miles out of your depth. Go read LLMs 101 somewhere.
You're the one out of your depth ...
LLMs are trained to predict. Once they've seen enough training samples of words being spelled out, they'll have learnt that, in a spelling context, the tokens comprising a word predict the tokens comprising its spelling.
Once they've learnt the letters predicted by each token, they'll be able to do this for any word (i.e. token sequence).
Of course, you could just try it for yourself - ask an LLM to break a non-dictionary nonsense word like "asdpotyg" into a letter sequence.
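To see why that's non-trivial, a tokenizer will typically chop a made-up word like that into a few multi-character chunks it has never seen spelled out together. A quick sketch (again using tiktoken's cl100k_base encoding as an example; the exact split will differ between tokenizers):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("asdpotyg")
    # Prints the sub-word chunks the model actually receives. To spell the word,
    # it has to have learnt which letters make up each chunk, since it never sees characters.
    print([enc.decode([i]) for i in ids])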
If the LLM hasn't learned the letters that comprise input tokens, how do you explain this sort of behaviour?
https://chatgpt.com/share/678e95cf-5668-8011-b261-f96ce5a33a...
It can literally spell out words, one letter per line.
Seems pretty clear to me the training data contained sufficient information for the LLM to figure out which tokens correspond to which letters.
And it's no surprise the training data would contain such content - it'd be pretty easy to synthetically generate misspellings, and being able to deal with typos and OCR mistakes gracefully would be useful in many applications.