Hacker News

eigenspace today at 8:01 AM

They're text generators, but you can think of them as basically operating with a different alphabet than us. When they are given text input, it's not in our alphabet, and when they produce text output, it's also not in our alphabet. So when you ask them what letters are in a given word, they're literally just guessing.

Rather, they use tokens, which are usually chunks of 2-8 characters. You can play around with how text gets tokenized here: https://platform.openai.com/tokenizer

_____

For example, the above text I wrote has 504 characters, but 103 tokens.
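
To make the "different alphabet" point concrete, here's a toy sketch in Python. It is not how any real tokenizer works: the vocabulary, the IDs, and the `toy_tokenize` helper are all made up for illustration, and the greedy longest-match rule is a stand-in for the learned merges that real byte-pair-encoding tokenizers use. The only point is that the model receives a short list of integers, not letters.

```python
# Hypothetical vocabulary: token strings mapped to made-up IDs.
# Real tokenizers learn tens of thousands of entries from data.
TOY_VOCAB = {
    "straw": 101,
    "berry": 102,
    "token": 103,
    "izer": 104,
    "s": 105,
}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of `text` into vocabulary IDs.

    Characters the vocabulary does not cover fall back to a
    per-character ID (the negated code point), so nothing is dropped.
    """
    ids = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            # No vocabulary entry starts here: emit a fallback ID.
            ids.append(-ord(text[i]))
            i += 1
    return ids


word = "strawberry"
ids = toy_tokenize(word)
print(len(word), "characters ->", len(ids), "tokens:", ids)
# prints: 10 characters -> 2 tokens: [101, 102]
```

From the model's side, "strawberry" is just `[101, 102]` in this toy setup. Nothing in those two numbers says how many r's the word has, which is why a letter-counting question ends up being answered from memory or by guessing.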


Replies

klibertp today at 1:57 PM

For Latin alphabet-based languages, it's pretty similar to how names from those languages are transliterated into Japanese or Korean. You get "Clare" in English and (what, to me, sounds like) "Kurea" in Japanese; equivalent (I'm told!) but not the same. It would be wrong to try to assess the IQ of Japanese speakers (who don't know English) by asking about properties of the original word that are not shared by the Japanese equivalent. On the other hand, English speakers won't ever experience haiku fully, since the script plays a big role in the composition (according to what I'm told... I don't know Japanese, but anime intake exposed me to opinions like this; and even if I'm dead wrong on the details, it sounds like a plausible analogy, at least...)