Hacker News

noosphr today at 3:14 AM

The Chinese alphabet is very much a dictionary. All the major tokenizers are far larger.


Replies

dpark today at 3:39 AM

That doesn’t make any sense. An alphabet is a list of valid characters. A dictionary is not just a list. Even in a language like Chinese, where individual characters carry meaning, a dictionary tells you what that meaning is. It’s not just a list of characters.

Or, to echo the article, the dictionary is made out of weights.

simonh today at 3:48 AM

A list of words isn’t a dictionary. What a dictionary adds over a list of words is all the relationships between the words that you need to interpret and use them, and all of that is in the weights.

maxbond today at 9:29 AM

It's beside the point, so I only note it out of interest, but the Chinese writing system doesn't use an alphabet (or a syllabary like Japanese kana); it's a logography.

canjobear today at 3:39 AM

A mapping of Chinese characters to integers (like a tokenizer) would not be a dictionary. You’d also need definitions. At best it’s an index to a hypothetical dictionary.
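
To make the distinction concrete, here is a minimal sketch. The three-character vocabulary, the IDs, and the `definitions` table are all made up for illustration, and real tokenizers typically work on bytes or subword pieces rather than whole characters:

```python
# Hypothetical toy vocabulary: characters mapped to arbitrary integer IDs.
# This is all a tokenizer-style mapping contains.
vocab = {"山": 0, "水": 1, "火": 2}

# A dictionary needs something more: a definition for each entry.
# (Illustrative glosses only.)
definitions = {"山": "mountain", "水": "water", "火": "fire"}


def encode(text):
    """Map each character to its integer ID."""
    return [vocab[ch] for ch in text]


def lookup(token_id):
    """Use an ID as an index into the separate definitions table."""
    inverse = {i: ch for ch, i in vocab.items()}
    return definitions[inverse[token_id]]


ids = encode("山水")
print(ids)                       # integer IDs only, no meaning attached
print([lookup(i) for i in ids])  # meaning comes from the second table
```

Nothing in `vocab` says what any character means. The IDs only become useful once they index into something else that holds the meaning, which here is the `definitions` table.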