Hacker News

philsnow, today at 3:27 PM

I'm reminded of the caveman skill of the clipped writing style used in telegrams, and your post further reminded me of the "standard" books of telegram abbreviations. Take a look at [0]: could we train models to use this kind of code and then decode it in the browser? These are "rich" tokens (they succinctly carry a lot of information).

[0] https://books.google.com/books?id=VO4OAAAAYAAJ&pg=PA464#v=on...
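To make the idea concrete, here is a minimal sketch of codebook-style encoding and decoding. The codebook entries below are hypothetical placeholders, not taken from the linked telegram book; a real system would need a much larger, unambiguous codebook.

```python
# Hypothetical codebook entries, invented for illustration only.
CODEBOOK = {
    "please advise at earliest convenience": "ADVEX",
    "awaiting your reply": "AWREP",
    "shipment delayed": "SHIPDEL",
}
DECODE = {code: phrase for phrase, code in CODEBOOK.items()}

def encode(text: str) -> str:
    # Replace each known phrase with its short code word.
    for phrase, code in CODEBOOK.items():
        text = text.replace(phrase, code)
    return text

def decode(text: str) -> str:
    # Expand code words back into the original phrases.
    for code, phrase in DECODE.items():
        text = text.replace(code, phrase)
    return text

msg = "shipment delayed, please advise at earliest convenience"
coded = encode(msg)
assert decode(coded) == msg
```

The decode step is cheap string substitution, which is why doing it client-side in the browser, as the comment suggests, seems plausible.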


Replies

derefr, today at 5:06 PM

I would point out that the default BPE tokenization vocabulary used by many models (cl100k_base) is already a pretty powerful shorthand. It has a lot of short tokens, sure. But then:

Token ID 73700 is the literal entire (space-prefixed) word " strawberry". (Which neatly explains the "strawberry problem.")

Token ID 27128 is " cryptocurrency". (And 41698 is " disappointment".)

Token ID 44078 is " UnsupportedOperationException"!

Token ID 58040 is 128 spaces in a row (and is the longest token in the vocabulary.)

You'd be surprised how well this vocabulary can compress English prose — especially prose interspersed with code!
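As a toy illustration of why long whole-word tokens compress prose so well, here is a greedy longest-match tokenizer over a tiny invented vocabulary. This is not the real cl100k_base vocabulary, and real BPE tokenizers apply learned merge rules rather than longest-match lookup; it only shows the effect of having a single token for a whole word like " strawberry".

```python
# Tiny invented vocabulary (not cl100k_base): a few multi-character
# tokens plus single-character fallbacks.
VOCAB = [" strawberry", " cryptocurrency", " disappointment",
         " straw", " berry", " ate", " is", " a"]
VOCAB += list(" abcdefghijklmnopqrstuvwxyz.I")
VOCAB = sorted(set(VOCAB), key=len, reverse=True)  # try longest tokens first

def tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        for tok in VOCAB:
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            # Character not in vocab: emit it as-is.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("I ate a strawberry."))
# ['I', ' ate', ' a', ' strawberry', '.']
```

Because " strawberry" is one opaque token, the model never directly sees its letters, which is one popular explanation of the letter-counting "strawberry problem" mentioned above.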

beau_g, today at 8:50 PM

For a while I missed the ability, used all the time in Stable Diffusion prompts, to use parentheses and floats to assign different weights to different parts of the prompt. The more I thought about how that would work in an LLM, though, the more I realized it's just reinventing code syntax; you could just give a code snippet to the LLM in the prompt.
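The "it's just code syntax" point can be made concrete with a small parser. This is a hypothetical sketch of the common Stable Diffusion "(phrase:1.3)" weighting convention, turning a prompt into explicit (text, weight) pairs; at that point the prompt really is a tiny programming language.

```python
import re

def parse_prompt(prompt: str) -> list[tuple[str, float]]:
    """Split a prompt into (text, weight) pairs.

    Unweighted spans get weight 1.0; "(phrase:1.4)" spans get 1.4.
    Sketch only: real SD frontends handle nesting and bare parens too.
    """
    parts = []
    pos = 0
    for m in re.finditer(r"\(([^():]+):([0-9.]+)\)", prompt):
        if m.start() > pos:
            parts.append((prompt[pos:m.start()], 1.0))
        parts.append((m.group(1), float(m.group(2))))
        pos = m.end()
    if pos < len(prompt):
        parts.append((prompt[pos:], 1.0))
    return parts

print(parse_prompt("a photo of (a red fox:1.4) in snow"))
# [('a photo of ', 1.0), ('a red fox', 1.4), (' in snow', 1.0)]
```

Once the structure is this explicit, passing an equivalent snippet of actual code to the LLM, as the comment suggests, conveys the same information without inventing new syntax.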