Hacker News

Scene_Cast2 · yesterday at 3:24 PM

I realized that with tokenization, there's a theoretical bottleneck when predicting the next token.

Let's say that we have 15k unique tokens (going by modern open models). Let's also say that we have an embedding dimensionality of 1k. This implies that we have at most 1k degrees of freedom (i.e. rank) in our output. The model can pick any single one of the 15k tokens as the top token, but the expressivity of the _probability distribution_ over them is inherently limited to 1k linearly independent components.
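
A scaled-down numpy sketch of what I mean (sizes shrunk from 15k/1k so the rank check runs quickly; the structure is the same):

```python
import numpy as np

# Scaled-down version of the 15k-vocab / 1k-dim setup (1.5k x 100 here
# so the rank check is fast); the structure is identical.
VOCAB, DIM = 1_500, 100
rng = np.random.default_rng(0)

W = rng.normal(size=(VOCAB, DIM))   # output ("unembedding") projection
h = rng.normal(size=DIM)            # final hidden state at one position

logits = W @ h                      # one logit per vocab entry
print(logits.shape)                 # (1500,)

# Every reachable logit vector lies in the column space of W, whose rank
# is at most DIM -- the distribution over 1.5k tokens has only 100
# degrees of freedom.
print(np.linalg.matrix_rank(W))     # 100
```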


Replies

molf · yesterday at 4:22 PM

The key insight is that you can represent different features by vectors that aren't exactly perpendicular, just nearly perpendicular (say, between 85 and 95 degrees apart). If you tolerate that much noise, the number of vectors you can fit grows exponentially with the number of dimensions.

12288 dimensions (GPT-3 size) can fit more than 40 billion nearly perpendicular vectors. [1]

[1]: https://www.3blue1brown.com/lessons/mlp#superposition
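
A quick numpy check of the near-perpendicularity part (random vectors rather than learned features, and N is an arbitrary sample size):

```python
import numpy as np

# Sample random unit vectors in a 12288-dimensional space and measure
# their pairwise angles.
D, N = 12_288, 1_000
rng = np.random.default_rng(0)

V = rng.normal(size=(N, D)).astype(np.float32)
V /= np.linalg.norm(V, axis=1, keepdims=True)

cos = V @ V.T                         # pairwise cosine similarities
np.fill_diagonal(cos, 0.0)            # ignore self-similarity

angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
print(angles.min(), angles.max())     # typically everything lands near 87-93 degrees
```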

blackbear_ · yesterday at 3:52 PM

While the theoretical bottleneck is there, it is far less restrictive than what you are describing, because the number of almost-orthogonal vectors grows exponentially with the ambient dimensionality. And near-orthogonality is what matters for telling different vectors apart: since any distribution can be expressed as a mixture of Gaussians, the number of separate concepts that you can encode with such a mixture also grows exponentially.
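
A crude way to see the exponential scaling (a union-bound estimate for random unit vectors, not a tight packing count): the chance that one pair deviates from perpendicular by more than a cosine tolerance eps falls off like exp(-d*eps^2/2), so roughly exp(d*eps^2/4) vectors fit before any pair is likely to collide.

```python
import numpy as np

# Union-bound estimate: random unit vectors in d dims have pairwise cosines
# concentrated near 0, with P(|cos| > eps) <~ 2*exp(-d * eps**2 / 2).
# Requiring no violating pair among ~N**2 pairs gives N ~ exp(d * eps**2 / 4).
# Crude estimate only, not a tight count.
def rough_capacity(d, deviation_deg=5.0):
    eps = np.cos(np.radians(90.0 - deviation_deg))   # tolerated |cosine|
    return np.exp(d * eps**2 / 4)

for d in (6_144, 12_288, 24_576):
    print(d, f"{rough_capacity(d):.1e}")   # doubling d squares the capacity
```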

imurray · yesterday at 5:31 PM

A PhD thesis that explores some aspects of the limitation: https://era.ed.ac.uk/handle/1842/42931

Detecting and preventing unargmaxable outputs in bottlenecked neural networks, Andreas Grivas (2024)

unoti · yesterday at 3:40 PM

I imagine there's actually combinatorial power in there, though. If we embed something with only 2 dimensions, x and y, we can in principle encode an unlimited number of concepts as distinct, well-separated clusters or neighborhoods spread out over a large 2D map. It's of course much more powerful with more dimensions.
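
A toy version of that picture (a grid of 10k "concept" centroids in just 2 dimensions, with a bit of noise; the numbers are arbitrary):

```python
import numpy as np

# 10,000 "concepts" laid out as a 100 x 100 grid of centroids in 2 dims.
# Noisy points near a centroid still decode back to the right concept.
rng = np.random.default_rng(0)
side = 100
centroids = np.stack(np.meshgrid(np.arange(side), np.arange(side)), -1).reshape(-1, 2).astype(float)

concept = rng.integers(len(centroids), size=500)                     # pick some concepts
points = centroids[concept] + rng.normal(scale=0.1, size=(500, 2))   # jittered 2D embeddings

# With centroids on the integer grid, nearest-centroid decoding is just rounding.
decoded = np.rint(points)
print((decoded == centroids[concept]).all(axis=1).mean())            # ~1.0
```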

incognito124 · yesterday at 6:14 PM

(I left academia a while ago, this might be nonsense)

If I remember correctly, that's not true because of the nonlinearities, which give the model more expressivity. The transformation from 15k to 1k is rarely just an affine map; it's usually highly non-linear.
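
As a generic illustration of that last point only (nothing to do with any particular model's layers, just that one added nonlinear feature can do what no affine map can), XOR is the classic example:

```python
import numpy as np

# XOR: no affine map of (x1, x2) reproduces it, but adding a single ReLU
# feature makes an exact fit possible.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

A = np.c_[X, np.ones(4)]                         # affine features only
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.abs(A @ w - y).max())                   # 0.5: wrong on every input

relu = np.maximum(X[:, 0] + X[:, 1] - 1.0, 0.0)  # one hidden ReLU unit
A2 = np.c_[X, relu, np.ones(4)]
w2, *_ = np.linalg.lstsq(A2, y, rcond=None)
print(np.abs(A2 @ w2 - y).max())                 # ~0: exact fit
```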

kevingadd · yesterday at 4:16 PM

It seems like you're assuming that models are trying to predict the next token. Is that really how they work? I would have assumed that tokenization is an input-only measure, so you have perhaps up to 50k unique input tokens available, but the output is raw text, synthesized speech, or an image. The output is not tokens, so there are no such limitations on the output.
