> I have a pretty good understanding of how transformers work, but this did not make sense to me. Also, I don't understand why this strategy is applicable only to "code tokens".
Yes, there is a monstrous lack of detail here, and you should be skeptical of most of the article's claims. The language is also, IMO, non-standard: serious people don't talk about self-attention as lookup tables anymore (that was never a good analogy in the first place). And no good write-up would rely on language alone to express this; there would also be a simple equation showing the typical scaled dot-product attention formula, and then e.g. some dimension notation/details indicating which matrix (or inserted projection matrix) got a dimension of two somewhere. Otherwise the claims are inscrutable (EDIT: see edit below).
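For reference, here is the standard scaled dot-product attention formula the article should have stated, as a minimal NumPy sketch (this is the textbook equation, not anything from the article itself):

```python
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k: (seq, d_k); v: (seq, d_v)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (seq, seq) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v                   # weighted average of value rows

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 2))
k = rng.normal(size=(4, 2))
v = rng.normal(size=(4, 3))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 3)
```

Any concrete claim about "2D per head" would have to show up as a dimension in one of these matrices, which is exactly the detail the article omits.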
There are also no training details or loss function details, both of which would be necessary (and almost certainly highly novel) to make this kind of thing end-to-end trainable, which is another red flag.
EDIT: The key line seems to be around:
gate, val = ff_in(x).chunk(2, dim=-1)
and related code, plus the line "Notice: d_model = 36 with n_heads = 18 gives exactly 2D per head", but, again, this is very unclear and non-standard.
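For what it's worth, here is one hedged reading of that fragment, assuming ff_in is a linear layer mapping d_model to 2*d_model (the names ff_in, gate, and val come from the quoted code; the weight matrix and everything else here are my guesses, in NumPy rather than PyTorch):

```python
import numpy as np

d_model, n_heads = 36, 18        # numbers quoted from the article
head_dim = d_model // n_heads    # "exactly 2D per head"
assert head_dim == 2

rng = np.random.default_rng(0)
x = rng.normal(size=(5, d_model))            # (seq, d_model)
W = rng.normal(size=(d_model, 2 * d_model))  # assumed ff_in weight

out = x @ W                            # (seq, 2*d_model)
gate, val = np.split(out, 2, axis=-1)  # NumPy analogue of .chunk(2, dim=-1)
print(gate.shape, val.shape)  # (5, 36) (5, 36)
```

That would make gate/val an ordinary gated feed-forward split (as in GLU variants), which still says nothing about why it would only apply to "code tokens".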
> lookup tables anymore, that was never a good analogy in the first place

Good analogy or not, weren't hash tables the motivation for the KV tables?
Treating attention as a lookup operation is popular among computational complexity theorists (e.g. https://arxiv.org/abs/2310.03817), because it's easier to work with when you're explicitly constructing a transformer to perform a particular computation, just to demonstrate that transformers can, in theory, perform it. That's also why there are no training details: the weights are computed directly, not trained.
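To illustrate the construction style: with hand-picked one-hot keys and a large softmax scale, attention really does reduce to an exact table lookup. This is a toy sketch of the general idea, not code from the linked paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 4
keys = np.eye(n)                                        # one-hot key per slot
values = np.arange(n, dtype=float).reshape(n, 1) * 10.0  # stored values 0,10,20,30
query = np.eye(n)[2]                                    # "look up slot 2"

scale = 100.0                            # sharp softmax ~ hard argmax
weights = softmax(scale * (query @ keys.T))
result = weights @ values                # ~= values[2], i.e. 20.0
print(result)
```

The weights are constructed, not learned, so the usual questions about loss functions and trainability simply don't arise in that line of work.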