
p1esk last Thursday at 1:19 AM

The way I think about the QKV projections: Q defines the sensitivity of token i's features when computing this token's similarity to all other tokens. K defines the visibility of token j's features when it is selected by all other tokens. V defines which features matter when taking the weighted sum over all tokens.
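A minimal numpy sketch of scaled dot-product attention that maps each projection onto that reading (the dimensions, random weights, and softmax helper are placeholders for illustration, not anyone's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy sizes: n tokens, d_model input features, d_k projected size (all made up).
n, d_model, d_k = 4, 8, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d_model))      # token features
W_q = rng.normal(size=(d_model, d_k))  # Q: how token i's features count when it scores others
W_k = rng.normal(size=(d_model, d_k))  # K: how token j's features are seen when others score it
W_v = rng.normal(size=(d_model, d_k))  # V: which features token j contributes to the weighted sum

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_k)   # similarity of token i to every token j
weights = softmax(scores, axis=-1)  # each row sums to 1
out = weights @ V                 # weighted sum of value vectors

print(out.shape)  # (4, 8)
```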


Replies

D-Machine last Thursday at 2:34 AM

Don't get caught up in interpreting QKV; it is a waste of time, since completely different attention formulations (e.g. merged attention [1]) still give you the similarities / multiplicative interactions, and may even work better [2]. EDIT: Oh, and attention is much broader than scaled dot-product attention [3].

[1] https://www.emergentmind.com/topics/merged-attention

[2] https://blog.google/innovation-and-ai/technology/developers-...

[3] https://arxiv.org/abs/2111.07624
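To illustrate the point in [3] that scaled dot-product scoring is only one option: here is a hedged numpy sketch contrasting it with an additive (Bahdanau-style) score, which involves no Q·K dot product at all. This is just one standard alternative chosen for illustration, not the merged attention of [1]; all sizes and weights are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d = 4, 8  # toy sizes, made up
rng = np.random.default_rng(1)
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Scaled dot-product scoring (the Transformer default).
dot_scores = Q @ K.T / np.sqrt(d)                 # (n, n)

# Additive (Bahdanau-style) scoring: a small MLP over each (q_i, k_j) pair.
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=(d,))
hidden = np.tanh((Q @ W1)[:, None, :] + (K @ W2)[None, :, :])  # (n, n, d)
add_scores = hidden @ v                                        # (n, n)

# Both scoring functions feed the same softmax-weighted sum over values.
out_dot = softmax(dot_scores) @ V
out_add = softmax(add_scores) @ V
print(out_dot.shape, out_add.shape)  # (4, 8) (4, 8)
```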
