
D-Machine | last Thursday at 2:34 AM

Don't get caught up in interpreting QKV; it's a waste of time, since completely different attention formulations (e.g. merged attention [1]) still give you the similarity / multiplicative interactions and may even work better [2]. EDIT: Oh, and attention is much broader than scaled dot-product attention [3].

[1] https://www.emergentmind.com/topics/merged-attention

[2] https://blog.google/innovation-and-ai/technology/developers-...

[3] https://arxiv.org/abs/2111.07624
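
To make the point concrete, here is a toy NumPy sketch (not the merged-attention formulation from [1], and the shared_projection_attention name and weight shapes are made up for illustration): the token-token similarity, i.e. the multiplicative interaction, survives even if you replace the separate Q/K projections with a single shared one.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(X, Wq, Wk, Wv):
        # Standard formulation: separate Q/K/V projections of the same input.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # token-token similarities
        return softmax(scores) @ V

    def shared_projection_attention(X, W, Wv):
        # Hypothetical simplified variant: one shared projection feeds the
        # similarity computation instead of distinct Q and K. The
        # multiplicative interaction between tokens is still there; only
        # the parameterization of the similarity changes.
        H = X @ W
        scores = H @ H.T / np.sqrt(H.shape[-1])
        return softmax(scores) @ (X @ Wv)

    rng = np.random.default_rng(0)
    T, d = 4, 8
    X = rng.standard_normal((T, d))
    Wq, Wk, Wv, W = (rng.standard_normal((d, d)) for _ in range(4))
    print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
    print(shared_projection_attention(X, W, Wv).shape)        # (4, 8)

Both variants produce a weighted mixture of value vectors driven by pairwise similarities; which parameterization of that similarity works best is an empirical question, which is the point of the links above.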


Replies

p1esk | last Thursday at 3:48 AM

I glanced at these links and it seems that all these attention variants still use QKV projections.

Do you see any issues with my interpretation of them?
