
D-Machine | last Thursday at 2:34 AM

Don't get caught up in interpreting QKV; it's a waste of time, since completely different attention formulations (e.g. merged attention [1]) still give you the similarity / multiplicative interactions and may even work better [2]. EDIT: Oh, and attention is much broader than scaled dot-product attention [3].

[1] https://www.emergentmind.com/topics/merged-attention

[2] https://blog.google/innovation-and-ai/technology/developers-...

[3] https://arxiv.org/abs/2111.07624
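
To make the point concrete, here is a toy NumPy sketch (not the merged-attention formulation from [1], and the shared_projection_attention name and weight shapes are made up for illustration): the token-token similarity, i.e. the multiplicative interaction, survives even if you replace the separate Q/K projections with a single shared one.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(X, Wq, Wk, Wv):
        # Standard formulation: separate Q/K/V projections of the same input.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # token-token similarities
        return softmax(scores) @ V

    def shared_projection_attention(X, W, Wv):
        # Hypothetical simplified variant: one shared projection feeds the
        # similarity computation instead of distinct Q and K. The
        # multiplicative interaction between tokens is still there; only
        # the parameterization of the similarity changes.
        H = X @ W
        scores = H @ H.T / np.sqrt(H.shape[-1])
        return softmax(scores) @ (X @ Wv)

    rng = np.random.default_rng(0)
    T, d = 4, 8
    X = rng.standard_normal((T, d))
    Wq, Wk, Wv, W = (rng.standard_normal((d, d)) for _ in range(4))
    print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
    print(shared_projection_attention(X, W, Wv).shape)        # (4, 8)

Both variants produce a weighted mixture of value vectors driven by pairwise similarities; which parameterization of that similarity works best is an empirical question, which is the point of the links above.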


Replies

p1esk | last Thursday at 3:48 AM

I glanced at these links and it seems that all these attention variants still use QKV projections.

Do you see any issues with my interpretation of them?
