Hacker News

p1esk (last Thursday at 3:48 AM)

I glanced at these links and it seems that all these attention variants still use QKV projections.
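For concreteness, here is a minimal sketch of the QKV-projection pattern I mean, i.e. standard scaled dot-product attention; the names, shapes, and NumPy implementation are illustrative, not taken from any of the linked papers:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qkv_attention(X, W_q, W_k, W_v):
    """Standard scaled dot-product attention: learned Q/K/V projections of the
    same input X, followed by softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # (n, d) each
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # (n, n) token-token logits
    return softmax(scores, axis=-1) @ V      # (n, d) weighted values

# toy usage: 4 tokens, model width 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(qkv_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```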

Do you see any issues with my interpretation of them?


Replies

D-Machine (last Thursday at 4:03 AM)

Read the third link (the review paper); it is not at all the case that all attention is based on QKV projections.
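One counterexample in that spirit is Synthesizer-style "dense" attention, where the (n x n) mixing weights are predicted directly from each token by a small MLP instead of from query-key dot products. The sketch below is my own illustration under those assumptions, not a variant quoted from the review:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_synthesizer_attention(X, W1, W2, W_v):
    """Attention-style mixing WITHOUT query-key dot products: the (n, n)
    mixing weights are produced directly from each token by a two-layer MLP,
    so there are no Q/K projections and no token-token similarity scores."""
    logits = np.maximum(X @ W1, 0.0) @ W2   # (n, n): row i mixes all tokens for token i
    A = softmax(logits, axis=-1)
    return A @ (X @ W_v)

rng = np.random.default_rng(0)
n, d, h = 4, 8, 16
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, h))
W2 = rng.normal(size=(h, n))   # output width equals the sequence length n
W_v = rng.normal(size=(d, d))
print(dense_synthesizer_attention(X, W1, W2, W_v).shape)  # (4, 8)
```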

Your terms "sensitivity", "visibility", and "important" are too vague and lack any clear mathematical meaning, so IMO they add nothing to the understanding. "Important" also seems factually wrong: because these layers are stacked, later weights and operations can inflate or reverse things (see the toy example below). Deriving feature importances from self-attention layers, for instance, remains a highly disputed area ([1] vs. [2], for just the tip of the iceberg).
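Here is a contrived toy example of that point: a token can receive a large attention weight and still contribute nothing to the final output, because a later linear layer can project its value away. All the numbers are made up for illustration:

```python
import numpy as np

a = np.array([0.9, 0.1])            # attention weights: token 0 looks "important"
v = np.array([[1.0, 0.0],           # value vector of token 0
              [0.0, 1.0]])          # value vector of token 1
attn_out = a @ v                    # [0.9, 0.1]

W_later = np.array([[0.0],          # a later layer that ignores the first coordinate
                    [1.0]])
final = attn_out @ W_later          # [0.1] -- only token 1's value survives
print(attn_out, final)              # attention weight != downstream importance
```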

You are also assuming that what matters about attention is the highly specific QKV structure and projections, but, based on the third review link I shared, there is very little reason to believe that. Or, if you'd like another example of why not to focus so much on scaled dot-product attention, note that it is just a special case of a broader class of multiplicative interactions (https://openreview.net/pdf?id=rylnK6VtDH).
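Concretely, the general multiplicative-interaction form in that paper is f(x, z) = z^T W x + U z + V x + b with a 3-way tensor W, and the scaled dot-product logit q^T k / sqrt(d) is the special case with a single scalar output and U = V = b = 0. A quick numerical check of that claim (shapes and names are mine, not the paper's):

```python
import numpy as np

def multiplicative_interaction(x, z, W, U, V, b):
    """General multiplicative interaction f(x, z) = z^T W x + U z + V x + b,
    where W is a 3-way tensor (one bilinear form per output dimension)."""
    bilinear = np.einsum('i,ijk,j->k', z, W, x)   # z^T W x for each output k
    return bilinear + U @ z + V @ x + b

rng = np.random.default_rng(0)
d = 8
x_i, x_j = rng.normal(size=d), rng.normal(size=d)       # query token, key token
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# Special case: one scalar output, W[:, :, 0] = W_q W_k^T / sqrt(d), U = V = b = 0.
W = (W_q @ W_k.T / np.sqrt(d))[:, :, None]               # shape (d, d, 1)
U, V, b = np.zeros((1, d)), np.zeros((1, d)), np.zeros(1)

mi_logit = multiplicative_interaction(x_j, x_i, W, U, V, b)[0]
attn_logit = (x_i @ W_q) @ (x_j @ W_k) / np.sqrt(d)      # standard attention logit
print(np.isclose(mi_logit, attn_logit))                  # True
```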

[1] Attention is not Explanation - https://arxiv.org/abs/1902.10186

[2] Attention is not not Explanation - https://arxiv.org/abs/1908.04626
