
libraryofbabel, last Thursday at 1:32 AM

This is really useful, thanks. In my other (top-level) comment, I mentioned some vague dissatisfaction with how, in explanations of attention, the Q, K, V matrices always seem to be pulled out of a hat after being motivated in a hand-wavy, metaphorical way. The kernel methods treatment looks much more mathematically general and clean, although for that reason it's maybe less approachable without a math background. But as a recovering applied mathematician, I ultimately much prefer "here is a general form, now let's make some clear assumptions to make it specific" to "here are some random matrices you have to combine in a particular way by murky analogy to human attention and databases."
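To make that concrete, here is a minimal numpy sketch (my own, not from the linked treatment) of attention in kernel-smoothing form: each output is a normalized, kernel-weighted average of the values, and the specific choice of an exponential dot-product kernel recovers ordinary softmax attention. The name `kernel_attention` and the toy shapes are just for illustration.

    import numpy as np

    def kernel_attention(queries, keys, values, kernel):
        """General form: each output is a normalized, kernel-weighted
        average of the value vectors, weighted by kernel(query, key)."""
        weights = np.array([[kernel(q, k) for k in keys] for q in queries])
        weights /= weights.sum(axis=1, keepdims=True)   # row-normalize
        return weights @ values

    # Specific assumption: an exponential dot-product kernel, which
    # recovers standard softmax attention (1/sqrt(d) is the usual scaling).
    d = 4
    exp_kernel = lambda q, k: np.exp(q @ k / np.sqrt(d))

    rng = np.random.default_rng(0)
    Q = rng.standard_normal((3, d))   # 3 queries
    K = rng.standard_normal((5, d))   # 5 keys
    V = rng.standard_normal((5, 2))   # 5 values
    print(kernel_attention(Q, K, V, exp_kernel).shape)   # (3, 2)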

I'll make a note to read up on kernels some more. Do you have any other reading recommendations for doing that?


Replies

Atheb, last Thursday at 1:07 PM

> how in explanations of attention the Q, K, V matrices always seem to be pulled out of a hat after being motivated in a hand-wavy metaphorical way.

Justin Johnson's lecture on attention mechanisms [1] really helped me understand the concept of attention in transformers. In the lecture he goes through the history and iterations of attention mechanisms, from CNNs and RNNs to Transformers, while keeping the notation coherent, so you get to see how and when the QKV matrices appear in the literature. It's an hour long, but IMO it's a must-watch for anyone interested in the topic.

[1]: https://www.youtube.com/watch?v=YAgjfMR9R_M
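For readers who want the endpoint of that progression in code: in standard scaled dot-product self-attention, Q, K, and V are nothing more than learned linear projections of the same input sequence. A minimal numpy sketch (the helper names and toy sizes are mine, not taken from the lecture):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        """Scaled dot-product self-attention: Q, K, V are just learned
        linear projections of the same input sequence X."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # (tokens, tokens)
        return softmax(scores) @ V

    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 8))               # 6 tokens, model dim 8
    Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)    # (6, 8)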

vatsachak, last Thursday at 3:01 AM

https://arxiv.org/abs/2008.02217

They derive Q, K, V as a continuous analog of a Hopfield network.
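Roughly, the update rule in that paper ("Hopfield Networks is All You Need") pulls a state toward a softmax-weighted combination of stored patterns, which is exactly the shape of attention. A minimal numpy sketch of that retrieval step, with illustrative names and a toy example of my own rather than the paper's code:

    import numpy as np

    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    def hopfield_update(xi, X, beta=1.0):
        """One retrieval step of a modern continuous Hopfield network:
        the state xi moves to a softmax-weighted mix of stored patterns."""
        # X: (d, N) stored patterns as columns; xi: (d,) query / state
        return X @ softmax(beta * X.T @ xi)

    rng = np.random.default_rng(1)
    X = rng.standard_normal((16, 5))               # 5 stored patterns
    xi = X[:, 2] + 0.1 * rng.standard_normal(16)   # noisy cue
    retrieved = hopfield_update(xi, X, beta=8.0)
    # At high beta the softmax concentrates on the best-matching pattern,
    # so `retrieved` should be close to X[:, 2].
    print(np.argmax(X.T @ retrieved))              # expected: 2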

ACCount37, last Thursday at 1:25 PM

That's kind of how applied ML is most of the time.

The neat chain of "this is how the math of it works" is constructed after the fact, once you've dialed something in and proven that it works. If ever.