
MontyCarloHall, last Thursday at 2:13 AM

The confusing thing about attention in this article (and the famous "Attention is all you need" paper it's derived from) is the heavy focus on self-attention. In self-attention, Q/K/V are all derived from the same input tokens, so it's hard to distinguish their respective purposes.

I find attention much easier to understand in the original attention paper [0], which focuses on cross-attention for machine translation. In translation, the input sentence to be translated is tokenized into vectors {x_1, ..., x_n}, and the translated sentence is autoregressively generated as tokens {y_1, ..., y_m}. To generate y_j, the model scores the previously generated token y_{j-1} against every x_i with the bilinear form s_{i,j} = x_i*K*y_{j-1}, where K is the Key matrix. These scores are softmaxed into a weight vector a_j = softmax_i(s_{i,j}). A weighted average of X = [x_1|...|x_n] is then taken with respect to a_j and transformed by the Value matrix, i.e. c_j = V*X*a_j. Finally, c_j is passed to additional network layers to generate the output token y_j.
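For concreteness, here is a minimal NumPy sketch of that single decoder step. The function name, variable names, and shapes are mine, not the paper's; it just mirrors the formulas above.

    import numpy as np

    def softmax(z):
        z = z - z.max()              # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def cross_attention_step(X, y_prev, K, V):
        # X      : (d_enc, n) encoder token vectors, columns [x_1 | ... | x_n]
        # y_prev : (d_dec,)   previously generated output token y_{j-1}
        # K      : (d_enc, d_dec) Key matrix (bilinear scoring)
        # V      : (d_out, d_enc) Value matrix (output transform)
        s = X.T @ K @ y_prev         # s_{i,j} = x_i * K * y_{j-1}: one score per input token
        a = softmax(s)               # a_j = softmax_i(s_{i,j}): weights over input tokens
        c = V @ (X @ a)              # c_j = V * X * a_j: weighted average, then transform
        return c                     # passed to further layers to produce y_j

The decoder feeds c back through its remaining layers to emit y_j, which then becomes y_prev for the next step, and so on until the translation is complete.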

tl;dr: given the previous output token, compute its similarity to each input token (via K). Use those similarity scores to compute a weighted average across all input tokens, and use that weighted average to generate the next output token (via V).

Note that in this paper, the Query matrix is not used explicitly. It can be thought of as a token preprocessor: rather than computing s_{i,j} = x_i*K*y_{j-1}, each x_i is first linearly transformed by some matrix Q, giving s_{i,j} = (Q*x_i)*K*y_{j-1}. Because this paper used an RNN (specifically, an LSTM) to encode the tokens, such transformations of the input tokens are implicit in each LSTM module.
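Under that framing, adding a Query matrix just means preprocessing each x_i before the same scoring step. A hypothetical variant, reusing softmax and the shapes from the sketch above (again, my names, not the paper's):

    def cross_attention_step_with_q(X, y_prev, Q, K, V):
        # Q : (d_q, d_enc); K must now be (d_q, d_dec) to match
        Xq = Q @ X                   # transform every x_i first
        s = Xq.T @ K @ y_prev        # s_{i,j} = (Q * x_i) * K * y_{j-1}
        a = softmax(s)
        return V @ (X @ a)           # c_j as before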

[0] https://arxiv.org/pdf/1508.04025 (predates "Attention is all you need" by 2 years)


Replies

D-Machine, last Thursday at 2:17 AM

Very much this. Cross-attention and the x, y notation make the similarity/covariance matrix far clearer and more intuitive.

Also, forget the terms "query", "key", and "value", and the vague analogies to key-value stores; IMO that is a largely false analogy, and certainly not a helpful way to understand what is happening.

kavalg, last Thursday at 4:42 PM

Isn't Bahdanau attention even earlier [0]?

[0] https://arxiv.org/abs/1409.0473