This is ok (could use some diagrams!), but I don't think anyone coming to this for the first time will be able to use it to really teach themselves the LLM attention mechanism. It's a hard topic and requires two or three book chapters at least if you really want to start grokking it!
For anyone serious about coming to grips with this stuff, I would strongly recommend Sebastian Raschka's excellent book Build a Large Language Model (From Scratch), which I just finished reading. It's approachable and also detailed.
As an aside, does anyone else find the whole "database lookup" motivation for QKV kind of confusing? (in the article, "Query (Q): What am I looking for? Key (K): What do I contain? Value (V): What information do I actually hold?"). I've never really got it and I just switched to thinking of QKV as a way to construct a fairly general series of linear algebra transformations on the input of a sequence of token embedding vectors x that is quadratic in x and ensures that every token can relate to every other token in the NxN attention matrix. After all, the actual contents and "meaning" of QKV are very opaque: the weights that are used to construct them are learned during training. Furthermore, there is a lot of symmetry between Q and K in the algebra, which gets broken only by the causal mask. Or do people find this motivation useful and meaningful in some deeper way? What am I missing?
[edit: on this last question, the article on "Attention is just Kernel Smoothing" that roadside_picnic posted below looks really interesting in terms of giving a clean generalized mathematical approach to this, and also affirms that I'm not completely off the mark by being a bit suspicious about the whole hand-wavy "database lookup" Queries/Keys/Values interpretation]
I'm not a fan of the database lookup analogy either.
The analogy I prefer when teaching attention is celestial mechanics. Tokens are like planets in (latent) space. The attention mechanism is like a kind of "gravity" where each token is influencing each other, pushing and pulling each other around in (latent) space to refine their meaning. But instead of "distance" and "mass", this gravity is proportional to semantic inter-relatedness and instead of physical space this is occurring in a latent space.
The way I think about QKV projections: Q defines sensitivity of token i features when computing similarity of this token to all other tokens. K defines visibility of token j features when it’s selected by all other tokens. V defines what features are important when doing weighted sum of all tokens.
IIRC isn't the symmetry between Q and K also broken by the direction of the softmax? I mean, row vs column-wise application yields different interpretation.
I find it really confusing as well. The analogy implies we have something like Q[K] = V
For one, I have no idea how this relates to the mathematical operations of calculating attention score, applying softmax and than doing dot product with the V matrix.
Second just conceptually I don't understand how this relates to the "a word looks up to how relevant it is to another word". So if you have "The cat eats his soup", "his" queries how it's important it is to cat. So is V just numerical result of the significance, like 0.99?
I dont think Im very stupid but after seeing a dozens of these, I am starting to wonder if anyone actually understands this conceptually
Does that book require some sort of technical prerequisite to understand?
> I've never really got it and I just switched to thinking of QKV as a way to construct a fairly general series of linear algebra transformations on the input of a sequence of token embedding vectors x that is quadratic in x and ensures that every token can relate to every other token in the NxN attention matrix.
That's because what you say here is the correct understanding. The lookup thing is nonsense.
The terms "Query" and "Value" are largely arbitrary and meaningless in practice, look at how to implement this in PyTorch and you'll see these are just weight matrices that implement a projection of sorts, and self-attention is always just self_attention(x, x, x) or self_attention(x, x, y) in some cases (e.g. cross-attention), where x and y are are outputs from previous layers.
Plus with different forms of attention, e.g. merged attention, and the research into why / how attention mechanisms might actually be working, the whole "they are motivated by key-value stores" thing starts to look really bogus. Really it is that the attention layer allows for modeling correlations/similarities and/or multiplicative interactions among a dimension-reduced representation. EDIT: Or, as you say, it can be regarded as kernel smoothing.