
andoando | last Thursday at 2:13 AM

I find it really confusing as well. The dictionary analogy implies we have something like Q[K] = V.

For one, I have no idea how this relates to the actual mathematical operations: computing the attention scores, applying softmax, and then taking the dot product with the V matrix.

Second, just conceptually, I don't understand how this relates to "a word looks up how relevant it is to another word". So if you have "The cat eats his soup", "his" queries how important it is to "cat". Is V then just the numerical result of that significance, like 0.99?
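
(For concreteness, a minimal single-head attention sketch in NumPy, with made-up random numbers rather than anything from a real model. The 0.99-style number is an attention *weight*, not V; V holds whole vectors that get blended by those weights:)

    import numpy as np

    # Toy single-head attention: 5 tokens ("The cat eats his soup"),
    # 4 dims each. Q, K, V are the projected token vectors, one row per
    # token; the values here are random stand-ins.
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))

    scores = Q @ K.T / np.sqrt(4)                  # pairwise relevance of every token to every other
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row sums to 1

    out = weights @ V                              # each output row is a weighted mix of value vectors

    # weights[3, 1] is the scalar "how much 'his' attends to 'cat'"
    # (the 0.99-style number); V[1] is not that score but the content
    # of "cat" that gets mixed into the new representation of "his".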

I don't think I'm very stupid, but after seeing dozens of these explanations, I'm starting to wonder if anyone actually understands this conceptually.


Replies

empiricus | last Thursday at 9:39 AM

Not sure how helpful this is, but: words and concepts are represented as high-dimensional vectors. At a high level, you could say each dimension encodes some concept, like "dog"-ness, "complexity", or "color"-ness.

"A word looks up how relevant it is to another word" is then basically just relevance = similarity = vector dot product. That dot product can also be distorted, so that some directions matter more than others for a given purpose; that is exactly what the Q/K/V matrices do. Softmax is just a form of normalization (everything sums to 1, so you get a proper probability distribution).

The whole shebang works only because all the pieces can be learned by gradient descent; otherwise it would be impossible to build.
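
(A minimal NumPy sketch of that "distorted dot product" idea; the matrix names and random values are hypothetical stand-ins, since a real model learns them by gradient descent:)

    import numpy as np

    # Token embeddings: 5 tokens, 8 dims each (random stand-ins).
    rng = np.random.default_rng(1)
    x = rng.normal(size=(5, 8))

    # Learned matrices that "distort" the dot product so some directions
    # count more for this purpose.
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv

    plain = x[0] @ x[1]      # undistorted relevance of token 0 to token 1
    learned = q[0] @ k[1]    # same pair through the learned lens:
                             #   x[0] @ Wq @ Wk.T @ x[1]

    scores = q @ k.T / np.sqrt(8)                  # all pairwise relevances
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row is a probability distribution
    out = weights @ v                              # weighted blend of value vectors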