Seconding this. The terms "Query" and "Value" are largely arbitrary and meaningless in practice. Look at how attention is implemented in PyTorch and you'll see these are just weight matrices that implement a projection of sorts, and the layer is invariably called as attention(x, x, x) for self-attention, or attention(x, y, y) for cross-attention, where x and y are outputs from previous layers.
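A minimal single-head sketch of what I mean (class and variable names are my own, not from any particular codebase):

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head attention: 'query', 'key', and 'value' are just learned linear projections."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_head, bias=False)  # the "query" weight matrix
        self.w_k = nn.Linear(d_model, d_head, bias=False)  # the "key" weight matrix
        self.w_v = nn.Linear(d_model, d_head, bias=False)  # the "value" weight matrix

    def forward(self, q_in, k_in, v_in):
        q, k, v = self.w_q(q_in), self.w_k(k_in), self.w_v(v_in)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # scaled dot products
        return scores.softmax(dim=-1) @ v                      # weighted sum of "values"

attn = SelfAttention(d_model=512, d_head=64)
x = torch.randn(2, 10, 512)  # (batch, seq, d_model), output of some previous layer
out = attn(x, x, x)          # self-attention: all three arguments are the same tensor
```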
Plus with different forms of attention (e.g. merged attention) and the research into why and how attention mechanisms actually work, the whole "they are motivated by key-value stores" framing starts to look pretty bogus. What the attention layer really does is let the model capture correlations and/or multiplicative interactions among a dimension-reduced representation.
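To make that concrete, here's a small sketch (the dimensions and names are made up): the pre-softmax score matrix is a single bilinear form in the inputs, and the separate "query"/"key" projections are just a low-rank factorization of it.

```python
import torch

d_model, d_head, seq = 8, 2, 4
x = torch.randn(seq, d_model, dtype=torch.float64)
w_q = torch.randn(d_model, d_head, dtype=torch.float64)
w_k = torch.randn(d_model, d_head, dtype=torch.float64)

# The usual two-projection score computation...
scores = (x @ w_q) @ (x @ w_k).T

# ...collapses into one bilinear form x W x^T, where W has rank <= d_head:
# pairwise multiplicative interactions between token vectors, nothing more.
w = w_q @ w_k.T  # (d_model, d_model)
assert torch.allclose(scores, x @ w @ x.T)
```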
Do you think the dimension reduction is necessary? Or is it just practical, given current hardware constraints?
>the terms "Query" and "Value" are largely arbitrary and meaningless in practice
This is the most confusing thing about it imo. Those words all mean something in ordinary usage, but here they're just more matrix multiplications. Nothing is actually being searched for.