Seconding this. The terms "Query" and "Value" are largely arbitrary and meaningless in practice. Look at how attention is implemented in PyTorch and you'll see these are just weight matrices that implement a projection of sorts, and the layer is invariably called as attention(x, x, x) for self-attention, or attention(x, y, y) for cross-attention, where x and y are outputs from previous layers.
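A minimal single-head sketch of what I mean (class and variable names are my own, not from any particular codebase):

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head attention: 'query', 'key', and 'value' are just learned linear projections."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_head, bias=False)  # the "query" weight matrix
        self.w_k = nn.Linear(d_model, d_head, bias=False)  # the "key" weight matrix
        self.w_v = nn.Linear(d_model, d_head, bias=False)  # the "value" weight matrix

    def forward(self, q_in, k_in, v_in):
        q, k, v = self.w_q(q_in), self.w_k(k_in), self.w_v(v_in)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # scaled dot products
        return scores.softmax(dim=-1) @ v                      # weighted sum of "values"

attn = SelfAttention(d_model=512, d_head=64)
x = torch.randn(2, 10, 512)  # (batch, seq, d_model), output of some previous layer
out = attn(x, x, x)          # self-attention: all three arguments are the same tensor
```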
Plus with different forms of attention (e.g. merged attention) and the research into why and how attention mechanisms actually work, the whole "they are motivated by key-value stores" framing starts to look pretty bogus. What the attention layer really does is let the model capture correlations and/or multiplicative interactions among a dimension-reduced representation.
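To make that concrete, here's a small sketch (the dimensions and names are made up): the pre-softmax score matrix is a single bilinear form in the inputs, and the separate "query"/"key" projections are just a low-rank factorization of it.

```python
import torch

d_model, d_head, seq = 8, 2, 4
x = torch.randn(seq, d_model, dtype=torch.float64)
w_q = torch.randn(d_model, d_head, dtype=torch.float64)
w_k = torch.randn(d_model, d_head, dtype=torch.float64)

# The usual two-projection score computation...
scores = (x @ w_q) @ (x @ w_k).T

# ...collapses into one bilinear form x W x^T, where W has rank <= d_head:
# pairwise multiplicative interactions between token vectors, nothing more.
w = w_q @ w_k.T  # (d_model, d_model)
assert torch.allclose(scores, x @ w @ x.T)
```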
Do you think the dimension reduction is necessary? Or is it just practical, given current hardware constraints?
>the terms "Query" and "Value" are largely arbitrary and meaningless in practice
This is the most confusing thing about it imo. Those words all mean something in ordinary usage, but here they're just more matrix multiplications. Nothing is actually being searched for.