I might be completely off base, but I can't help using convolutions as my mental model for the Q/K/V mechanism. Attention shares a convolution kernel's property of being trained independently of position; it learns how to translate a large, rolling portion of the input into a new "digested" value; and you can train multiple in parallel so that they learn to focus on different aspects of the input ("kernels" in the case of convolution, "heads" in the case of attention).
I think there are two key differences though: 1) Attention doesn't use fixed, distance-dependent weights for the aggregation; instead the weights become "semantically dependent", based on the association between each query and key. 2) A single convolution step is a local operation (only pulling from nearby pixels), whereas attention is a "global" operation, pulling from the hidden states of all previous tokens. (Maybe sliding-window attention schemes muddy this distinction, but in general the degree of connectivity is far higher.)
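To make the contrast concrete, here's a rough NumPy sketch (my own, with made-up sizes and random weights, not anything from a real model): the convolution mixes a local window with the same fixed kernel at every position, while attention recomputes its mixing weights per position from query/key similarity and can pull from every token.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                       # sequence length, hidden size
x = rng.normal(size=(T, d))       # token representations

# Convolution: local window, identical fixed weights at every position.
kernel = np.array([0.25, 0.5, 0.25])          # fixed, distance-dependent
padded = np.pad(x, ((1, 1), (0, 0)))          # pad the time axis
conv_out = np.stack([kernel @ padded[t:t + 3] for t in range(T)])

# Attention: weights derived from q/k association, aggregation is global.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)                 # "semantic" similarity
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True) # row-wise softmax
attn_out = weights @ v

print(conv_out.shape, attn_out.shape)         # both (6, 4)
```

Note the conv weights never change across positions, while each row of `weights` is a fresh, input-dependent distribution over all T tokens.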
There might be some unifying way to look at things, though, maybe GNNs. I found this talk [1], and at 4:17 it shows how convolution and attention can each be modeled in a GNN formalism.
[1] https://www.youtube.com/watch?v=J1YCdVogd14