Don't get caught up in interpreting QKV; it's a waste of time, since completely different attention formulations (e.g. merged attention [1]) still give you the similarities / multiplicative interactions, and may even work better [2]. EDIT: Oh, and attention is much broader than scaled dot-product attention [3].
[1] https://www.emergentmind.com/topics/merged-attention
[2] https://blog.google/innovation-and-ai/technology/developers-...
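To make this concrete, here's a minimal numpy sketch (my own toy illustration, not the formulation from [1] or [2], and all names/shapes are made up): the Q.K^T similarities in standard scaled dot-product attention are algebraically the same thing as a single "merged" bilinear matrix W = Wq Wk^T applied to X, so the multiplicative interactions survive even when there are no separate Q and K projections anywhere.

    # Toy single-head example; shapes and names are illustrative assumptions.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    d_model, d_head, seq_len = 16, 8, 4
    X = rng.normal(size=(seq_len, d_model))
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

    # Standard view: explicit Q/K/V projections.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = (Q @ K.T) / np.sqrt(d_head)      # pairwise similarities
    out = softmax(scores) @ V                 # weighted mix of values, (seq_len, d_head)

    # "Merged" view: the same similarities from one low-rank bilinear matrix,
    # with no separate Q or K anywhere.
    W = Wq @ Wk.T                             # (d_model, d_model), rank <= d_head
    merged_scores = (X @ W @ X.T) / np.sqrt(d_head)
    print(np.allclose(scores, merged_scores)) # True

The point of the allclose check is that the split into Q and K is just one parameterization of that bilinear similarity, which is why I wouldn't over-interpret the individual projections.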
I glanced at these links, and it seems that all these attention variants still use QKV projections.
Do you see any issues with my interpretation of them?