I published a video that explains self-attention and multi-head attention in a different way: going from intuition, to math, to code, starting from the end result and walking backward to the actual method.
Hopefully this sheds light on this important topic in a way that is different from other approaches and provides the clarity needed to understand the Transformer architecture. The explanation starts at 41:22 in the video below.
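For reference, here is a minimal sketch of the "end result" that kind of walkthrough arrives at: scaled dot-product self-attention for a single head, softmax(QK^T / sqrt(d_k)) V. The variable names, shapes, and NumPy implementation below are my own illustration, not code from the video.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence (single head).

    X:          (seq_len, d_model) input embeddings
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices
    Returns:    (seq_len, d_k) attended representations
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted mix of value vectors

# Toy usage: 4 tokens, model dim 8, head dim 4 (hypothetical sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 4)
```

Multi-head attention simply runs several of these heads in parallel with their own projection matrices and concatenates the results; the video covers how and why that works.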