IIRC isn't the symmetry between Q and K also broken by the direction of the softmax? I mean, row vs column-wise application yields different interpretation.
Yes but in practice, if you compute K=X.wk, Q=X.wq and then K.tQ you make three matrice multiplication.
Wouldn't be faster to compute W=wk.twq beforhand and then just X.W.tX which will be just two matrices multiplication ?
Is there something I am missing ?
Yes but in practice, if you compute K=X.wk, Q=X.wq and then K.tQ you make three matrice multiplication. Wouldn't be faster to compute W=wk.twq beforhand and then just X.W.tX which will be just two matrices multiplication ? Is there something I am missing ?