ActorNightly • yesterday at 8:56 PM

People need to get away from this idea of Key/Query/Value as being special.

Whereas a standard deep layer in a network is matrix * input, where each row of the matrix holds the weights of a particular neuron in the next layer, a transformer is basically input*MatrixA, input*MatrixB, input*MatrixC (where the input is a stack of vectors, so each product is itself a matrix), then the output is C*MatrixA*MatrixB*MatrixC. It's simply more dimensions in a layer.
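To make the three products concrete, here is a minimal NumPy sketch of a single attention head, assuming the standard scaled dot-product formulation. Mapping MatrixA/MatrixB/MatrixC onto `W_q`/`W_k`/`W_v` is my assumption, and the shapes and random weights are purely illustrative. Everything is a matrix multiplication except the row-wise softmax in the middle.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, W_q, W_k, W_v):
    # Three plain matrix products: input*MatrixA, input*MatrixB, input*MatrixC.
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    # Pairwise scores between positions, scaled by sqrt(head dimension).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax: the one step here that is not a matmul.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted average of the rows of V.
    return weights @ V

# Illustrative sizes and random weights (hypothetical, not from any real model).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 3
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

out = single_head_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 3)
```

The input here is a `(seq_len, d_model)` matrix with one row per token, which is why each of the three products is a matrix rather than a vector.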

And consequently, you can represent the entire transformer architecture with a set of deep layers as you unroll the matrices, with a lot of zeros for the multiplication pieces that are not needed.
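A small sketch of what "unrolling with a lot of zeros" can look like, covering only one projection step (input*MatrixA) and not the whole architecture. It assumes the sequence is flattened into one long vector, and builds the equivalent single weight matrix with `np.kron`. The names and sizes are illustrative.

```python
import numpy as np

# Illustrative sizes and random weights (hypothetical).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 3
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_head))

# Unroll the sequence into one long input vector, position after position.
x_flat = X.reshape(-1)                          # shape (seq_len * d_model,)

# One big "ordinary layer" matrix: the same weights repeated on the block
# diagonal, with zeros wherever one position would touch another.
W_unrolled = np.kron(np.eye(seq_len), W_q.T)    # shape (12, 32)

flat_out = W_unrolled @ x_flat

# Same numbers as the per-position projection X @ W_q, just flattened.
print(np.allclose(flat_out, (X @ W_q).reshape(-1)))  # True

# Fraction of entries in the unrolled matrix that are zero: 1 - 1/seq_len.
zero_fraction = np.mean(W_unrolled == 0)
print(zero_fraction)  # 0.75
```

The zero fraction grows with sequence length, which gives a feel for how sparse the unrolled form gets.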

This is a fairly complex blog post, but it shows that it's just matrix multiplication all the way down: https://pytorch.org/blog/inside-the-matrix/

Replies

throw310822 • yesterday at 9:11 PM

I might be completely off track, but I can't help thinking of convolutions as my mental model for the K/Q/V mechanism. Like a convolution kernel, attention is trained independently of position; it learns how to translate a large, rolling portion of an input into a new "digested" value; and you can train multiple ones in parallel so that they learn to focus on different aspects of the input ("kernels" in the case of convolution, "heads" in the case of attention).
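One way to poke at the "independent of position" part: a rough NumPy sketch, assuming a single self-attention head with no positional encoding added to the input. The same weight matrices are applied at every position, so shuffling the input rows just shuffles the output rows the same way. The function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    # The same three weight matrices are shared by every position.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

# Illustrative sizes and random weights (hypothetical).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 3
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

# Shuffle the order of the positions in the input.
perm = rng.permutation(seq_len)

out = attention(X, W_q, W_k, W_v)
out_perm = attention(X[perm], W_q, W_k, W_v)

# Shuffled input gives the same outputs, shuffled the same way.
print(np.allclose(out_perm, out[perm]))  # True
```

A convolution kernel shares its weights across positions too, but only over a local window and with order preserved, so the analogy holds for weight sharing more than for the shape of the receptive field.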
