I'm not following the whole LLM space, but
> the compute needed to perform matrix multiplications goes up as the cube of their size,
are they really not using even Strassen multiplication?
AFAIK the best practical matrix multiplication algorithms scale as roughly N^2.8, which is close enough to N^3 not to matter for the point I'm trying to make.
I'm not aware of any major BLAS library that uses Strassen's algorithm. There are a few reasons for this; one of the big ones is that Strassen has much worse numerical stability than traditional matrix multiplication. Another big one is that for very large dense matrices--which are multiplied with various flavors of parallel algorithms--Strassen vastly increases the communication overhead. Not to mention that the largest matrices are probably using sparse matrix arithmetic anyway, which is a whole different set of algorithms.
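For reference, here's a minimal sketch of what Strassen actually does (assuming square matrices whose size is a power of two; the M1..M7 names follow the standard textbook presentation, not any particular library). The trick is trading 8 recursive block multiplications for 7 at the cost of extra additions, which is also where the numerical-error and communication problems come from:

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen multiply; falls back to ordinary matmul below `cutoff`."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # 7 recursive multiplications instead of 8 -> O(n^log2(7)) ~ O(n^2.807),
    # at the cost of many extra additions (which hurt accuracy and locality)
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])
```

Note that real implementations need a cutoff like the one above anyway, since the constant factors make Strassen slower than tuned dense kernels on small blocks.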