> By finding smarter ways to divide a large matrix multiplication operation into more manageable subproblems, it sped up this vital kernel in Gemini’s architecture by 23%, leading to a 1% reduction in Gemini's training time.
The message I replied to said "if I have some toy poorly optimized python example". I think it's safe to say that matmul and kernel optimisation are a bit beyond a small Python example.
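For what it's worth, the *textbook* idea of "dividing a large matrix multiplication into subproblems" (blocked/tiled matmul) does fit in a few lines of NumPy. This is only a sketch of the general concept with an arbitrary block size, nothing like the hardware-aware kernel tuning the quote is describing, which is exactly the point:

```python
# Minimal sketch of blocked (tiled) matrix multiplication in NumPy.
# Block size is arbitrary here; real kernel optimisation picks tiles
# to match cache/accelerator memory, which a toy example can't show.
import numpy as np

def blocked_matmul(A, B, block=64):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    # Walk over tiles of the output and accumulate partial products.
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                C[i:i+block, j:j+block] += (
                    A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
                )
    return C

# Quick check against NumPy's own matmul.
A = np.random.rand(200, 300)
B = np.random.rand(300, 150)
assert np.allclose(blocked_matmul(A, B), A @ B)
```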