> As we're talking about google here, we can be pretty sure it wasn't some trivial python trick that they missed.
Strong disagree on the reasoning here. Especially since google is big and have thousands of developers, there could be a lot of code and a lot of low hanging fruit.
> By finding smarter ways to divide a large matrix multiplication operation into more manageable subproblems, it sped up this vital kernel in Gemini’s architecture by 23%, leading to a 1% reduction in Gemini's training time.
The message I replied to said "if I have some toy poorly optimized python example". I think it's safe to say that matmul & kernel optimisation is a bit beyond a small python example.