I'm starting to think numbers like this are mostly slop lately.
FA achieving a 32.5% speedup? Cool.
Why not submit it as a PR to the Flash Attention repo then? Can I read about it more in detail?
Exactly, as a great dev once said: "talk is cheap, show me the code"
I assume the Gemini results are JAX/PAX-ML/Pallas improvements for TPUs, so I'd look there for recent PRs.
I have not read the linked article, but your comment reminded me of a discussion about a CUDA kernel speedup claimed by Sakana AI Labs. Ravid Shwartz-Ziv, a researcher at NYU, posted about it on LinkedIn [1], and here is the Twitter post of interest [2]:
""" Yesterday's news about Sakana AI Labs provided an important lesson for all of us working with AI agents. Their announcement of an AI system that could supposedly optimize CUDA kernels to run 100x faster initially seemed like exactly the kind of use cases we've been hoping for in AI-assisted development.
Like many others, I was excited about it. After all, isn't this exactly what we want AI to do - help us optimize and improve our technical systems?
However, careful investigation by the community (on Twitter) revealed a different story. What really happened? The AI-generated CUDA kernel appeared to achieve incredible speedups, but the code was inadvertently reusing memory buffers containing previous results, essentially bypassing the actual computation. When properly evaluated, the kernel actually runs about 3x slower than the baseline. """
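To see how a bug like that slips past a naive benchmark, here's a minimal sketch (plain NumPy standing in for CUDA; the kernel names and the harness are hypothetical, not Sakana's actual code). If the "optimized" kernel silently reuses an output buffer that a reference run already filled, a correctness check against that same data passes even though no computation happened:

```python
import numpy as np

def baseline_kernel(x, out):
    # Honest computation: writes fresh results into `out`.
    np.multiply(x, x, out=out)

def optimized_kernel(x, out):
    # Buggy "optimization": does no work at all. Whatever is
    # already sitting in `out` (stale results from an earlier
    # run) is passed off as this kernel's output.
    pass

x = np.random.rand(1_000_000)
out = np.empty_like(x)

# Reference run fills the shared buffer with correct results.
baseline_kernel(x, out)
reference = out.copy()

# Flawed harness: reuses the SAME buffer for the candidate kernel,
# so the stale contents match the reference and the check passes
# while the timer records an implausibly fast "kernel".
optimized_kernel(x, out)
print(np.allclose(out, reference))  # True, yet nothing was computed
```

A harness that zeroes (or randomizes) the output buffer before each candidate run, or checks against an independently computed reference, would have caught this immediately.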
[1] https://www.linkedin.com/posts/ravid-shwartz-ziv-8bb18761_ye...
[2] https://x.com/main_horse/status/1892473238036631908