After spending many hours optimizing some routines, I now think performance optimization is a great benchmark for identifying how generally smart an AI is at helping with a specific piece of code.
Solutions are quite easy to verify with differential testing and produce a number for direct comparison.
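To illustrate the verification loop, here's a minimal sketch of differential testing plus a timing number. The two functions and all names are hypothetical stand-ins, not from any real codebase: a trusted reference implementation and an "optimized" candidate are checked for agreement on random inputs, then the candidate is timed to produce the single comparison number.

```python
import random
import time

def reference_sum_sq(xs):
    # Trusted baseline: sum of squares, written for clarity.
    return sum(x * x for x in xs)

def optimized_sum_sq(xs):
    # Hypothetical "optimized" candidate under test.
    total = 0
    for x in xs:
        total += x * x
    return total

# Differential test: the candidate must agree with the
# reference on many randomized inputs, including edge sizes.
random.seed(0)
for _ in range(100):
    xs = [random.randint(-1000, 1000)
          for _ in range(random.randint(0, 64))]
    assert optimized_sum_sq(xs) == reference_sum_sq(xs)

# Benchmark: one number for direct comparison between versions.
data = list(range(100_000))
start = time.perf_counter()
optimized_sum_sq(data)
elapsed = time.perf_counter() - start
print(f"optimized: {elapsed:.6f}s")
```

In a real harness you'd run the benchmark many times and take the minimum or median, but the shape is the same: correctness is checked mechanically, and the score is a single scalar.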
Less code is usually better, and you generally can't "cheat" by adding more cruft, so it nullifies the additive bias. Good optimization requires significant understanding of the underlying structures. Everything has performance tradeoffs, so it demands systemic thinking rather than just stringing independent pieces together.
So far I've found that Gemini Pro 3 was the best at reasoning about tricky SIMD code, but the results with most models were pretty underwhelming.