I've been testing the big three (GPT-5, Claude Opus 4, Gemini 3.1) head-to-head on a real codebase migration this week.
Quick take: Gemini 3.1 Pro's long context is genuinely better now. I fed it a 200k-token codebase and it could reference files from the beginning without losing track, which was a real problem in 3.0.
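If you want to run the same kind of probe, here's a minimal sketch of one way to set it up: stitch the repo into one prompt with per-file headers, then ask about a file that landed near the start of the context. `call_model`, the `*.py` glob, and the 4-chars-per-token estimate are all placeholders, not anyone's real harness; swap in your own client and tokenizer.

```python
from pathlib import Path

def call_model(prompt: str) -> str:
    """Placeholder: wire this to whichever API client you're testing."""
    raise NotImplementedError("plug in your model client here")

def build_long_context_prompt(repo_root: str, budget_tokens: int = 200_000) -> str:
    """Concatenate source files into one prompt, stopping near the
    token budget (crude 4-chars-per-token estimate)."""
    parts, used = [], 0
    for path in sorted(Path(repo_root).rglob("*.py")):
        text = path.read_text(errors="ignore")
        est = len(text) // 4  # swap in a real tokenizer for accuracy
        if used + est > budget_tokens:
            break
        parts.append(f"### FILE: {path}\n{text}")
        used += est
    return "\n\n".join(parts)

def recall_probe(repo_root: str, early_file: str) -> str:
    """Ask about a file that appeared near the start of the context."""
    prompt = build_long_context_prompt(repo_root)
    question = (f"\n\nSummarize what {early_file} does and list the "
                "functions it defines, citing the file contents above.")
    return call_model(prompt + question)
```

The useful signal isn't whether you get an answer; it's whether the answer names real functions from the early file or plausible-sounding inventions.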
For pure code generation, though, Claude still edges it out at following complex multi-step instructions. Gemini tends to take shortcuts once a task has more than ~5 constraints.
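One way to put a number like ~5 on this is a mechanical check along these lines: hand the model a task with explicit numbered constraints, then count how many survive into the output. The specific constraints below are hypothetical stand-ins and the textual checks are deliberately crude; consistent crudeness is fine when all you care about is comparing models against each other.

```python
import re

# Hypothetical constraints for a migration task. Each pairs a
# human-readable rule with a cheap, imperfect textual check.
CONSTRAINTS = [
    ("use pathlib, never os.path",
     lambda out: "os.path" not in out),
    ("every def carries a return annotation",
     lambda out: len(re.findall(r"\bdef \w+\(", out))
                 == len(re.findall(r"\)\s*->", out))),
    ("no bare except clauses",
     lambda out: "except:" not in out),
    ("no TODO left in generated code",
     lambda out: "TODO" not in out),
]

def constraint_pass_rate(model_output: str) -> float:
    """Fraction of the stated constraints the output satisfies."""
    hits = sum(check(model_output) for _, check in CONSTRAINTS)
    return hits / len(CONSTRAINTS)
```

Grow the list one constraint at a time and track the pass rate per model; that makes any shortcut-taking visible instead of vibes-based.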
The exciting thing is how close they all are. Competition is working exactly as it should.