I've been testing the big three (GPT-5, Claude Opus 4, Gemini 3.1) head-to-head on a real codebase migration this week.
Quick take: Gemini 3.1 Pro's long context is genuinely better now. I fed it a 200k-token codebase and it could reference files from the beginning without losing track, which was a real problem in 3.0.
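If you want to run the same kind of probe, here's a minimal sketch of one way to set it up: stitch the repo into one prompt with per-file headers, then ask about a file that landed near the start of the context. `call_model`, the `*.py` glob, and the 4-chars-per-token estimate are all placeholders, not anyone's real harness; swap in your own client and tokenizer.

```python
from pathlib import Path

def call_model(prompt: str) -> str:
    """Placeholder: wire this to whichever API client you're testing."""
    raise NotImplementedError("plug in your model client here")

def build_long_context_prompt(repo_root: str, budget_tokens: int = 200_000) -> str:
    """Concatenate source files into one prompt, stopping near the
    token budget (crude 4-chars-per-token estimate)."""
    parts, used = [], 0
    for path in sorted(Path(repo_root).rglob("*.py")):
        text = path.read_text(errors="ignore")
        est = len(text) // 4  # swap in a real tokenizer for accuracy
        if used + est > budget_tokens:
            break
        parts.append(f"### FILE: {path}\n{text}")
        used += est
    return "\n\n".join(parts)

def recall_probe(repo_root: str, early_file: str) -> str:
    """Ask about a file that appeared near the start of the context."""
    prompt = build_long_context_prompt(repo_root)
    question = (f"\n\nSummarize what {early_file} does and list the "
                "functions it defines, citing the file contents above.")
    return call_model(prompt + question)
```

The useful signal isn't whether you get an answer; it's whether the answer names real functions from the early file or plausible-sounding inventions.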
For pure code generation, though, Claude still edges it out at following complex multi-step instructions. Gemini tends to take shortcuts once a task has more than ~5 constraints.
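One way to put a number like ~5 on this is a mechanical check along these lines: hand the model a task with explicit numbered constraints, then count how many survive into the output. The specific constraints below are hypothetical stand-ins and the textual checks are deliberately crude; consistent crudeness is fine when all you care about is comparing models against each other.

```python
import re

# Hypothetical constraints for a migration task. Each pairs a
# human-readable rule with a cheap, imperfect textual check.
CONSTRAINTS = [
    ("use pathlib, never os.path",
     lambda out: "os.path" not in out),
    ("every def carries a return annotation",
     lambda out: len(re.findall(r"\bdef \w+\(", out))
                 == len(re.findall(r"\)\s*->", out))),
    ("no bare except clauses",
     lambda out: "except:" not in out),
    ("no TODO left in generated code",
     lambda out: "TODO" not in out),
]

def constraint_pass_rate(model_output: str) -> float:
    """Fraction of the stated constraints the output satisfies."""
    hits = sum(check(model_output) for _, check in CONSTRAINTS)
    return hits / len(CONSTRAINTS)
```

Grow the list one constraint at a time and track the pass rate per model; that makes any shortcut-taking visible instead of vibes-based.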
The exciting thing is how close they all are. Competition is working exactly as it should.