I agree with your sentiment. This incremental evolution is getting difficult to feel when working with code, especially in large enterprise codebases. I would say that for the vast majority of tasks, the gap in tooling is much bigger than the gap in foundational model capability.
Also came to say the same thing. When Gemini 3 came out, several people asked me "Is it better than Opus 4.1?" but I could no longer answer. The difference is too hard to evaluate consistently across a range of tasks.