I don't think that kind of difference in benchmarks has any meaning at all. Your agentic coding tool and the task you are working on introduce a lot more "noise" than that small delta.
Also consider that they're all overfitting on the benchmark itself, so there may be some of that as well (which can cut in either direction).
I consider the top models practically identical for coding applications (just from personal experience with heavy use of both GPT5.2 and Opus 4.5).
Excited to see how this model compares in real applications. It's a fifth of the price of the top models!