Compared to GPT 5.1 Thinking:
ARC AGI v2: 17.6% -> 52.9%
SWE-bench Verified: 76.3% -> 80%
That's pretty good!
Note that GPT 5.2 newly supports an "xhigh" reasoning level, which could explain the better benchmarks.
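For context, here's a minimal sketch of how a reasoning level is selected via the OpenAI Python SDK. The "xhigh" value and the "gpt-5.2" model name are assumptions taken from this thread, not confirmed against official docs:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # "effort" caps how many reasoning tokens the model may spend before
    # answering; "xhigh" is the new top level claimed above (unverified).
    response = client.responses.create(
        model="gpt-5.2",  # assumed model identifier
        reasoning={"effort": "xhigh"},
        input="Solve this ARC-style grid puzzle: ...",
    )
    print(response.output_text)

Higher effort levels generally trade more output tokens (and therefore cost and latency) for better answers, which is why the cost question below matters.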
It'll be interesting to see the cost-per-task on ARC AGI v2.
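Back-of-the-envelope, cost-per-task is just token spend times price, divided across tasks. A sketch with entirely hypothetical numbers; real prices and token counts would come from OpenAI's pricing page and the ARC AGI leaderboard:

    # All figures below are made up for illustration only.
    input_tokens_per_task = 8_000
    output_tokens_per_task = 60_000   # reasoning-heavy runs emit many tokens
    price_in_per_mtok = 1.25          # USD per 1M input tokens (assumed)
    price_out_per_mtok = 10.00        # USD per 1M output tokens (assumed)

    cost_per_task = (
        input_tokens_per_task / 1e6 * price_in_per_mtok
        + output_tokens_per_task / 1e6 * price_out_per_mtok
    )
    print(f"${cost_per_task:.2f} per task")  # -> $0.61 per task

At an "xhigh" effort level the output-token term dominates, so a big ARC score bought with many times the tokens is a very different result than the same score at the old cost.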
That ARC AGI score is a little suspicious. That's a really tough benchmark for AI. Curious whether there were improvements to the test harness, because that's a wild jump in general problem-solving ability for an incremental update.
I don't think SWE-bench Verified is an ideal benchmark, as the solutions are in the training data.
For a minor version update (5.1 -> 5.2), that's a way bigger improvement than I would have guessed.
Yes, but it's not good enough. They needed to surpass Opus 4.5.
OpenAI has already been busted for obtaining benchmark data and training its models on it. At this point, if you believe Sam Altman, I have a bridge to sell you.
We're also in benchmark-saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in its publications because internally they care less about them than about making a model that works well day-to-day.