Hacker News

josalhor | yesterday at 6:24 PM | 7 replies

From GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!


Replies

verdverm | yesterday at 6:28 PM

We're also in benchmark-saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in its publications because internally they care far less about them than about making a model that works well day-to-day.

minimaxir | yesterday at 6:35 PM

Note that GPT 5.2 newly supports an "xhigh" reasoning level, which could explain the better benchmarks.

It'll be interesting to see the cost per task on ARC AGI v2.

causal | yesterday at 6:37 PM

That ARC AGI score is a little suspicious. That's a really tough benchmark for AI. Curious if there were improvements to the test harness, because that's a wild jump in general problem-solving ability for an incremental update.

fuddle | yesterday at 7:49 PM

I don't think SWE Verified is an ideal benchmark, since the solutions are in the training dataset.

poormathskills | yesterday at 6:29 PM

For a minor version update (5.1 -> 5.2), that's a far bigger improvement than I would have guessed.

catigula | yesterday at 6:35 PM

Yes, but it's not good enough. They needed to surpass Opus 4.5.

thinkingtoilet | yesterday at 6:52 PM

OpenAI has already been caught obtaining benchmark information and training its models on it. At this point, if you believe Sam Altman, I have a bridge to sell you.