Compared to GPT 5.1 Thinking:
ARC AGI v2: 17.6% -> 52.9%
SWE-bench Verified: 76.3% -> 80%
That's pretty good!
Note that GPT 5.2 newly supports an "xhigh" reasoning level, which could explain the better benchmarks.
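For context, here's a minimal sketch of how a reasoning level is selected via the OpenAI Python SDK. The "xhigh" value and the "gpt-5.2" model name are assumptions taken from this thread, not confirmed against official docs:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # "effort" caps how many reasoning tokens the model may spend before
    # answering; "xhigh" is the new top level claimed above (unverified).
    response = client.responses.create(
        model="gpt-5.2",  # assumed model identifier
        reasoning={"effort": "xhigh"},
        input="Solve this ARC-style grid puzzle: ...",
    )
    print(response.output_text)

Higher effort levels generally trade more output tokens (and therefore cost and latency) for better answers, which is why the cost question below matters.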
It'll be interesting to see the cost-per-task on ARC AGI v2.
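Back-of-the-envelope, cost-per-task is just token spend times price, divided across tasks. A sketch with entirely hypothetical numbers; real prices and token counts would come from OpenAI's pricing page and the ARC AGI leaderboard:

    # All figures below are made up for illustration only.
    input_tokens_per_task = 8_000
    output_tokens_per_task = 60_000   # reasoning-heavy runs emit many tokens
    price_in_per_mtok = 1.25          # USD per 1M input tokens (assumed)
    price_out_per_mtok = 10.00        # USD per 1M output tokens (assumed)

    cost_per_task = (
        input_tokens_per_task / 1e6 * price_in_per_mtok
        + output_tokens_per_task / 1e6 * price_out_per_mtok
    )
    print(f"${cost_per_task:.2f} per task")  # -> $0.61 per task

At an "xhigh" effort level the output-token term dominates, so a big ARC score bought with many times the tokens is a very different result than the same score at the old cost.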
That ARC AGI score is a little suspicious. That's a really tough benchmark for AI. Curious whether there were improvements to the test harness, because that's a wild jump in general problem-solving ability for an incremental update.
I don't think SWE-bench Verified is an ideal benchmark, as the solutions are in the training data.
For a minor version update (5.1 -> 5.2), that's a way bigger improvement than I would have guessed.
Yes, but it's not good enough. They needed to surpass Opus 4.5.
OpenAI has already been busted for obtaining benchmark data and training its models on it. At this point, if you believe Sam Altman, I have a bridge to sell you.
We're also in benchmark-saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in its publications because internally they care less about them than about making a model that works well day-to-day.