More information on OpenAI's result (which seems better than DeepMind's) from the X thread:
> our OpenAI reasoning system got a perfect score of 12/12
> For 11 of the 12 problems, the system’s first answer was correct. For the hardest problem, it succeeded on the 9th submission. Notably, the best human team achieved 11/12.
> We had both GPT-5 and an experimental reasoning model generating solutions, and the experimental reasoning model selecting which solutions to submit. GPT-5 answered 11 correctly, and the last (and most difficult problem) was solved by the experimental reasoning model.
I'm assuming that "GPT-5" here is a version with the same model weights but higher compute limits than even GPT-5 Pro, with many instances working in parallel, plus some specific scaffolding and prompts. Still, it's extremely impressive to outperform the best human team. The stat I'd really like to see is how much it would cost to reproduce this result through their API (with a realistic price for the "experimental reasoning model").
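For a back-of-envelope version of that calculation, here's a minimal sketch. Every number in it is a hypothetical placeholder (token counts, instance counts, and per-token prices), since OpenAI hasn't published any of those figures for this run:

```python
# Rough API cost estimate for one contest run.
# All constants are made-up placeholders, NOT published figures.

PRICE_PER_1M_INPUT_TOKENS = 15.00    # assumed USD rate, placeholder
PRICE_PER_1M_OUTPUT_TOKENS = 60.00   # assumed USD rate, placeholder

def estimate_cost(problems: int,
                  parallel_instances: int,
                  input_tokens_per_attempt: int,
                  output_tokens_per_attempt: int) -> float:
    """Rough total cost in USD for one full contest run."""
    attempts = problems * parallel_instances
    input_cost = attempts * input_tokens_per_attempt / 1e6 * PRICE_PER_1M_INPUT_TOKENS
    output_cost = attempts * output_tokens_per_attempt / 1e6 * PRICE_PER_1M_OUTPUT_TOKENS
    return input_cost + output_cost

# Example: 12 problems, 100 parallel instances each, ~20k input and
# ~100k reasoning/output tokens per attempt (all invented numbers).
print(f"${estimate_cost(12, 100, 20_000, 100_000):,.2f}")  # -> $7,560.00
```

Even with generous guesses the dominant term is the reasoning/output tokens, which is exactly the number OpenAI hasn't disclosed.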
> it succeeded on the 9th submission
What were the judging rules here? Was it within the allotted contest time, or just "try as often as you need"?
Ha, so true. I was so tempted to copy and paste a problem into GPT-5 and see what it would say.