Hacker News

nubg yesterday at 8:02 PM

Any benchmarks?


Replies

gordonhart yesterday at 8:10 PM

The main frontier models are all up on https://arcprize.org/tasks

Barely any of them break 0% on any of the demo tasks. Claude Opus 4.6 comes out on top with a few scores under 3%, Gemini 3.1 Pro gets two nonzero scores, and the others (GPT-5.4 and Grok 4.20) get 0% across the board.
