ripbozo • yesterday at 4:29 PM • 4 replies

Does the ARC-AGI-2 score more than doubling in a .1 release indicate benchmark-maxing? Though I don't know what ARC-AGI-2 actually tests.

Replies

maxall4 • yesterday at 4:52 PM

Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspicious of such a large improvement, especially since the improvement on other benchmarks is not of the same order.

boplicity • yesterday at 4:53 PM

Benchmark maxing could be interpreted as benchmarks actually being a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.

energy123 • yesterday at 6:16 PM

Francois Chollet accuses the big labs of targeting the benchmark, yes. It is benchmaxxed.

blinding-streak • yesterday at 4:43 PM

I assume all the frontier models are benchmaxxing, so it would make sense.
