In this case the code is public and you can see they are not cheating in that sense.

sanxiyn • today at 3:37 AM

Replies

I agree it's not cheating in that restricted sense. But I'm not really convinced that it can't be cheating in a more general sense. You can try something like 10^10 variations of harnesses and select the one that performs best. If you then look at it, it probably won't look like cheating. But you have biased the estimator by selecting the harness according to its measured score.
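To make the selection effect concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration: the function name, the numbers (1000 harness variants, 100 tasks, a shared true accuracy of 0.30), and the idea that every variant is equally good. None of it comes from the linked repo. It just shows that picking the best-scoring variant inflates the reported number even when no variant is actually better.

```python
import random


def simulate_selection_bias(num_harnesses=1000, num_tasks=100,
                            true_accuracy=0.30, fresh_tasks=10_000, seed=0):
    """Score many equally good (hypothetical) harnesses, then pick the best one."""
    rng = random.Random(seed)

    # Every harness has the SAME true accuracy; observed scores differ only by noise.
    scores = []
    for _ in range(num_harnesses):
        solved = sum(rng.random() < true_accuracy for _ in range(num_tasks))
        scores.append(solved / num_tasks)

    best = max(scores)                # the score you would report after selection
    mean = sum(scores) / len(scores)  # unbiased estimate of the true accuracy

    # Re-score the "winner" on fresh, unseen tasks: its edge was only noise,
    # so it falls back toward the true accuracy.
    fresh_solved = sum(rng.random() < true_accuracy for _ in range(fresh_tasks))
    fresh = fresh_solved / fresh_tasks
    return best, mean, fresh


if __name__ == "__main__":
    best, mean, fresh = simulate_selection_bias()
    print(f"mean over all harnesses:   {mean:.3f}")
    print(f"best harness (selected):   {best:.3f}")
    print(f"best harness, fresh tasks: {fresh:.3f}")
```

The gap between the selected score and the fresh-task score is the bias introduced by choosing the harness according to its measured value. It grows with the number of variants tried and shrinks as the evaluation set gets larger.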

SchemaLoad • today at 3:41 AM

Once the model has seen the questions and answers in the training stage, the questions are worthless. Only a test using previously unseen questions has merit.


bberrry • today at 9:19 AM

They are definitely cheating: they have crafted prompts[1] that explain the game rules rather than having the model explore and learn them.

1. https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...


DetroitThrow • today at 5:13 AM

The harness seems extremely benchmark-specific, which gives them a huge advantage over what most models can use. For that reason, this isn't a qualifying score.

Here is the ARC-AGI-3-specific harness, by the way. Lots of challenge information is encoded inside: https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
