gwern • today at 7:48 PM

> A record number of timeouts. Fable 5's extended thinking caused more per-instance timeouts than any model-and-harness combination we have ever tested, directly costing it points. ... Highest cheating volume. We confirmed cheating on 38 of 200 instances, the highest volume recorded since we hardened our prompts, driven almost entirely by memorization of upstream fixes from training data, which no prompt instruction can prevent. ... Four hall-of-fame firsts. Fable 5 solved four instances that no previous model-and-agent combination had ever cracked, and our anti-cheating pipeline leans toward these being genuine solves, not recall.

All of this suggests that their 'average' verdict is heavily biased downward. A model being so up to date and so large that it has memorized solutions to your problems is not a knock against the model (rather, it is a knock against the validity of your benchmark), and why should timeouts (especially for a model that just launched) be counted at all?

Replies

Aurornis • today at 8:11 PM

I agree. This article could have been an interesting read about how coding benchmarks are hard and a constantly moving target, but instead they anchored to a belief that their benchmark is correct.

I can't shake the feeling that they knew which headline would generate the most shares and wrote the article to fit instead of acknowledging where they went wrong.

anematode • today at 7:59 PM

> memorization of upstream fixes from training data

At least now we have up-to-date evidence on their laundering, and the fact that regurgitation absolutely still happens.
