It's complicated. Opus 4.5 is actually not that good at the 80% threshold but is above others at 50% threshold of completion. I read there's a single task around 16h that the model completed, and the broad CI comes from that.
METR currently simply runs out of tasks at 10-20h, and as a result you have a small N and lots of uncertainty there. (They fit a logistic to the discrete 0/1 results to get the thresholds you see in the graph.) They need new tasks, then we'll know better.
It's complicated. Opus 4.5 is actually not that good at the 80% threshold but is above others at 50% threshold of completion. I read there's a single task around 16h that the model completed, and the broad CI comes from that.
METR currently simply runs out of tasks at 10-20h, and as a result you have a small N and lots of uncertainty there. (They fit a logistic to the discrete 0/1 results to get the thresholds you see in the graph.) They need new tasks, then we'll know better.