Opus looks like a big jump from the previous leader (GPT 5.1), but when you switch from "50%" to "80%", GPT 5.1 still leads by a good margin. I'm not sure how much you can take from this; perhaps "5.1 is more reliable at slightly shorter stuff, choose Opus if you're trying to push the frontier in task length".
Yeah. Throwing away expensive tokens and limits 50% of the time is not ideal. But I bet by this time next year OSS models will be at that capability!