So if I try to do it with Opus three or four times, I'll get it done? And probably in about 10 minutes? Awesome
Nope, these are not random dice rolls. Some tasks are solved on every run, a few only occasionally (for those it would be meaningful to try a few times, and the pass@1 and pass@3 metrics would differ), but most are never solved.
See e.g.: https://quesma.com/benchmarks/otel/models/claude-opus-4.5/
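To make the pass@1 vs pass@3 point concrete, here is a rough sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for a task with c successes in n recorded runs. The task names and success counts are made up for illustration and are not taken from the benchmark.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k runs, sampled without
    replacement from n recorded runs with c successes, is a success."""
    if n - c < k:
        # Fewer than k failures exist, so any k runs include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical results: successes out of 5 runs per task (illustrative only).
N_RUNS = 5
tasks = {"always_solved": 5, "sometimes_solved": 2, "never_solved": 0}

for name, successes in tasks.items():
    p1 = pass_at_k(N_RUNS, successes, 1)
    p3 = pass_at_k(N_RUNS, successes, 3)
    print(f"{name}: pass@1={p1:.2f} pass@3={p3:.2f}")
```

Only the "sometimes" task moves between pass@1 and pass@3. A task that is never solved stays at 0 no matter how many retries you allow, which is why retrying three or four times does not help for most of them.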
That’s only if the failures are truly random and aren’t correlated