This isn’t how you should be benchmarking models. You should give the model the same task n times and measure how often it succeeds and/or how long it takes to succeed (see also the 50% time horizon metric by METR).
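Concretely, something like this minimal sketch, where run_task and its built-in success check are hypothetical stand-ins for however you drive and verify the model:

    import time

    def benchmark(run_task, n=10, timeout_s=600):
        # run_task() is a hypothetical callable: one attempt at the task,
        # returning True iff an automatic check says the result works.
        successes, times = 0, []
        for _ in range(n):
            start = time.monotonic()
            ok = run_task()
            elapsed = time.monotonic() - start
            if ok and elapsed <= timeout_s:
                successes += 1
                times.append(elapsed)
        median = sorted(times)[len(times) // 2] if times else None
        return successes / n, median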
I did not say that I only ran the prompt once per attempt. When I say it failed the second time, I mean that I spent hours restarting, clearing context, and giving hints, doing everything I could to help the model produce something that works.
I was pretty disappointed to learn that the METR metric isn't actually measuring a model's ability to complete long-duration tasks; it rates tasks by the estimated time a human would take to complete them. But that did explain my growing bafflement at how the METR line keeps steadily climbing despite my daily experience coding with LLMs, where they still frequently struggle to work independently for ten minutes without veering off task after hitting a minor roadblock.
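For reference, the 50% time horizon in [1] comes from fitting success probability against the log of the estimated human task time and reading off where the fit crosses 50%. Roughly like this (the task data here is made up):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Made-up data: estimated human minutes per task, and whether the model succeeded.
    human_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480])
    succeeded     = np.array([1, 1, 1,  1,  0,  1,   0,   0,   0])

    X = np.log2(human_minutes).reshape(-1, 1)  # success modeled against log task length
    fit = LogisticRegression().fit(X, succeeded)

    # The 50% horizon is where the fitted logit w*x + b crosses zero.
    w, b = fit.coef_[0][0], fit.intercept_[0]
    print(f"50% time horizon ~ {2 ** (-b / w):.0f} human-minutes")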
[1] https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...