No one talked about determinism. First it was able to do a task, second time not. It’s not that the implementation details changed.
This isn’t how you should be benchmarking models. You should give it the same task n times and see how often it succeeds and/or how long it takes to be successful (see also the 50% time horizon metric by METR).
This isn’t how you should be benchmarking models. You should give it the same task n times and see how often it succeeds and/or how long it takes to be successful (see also the 50% time horizon metric by METR).