The models are non-deterministic. You can't just assume that because it did better before that it was on average better than before. And the variance is quite large.
No one talked about determinism. First it was able to do a task, second time not. It’s not that the implementation details changed.
No one talked about determinism. First it was able to do a task, second time not. It’s not that the implementation details changed.