> I want to see good/interesting work where the model is going off and doing its thing for multiple hours without supervision.
I'd be hesitant to use hours of unsupervised runtime as the measure. Different systems run at different speeds, so wall-clock time says little about how much work actually got done. I'd rather see how much it can get done before it breaks, across different scenarios.