It's also that the price-value frontier differs by use-case. For many of the things I was doing, I could improve the harness enough for DS V4 Flash to catch up with GPT-5.5 or Claude Sonnet in performance, but that only holds for those use-cases. And if I'm being honest, this kind of eval doesn't need someone else: Claude and I can build a framework on a per-task basis.
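
To make the per-task idea concrete, here is a minimal sketch of what such an eval could look like. Everything in it is an assumption for illustration: the stub "models", the per-call costs, the toy test cases, and the names `evaluate` and `frontier` are all hypothetical, and none of the numbers are real measurements of any actual model. It scores each configuration on one task's cases, then keeps only the configurations that no other one beats on both cost and score.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One test case: a prompt plus a checker for the model's output.
Case = Tuple[str, Callable[[str], bool]]


@dataclass(frozen=True)
class Result:
    name: str
    cost: float   # total cost over the task's cases (arbitrary units)
    score: float  # fraction of cases passed


def evaluate(name: str, model: Callable[[str], str],
             cases: List[Case], cost_per_call: float) -> Result:
    """Run one model configuration over one task's cases."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return Result(name, cost_per_call * len(cases), passed / len(cases))


def frontier(results: List[Result]) -> List[Result]:
    """Keep results that no other result dominates on (cost, score)."""
    def dominated(r: Result) -> bool:
        return any(
            o.cost <= r.cost and o.score >= r.score
            and (o.cost < r.cost or o.score > r.score)
            for o in results
        )
    return sorted((r for r in results if not dominated(r)),
                  key=lambda r: r.cost)


# Hypothetical task: toy arithmetic with a strict exact-match checker.
ANSWERS = {"2+2": "4", "3*3": "9", "10-7": "3", "8/2": "4"}
cases: List[Case] = [
    (p, lambda out, a=a: out.strip() == a) for p, a in ANSWERS.items()
]


def small_raw(prompt: str) -> str:
    # Stub cheap model: right answer, but sometimes in a sloppy format.
    a = ANSWERS[prompt]
    return a if len(prompt) == 3 else f"Answer: {a}"


def small_harnessed(prompt: str) -> str:
    # "Harness improvement": post-process the output into the expected form.
    return small_raw(prompt).split()[-1]


def large(prompt: str) -> str:
    # Stub expensive model: always answers in the expected form.
    return ANSWERS[prompt]


results = [
    evaluate("small-raw", small_raw, cases, cost_per_call=1.0),
    evaluate("small+harness", small_harnessed, cases, cost_per_call=1.0),
    evaluate("large", large, cases, cost_per_call=10.0),
]

for r in frontier(results):
    print(f"{r.name}: cost={r.cost}, score={r.score}")
```

In this toy setup the harnessed cheap configuration ends up alone on the frontier, but that is purely a property of the made-up task. Swap in a different set of cases and the frontier can look completely different, which is the reason for building it per task.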