I agree with your general gist: it’s a matter of “the best tool for the particular job”, keeping token spend and other things in mind as well.
What I do know for sure is that LLM benchmarks are not to be trusted; they are just a minor indicator, and real-world usage is often very different.
I share this sense, but my immediate thought is that we need to improve the evaluations! Do you think this is impossible? That there is something ineffable that cannot be captured empirically? I kind of have this intuitive sense that it is this way, but simultaneously I think that it's unlikely to really be true.
What would it take to have trustworthy benchmarks? As with all "targets", they can be gamed - but I am curious about quantifiable quality metrics.
Yes, how do we know Opus 4.8 hasn't been trained on the SWE-Bench examples?
With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.