Unfortunately, model quality is not the only criterion for users, and often not even the most important one. Adoption is also driven by marketing, UX, integrations, pricing, ecosystem, and a lot of other non-benchmark factors.
Model providers also have little interest in having their models compared head-to-head under identical conditions. And “Model A is better than Model B” is almost meaningless by itself. Better for what task? With what prompt? What inputs? What budget? What failure tolerance?
It would be nice to have a place where users could run their own benchmarks, define evaluation criteria for their actual use cases, and make those runs verifiable by others.