Hacker News

PashaGo, yesterday at 12:40 PM

Unfortunately, model quality is not the only criterion for users, and often not even the most important one. Adoption is also driven by marketing, UX, integrations, pricing, ecosystem, and a lot of other non-benchmark factors.

Also, model providers have little interest in having their models compared head-to-head under identical conditions. And “Model A is better than Model B” is almost meaningless by itself. Better for what task? With what prompt? What inputs? What budget? What failure tolerance?

It would be nice to have a place where users could run their own benchmarks, define evaluation criteria for their actual use cases, and make those runs verifiable by others.
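
To make that a bit more concrete, here is a rough sketch of what a user-defined, checkable run could look like. Everything in it is hypothetical: `run_benchmark`, `run_digest`, `exact_match`, and `toy_model` are names made up for illustration, not any existing tool's API. It also assumes the model is deterministic; with sampling enabled, two honest runs would not produce the same digest, so a real system would need something more than this.

```python
import hashlib
import json


def run_benchmark(model, cases, criterion):
    """Run every case through `model` and score it with a user-supplied criterion."""
    results = []
    for case in cases:
        output = model(case["prompt"])
        results.append({
            "prompt": case["prompt"],
            "expected": case["expected"],
            "output": output,
            "passed": bool(criterion(output, case["expected"])),
        })
    return results


def run_digest(results):
    """Hash a canonical JSON encoding of the results so others can compare runs."""
    canonical = json.dumps(results, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def exact_match(output, expected):
    """Example criterion; a user would swap in whatever fits their use case."""
    return output == expected


def toy_model(prompt):
    """Stand-in for a real model call (hypothetical, deterministic)."""
    return prompt.upper()


cases = [
    {"prompt": "abc", "expected": "ABC"},
    {"prompt": "Hi", "expected": "hi"},
]

results = run_benchmark(toy_model, cases, exact_match)
digest = run_digest(results)
pass_rate = sum(r["passed"] for r in results) / len(results)
```

Anyone who reruns the same cases against the same deterministic model and gets the same `digest` has reproduced the run. The criterion is just a function, so "better for what task?" becomes something each user answers in code.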