If benchmarks didn’t exist we would have to invent them because “vibes” is a ridiculous idea: oh I know I’ll be super unscientific and horrendously biased and that’s far better than a team of experts carefully AND CONTINUALLY developing a variety of benchmarks of varying quality that…hmm all point to the same thing.
You can’t benchmaxx an eval that comes after your model release.
Consider also benchmaxxing makes no sense from an incentive structure: the quality of these models is directly correlated by how well you can measure true performance in the wild. If they were just stupidly benchmaxxing they would be unable to do trustworthy ablations or know how well the model will perform in their product.
Remember the famous case of asserted benchmaxxing from llama 4? The entire org was gutted and the ceo spent billions hiring better people. Every lab takes evaluations extremely seriously.
Vibes is just UX. There's whole careers, teams, and even industries dedicated to it, and yeah it isn't easy because you need aggregate data from people.
> You can’t benchmaxx an eval that comes after your model release
Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.