The best benchmark is one that you build for your own use case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use cases, but if you have something specific you're optimizing for, there's probably a more obscure model that just does a better job.
If you or anyone else has insights to share on structuring that kind of benchmark, I'm all ears.
There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.
The answer may be that it's so bespoke you have to hand-roll it every time, but my gut says there's a set of best practices that are generally applicable.
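To make the "evaluate them repeatedly" part concrete, here is a rough sketch of the shape I have in mind, not a recommendation. Everything in it is hypothetical: `Case`, `exact_match`, and `run_benchmark` are names made up for illustration, and the "models" are stub functions standing in for whatever API client you would actually call. The idea is just to keep the test cases and the scorer fixed so that a new model only needs a new entry in the dict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    """One benchmark item: a prompt and the answer we expect back."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 on a case-insensitive, whitespace-trimmed match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_benchmark(
    models: Dict[str, Callable[[str], str]],
    cases: List[Case],
    scorer: Callable[[str, str], float] = exact_match,
) -> Dict[str, float]:
    """Run every model over every case and return the mean score per model."""
    if not cases:
        raise ValueError("need at least one case")
    results = {}
    for name, generate in models.items():
        total = sum(scorer(generate(c.prompt), c.expected) for c in cases)
        results[name] = total / len(cases)
    return results


# Assumption: stub "models" so the sketch runs offline. In practice each
# callable would wrap a real client call for the model being evaluated.
cases = [Case("2+2=", "4"), Case("Capital of France?", "Paris")]
answers = {"2+2=": "4", "Capital of France?": "paris"}
models = {
    "always_four": lambda prompt: "4",
    "lookup": lambda prompt: answers.get(prompt, ""),
}

scores = run_benchmark(models, cases)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

Exact match is almost certainly too crude for a real use case. The scorer is the part I would expect to be bespoke, while the loop around it seems like the reusable bit.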