The issue with LLM benchmarks is similar to the one with car reviews. E.g. journalists almost always get the fully equipped model, so their review is honest but sort of rigged.
I haven’t seen the details of the LLM benchmark datasets, but I would suppose the “questions” are public and so known in advance, which means you can tune a model to them as much as you like.
One of the real benchmarks is drawing a pelican riding a bicycle - https://github.com/simonw/pelican-bicycle - Simon Willison made it for his LLM tests.
If you really want to find a model that works for your specific purpose, I would recommend several rounds at arena.ai - you compare answers without knowing which model wrote them, so it helps you pick a model without confirmation bias.
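If you'd rather run that kind of blind test on your own prompts, here is a rough sketch of the idea in Python. It is not how arena.ai works internally, just the same principle: the judge sees two answers without the model names. `blind_rounds`, `models` and `judge` are names I made up; in practice each entry in `models` would wrap a call to a real model, and `judge` would be you reading both answers and picking one.

```python
import random
from collections import Counter


def blind_rounds(prompts, models, judge, seed=0):
    """Run one blind A/B round per prompt and tally wins per model name.

    models: dict of model name -> callable(prompt) -> answer (hypothetical stand-ins)
    judge:  callable(prompt, answer_a, answer_b) -> "a" or "b"; never sees the names
    """
    rng = random.Random(seed)
    wins = Counter()
    names = sorted(models)
    for prompt in prompts:
        # Pick two different models; which one lands on side "a" is random.
        left, right = rng.sample(names, 2)
        choice = judge(prompt, models[left](prompt), models[right](prompt))
        wins[left if choice == "a" else right] += 1
    return wins
```

Feed it the prompts you actually use at work, in your language and your style of asking, and the tally tells you more about your use case than a leaderboard position does.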
Some ppl: Claude is the best! Others to them: but Qwen is the best! Or… Codex is better! …
It all depends on the language (English, Dutch, French…) and the style of querying (caveman, specs, skills, goal etc.).
Even with the same model I get different answers to the same prompt when it is just tweaked a little.
So benchmarks are nice but mostly useless.
Without your use case, a score is just a reference number indicating the approximate position of that model among the others. And for those who want to make money, it is a marketing tool to sell more, as every customer counts.