Benchmarks for LLMs without complete information about the tested models are hard to interpret.
For the OpenAI and Anthropic models, it is clear that they were run by their owners, but the other models can be served in a great number of ways, some running the full model and some only a quantized variant, with very different performance.
For instance, the model list contains both "moonshotai/kimi-k2.6" and "kimi-k2.6", with very different results, but there is no information about what the difference is between these two labels, which refer to the same LLM.
Moreover, as others have said, such a benchmark does not prove that a certain cheaper model cannot solve a problem. It happened not to solve it within the benchmark, but running it multiple times, possibly with adjusted prompts, may still solve it.
While running commercial models many times can be too expensive, when you run an LLM locally you can afford many more runs than when you are worried about the token price or about hitting subscription limits.
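To put a rough number on the retry argument, here is a minimal sketch. It assumes every attempt succeeds independently with the same probability p, which is a simplification (real retries with the same prompt are correlated), and the function name and the 10% figure are made up for illustration, not taken from the benchmark.

```python
def prob_at_least_one_success(p: float, k: int) -> float:
    """Chance that at least one of k attempts succeeds.

    Assumes (hypothetically) that attempts are independent and each
    succeeds with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    # All k attempts fail with probability (1 - p) ** k.
    return 1.0 - (1.0 - p) ** k


# Illustrative only: a model that solves a task 10% of the time per attempt.
for k in (1, 5, 20):
    print(k, round(prob_at_least_one_success(0.1, k), 3))
```

Under those assumptions, a model that fails nine times out of ten in a single run gets to about 41% with five attempts and about 88% with twenty, so a single failed run in a benchmark says little about what is reachable when retries are cheap.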
Agreed. But, at least as of yesterday, dsv4 was only served by deepseek. And, more importantly, that's what the "average" experience would be if you set up something easy like openrouter. Sure, with proper tuning and so on you can be sure you're getting the model at its best. But are you, if you just set up openrouter and go brrr? Maybe. Maybe not.