LMArena is so easy to game that it ceased to be a relevant metric over a year ago. People are not usable validators beyond "yeah, that looks good to me"; nobody checks whether the facts are actually correct.
It's easy to game, and human evaluation data has its trade-offs, but it's far easier to fake public benchmark results. I wish we had a source of high-quality private benchmark results covering as many models as LMArena does. Having high-quality human evaluation data on top of that would be a plus.
I agree; LMArena died for me with the Llama 4 debacle. It wasn't only the gamed scores, but also seeing, with shock and horror, which answers people rated as good. It does test something, though: the general "vibe", and how human-friendly and knowledgeable a model _seems_ to be.
Alibaba maintains its own separate version of LMArena, where the prompts are fixed and you simply judge the outputs:
https://aiarena.alibaba-inc.com/corpora/arena/leaderboard