Hacker News

moffkalast · today at 5:41 PM · 3 replies

LM Arena is so easy to game that it ceased to be a relevant metric over a year ago. People are not reliable validators beyond "yeah, that looks good to me"; nobody checks whether the facts are actually correct.


Replies

culi · today at 7:02 PM

Alibaba maintains its own separate version of LM Arena where the prompts are fixed and you simply judge the outputs:

https://aiarena.alibaba-inc.com/corpora/arena/leaderboard
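For context on what these arenas actually measure: leaderboards in the LM Arena style aggregate pairwise human votes into model ratings, commonly with an Elo or Bradley-Terry style update. A minimal Elo sketch follows; the model names, starting ratings, and K-factor are illustrative assumptions, not the site's actual parameters.

```python
# Sketch of turning pairwise human votes into leaderboard ratings
# via an Elo update. Hypothetical models and K-factor for illustration.

def elo_update(r_a, r_b, winner, k=32):
    """Update two ratings after one vote; winner is 'a', 'b', or 'tie'."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Both models start at the same rating; two votes come in.
ratings = {"model-x": 1000.0, "model-y": 1000.0}
votes = [("model-x", "model-y", "a"), ("model-x", "model-y", "tie")]
for a, b, winner in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], winner)

print(sorted(ratings, key=ratings.get, reverse=True))
```

This is also why the gaming complaint upthread bites: the update only sees which answer the voter preferred, not whether either answer was factually correct.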

nabakin · today at 6:00 PM

It's easy to game, and human evaluation data has its trade-offs, but it's far easier to fake public benchmark results. I wish we had a source of high-quality private benchmark results across a vast number of models, the way LMArena covers them. Having high-quality human evaluation data would be a plus too.

jug · today at 6:03 PM

I agree; LMArena died for me with the Llama 4 debacle. It wasn't only the gamed scores, but seeing, with shock and horror, the answers people rated as good. It does test something, though: the general "vibe", and how human, friendly, and knowledgeable a model _seems_ to be.