LMArena is so easy to game that it ceased to be a relevant metric over a year ago. People are not usable validators beyond "yeah, that looks good to me"; nobody checks whether the facts are actually correct.
It's easy to game, and human evaluation data has its trade-offs, but it's far easier to fake public benchmark results. I wish we had a source of high-quality private benchmark results covering as many models as LMArena does. Having high-quality human evaluation data on top of that would be a plus.
I agree; LMArena died for me with the Llama 4 debacle. It wasn't only the gamed scores, but also seeing, with shock and horror, which answers people rated as good. It does test something, though: the general "vibe", and how human-friendly and knowledgeable a model _seems_ to be.
Alibaba maintains its own separate version of LMArena, where the prompts are fixed and you simply judge the outputs:
https://aiarena.alibaba-inc.com/corpora/arena/leaderboard