logoalt Hacker News

hnfongtoday at 4:07 AM0 repliesview on HN

There's a leaderboard that measures user experience, the "lmsys" Chatbot Arena Leaderboard ( https://huggingface.co/spaces/lmarena-ai/lmarena-leaderboard ). Main issue with it these days are that it kinda measures sycophancy and user preferred tone more than substance.

Some issues you mentioned like length of response might be user preference. Other issues like "hallucination" are areas of active research (and there are benchmarks for these).