There's a leaderboard that measures user experience, the "lmsys" Chatbot Arena Leaderboard (https://huggingface.co/spaces/lmarena-ai/lmarena-leaderboard). The main issue with it these days is that it kind of measures sycophancy and user-preferred tone more than substance.
Some of the issues you mentioned, like response length, might come down to user preference. Others, like "hallucination", are areas of active research (and there are benchmarks for these).