Careful with that benchmark. It's LLMs grading other LLMs.
Well, if lmsys showed anything, it's that human judges are measurably worse. Then you have your run-of-the-mill multiple-choice tests that grade models on unrealistic single-token outputs. What does that leave us with?
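(For context on that last point, here is a minimal sketch of how single-token multiple-choice grading typically works in MMLU-style benchmarks. The question, choices, and log-probability values are hypothetical stand-ins, not taken from any real benchmark or model.)

```python
# Sketch of single-token multiple-choice scoring (MMLU-style).
# The model is never asked to reason or write an answer; it is graded
# solely on which single answer-letter token it ranks highest.

question = "Which gas makes up most of Earth's atmosphere?"
choices = {"A": "Oxygen", "B": "Nitrogen", "C": "Argon", "D": "Carbon dioxide"}

# Hypothetical next-token log-probabilities the model assigns to the
# single tokens "A", "B", "C", "D" after the formatted prompt.
letter_logprobs = {"A": -2.1, "B": -0.4, "C": -3.0, "D": -3.5}

# The "answer" is just the argmax over four one-token options.
predicted = max(letter_logprobs, key=letter_logprobs.get)
correct = "B"
print(f"predicted={predicted} ({choices[predicted]}), correct={predicted == correct}")
```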