Careful with that benchmark. It's LLMs grading other LLMs.
Well, if lmsys showed anything, it's that human judges are measurably worse. Then you have your run-of-the-mill multiple-choice tests that grade models on unrealistic single-token outputs. What does that leave us with?
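(For context on that last point, here is a minimal sketch of how single-token multiple-choice grading typically works in MMLU-style benchmarks. The question, choices, and log-probability values are hypothetical stand-ins, not taken from any real benchmark or model.)

```python
# Sketch of single-token multiple-choice scoring (MMLU-style).
# The model is never asked to reason or write an answer; it is graded
# solely on which single answer-letter token it ranks highest.

question = "Which gas makes up most of Earth's atmosphere?"
choices = {"A": "Oxygen", "B": "Nitrogen", "C": "Argon", "D": "Carbon dioxide"}

# Hypothetical next-token log-probabilities the model assigns to the
# single tokens "A", "B", "C", "D" after the formatted prompt.
letter_logprobs = {"A": -2.1, "B": -0.4, "C": -3.0, "D": -3.5}

# The "answer" is just the argmax over four one-token options.
predicted = max(letter_logprobs, key=letter_logprobs.get)
correct = "B"
print(f"predicted={predicted} ({choices[predicted]}), correct={predicted == correct}")
```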