Hacker News

ranyume · last Sunday at 6:47 PM

Careful with that benchmark. It's LLMs grading other LLMs.


Replies

moffkalast · last Sunday at 6:53 PM

Well, if lmsys showed anything, it's that human judges are measurably worse. Then you have your run-of-the-mill multiple-choice tests that grade models on unrealistic single-token outputs. What does that leave us with?
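
For context, a minimal sketch of what "single-token" multiple-choice grading typically means: the model is scored only on which answer letter it assigns the highest probability to, with no credit for reasoning. The token_logprobs interface below is hypothetical, standing in for whatever model-specific call produces per-token log-probabilities.

    from typing import Callable, Dict, List

    CHOICES = ["A", "B", "C", "D"]

    def grade_multiple_choice(
        questions: List[Dict],  # each item: {"prompt": str, "answer": "A".."D"}
        token_logprobs: Callable[[str, List[str]], Dict[str, float]],
        # ^ assumed interface: returns a log-probability for each candidate
        #   next token given the prompt (how it is obtained is model-specific)
    ) -> float:
        """Accuracy when only the single answer-letter token is graded."""
        correct = 0
        for q in questions:
            scores = token_logprobs(q["prompt"], CHOICES)
            predicted = max(scores, key=scores.get)  # most likely letter
            correct += predicted == q["answer"]
        return correct / len(questions)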
