Hacker News

jmye today at 12:37 AM

> I'm not sure how groundbreaking the main insight is.

I think it likely is groundbreaking for a number of people (especially non-tech CTOs and VPs) who make decisions based on these benchmarks and who have never wondered what the scores are actually scoring.


Replies

mzelling today at 2:58 AM

I'm not sure if the paper's findings are all that actionable. The paper doesn't say "here's how benchmarks are currently being gamed." It says "here's how benchmarks could in theory be gamed."

Whether benchmark results are misleading depends more on the reporting organization than on the benchmark. Integrity and competence play large roles in this. When OpenAI reports a benchmark number, I trust it more than when that same number is reported by a couple Stanford undergrads posting "we achieved SOTA on XYZ benchmark" all over Twitter.
