Hacker News

operatingthetan, yesterday at 6:28 PM (1 reply)

Would creating new benchmarks every month solve this problem?


Replies

preciousoo, yesterday at 6:44 PM

Or create "blind" benchmarks.

Have 10 groups of 3 researchers, each with its own benchmark that it does not share. (Running the tests without the model authors knowing is a different problem; maybe the groups only run their benchmarks once the general public has access to the models.)

That's 10 different tests. Aggregate the pass rates.
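
To make "aggregate pass rates" concrete, here is a rough sketch, assuming each group reports only a (passed, total) pair and keeps its test items private. The function name and the numbers are made up for illustration. It computes two possible aggregates, since the comment doesn't say which one is meant: the plain mean of the per-group rates, and the pooled rate over all items.

```python
from statistics import mean


def aggregate_pass_rates(results):
    """Combine per-group results into two summary pass rates.

    results: list of (passed, total) pairs, one per private benchmark.
    Returns (macro, pooled):
      macro  = mean of each group's pass rate (every benchmark counts equally)
      pooled = total passed / total items (larger benchmarks weigh more)
    """
    per_group = [passed / total for passed, total in results]
    macro = mean(per_group)
    pooled = sum(p for p, _ in results) / sum(t for _, t in results)
    return macro, pooled


# Hypothetical reports from 10 groups; only these counts are ever shared.
group_results = [
    (40, 50), (33, 60), (18, 25), (70, 100), (12, 20),
    (45, 80), (27, 30), (51, 75), (22, 40), (64, 90),
]

macro, pooled = aggregate_pass_rates(group_results)
print(f"macro-average pass rate: {macro:.3f}")
print(f"pooled pass rate:        {pooled:.3f}")
```

The macro average keeps one large benchmark from dominating the result, which seems closer to the "10 different tests" framing, but that is a judgment call.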