Hacker News

tedsanders, yesterday at 10:03 PM

Very cool! So glad to see people building and sharing evals that are better than SWE-bench.

I'm curious - any particular reason you didn't put error bars on the graphs? Seems like it could be helpful when there are only 50 unique problems in the diamond set.
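For a rough sense of scale, here is a minimal sketch of how wide a 95% interval would be with 50 problems. It assumes each problem is an independent pass/fail trial and uses the normal approximation, so it ignores the per-problem rubrics and any repeated runs; the function name is made up for illustration.

```python
import math

def binomial_ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation interval for a pass rate p over n problems."""
    return z * math.sqrt(p * (1.0 - p) / n)

# Illustrative pass rates only, not results from the benchmark.
for p in (0.3, 0.5, 0.7):
    hw = binomial_ci_half_width(p, n=50)
    print(f"pass rate {p:.0%}: +/- {hw * 100:.1f} points")
```

Under those assumptions, a model scoring around 50% would carry an interval of roughly 14 points on either side, which is why error bars matter at this sample size.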


Replies

swyx, yesterday at 10:14 PM

*50 unique problems but 20-40 rubrics per problem (something I had to keep reminding people internally who were unimpressed with the N)

simple answer is our reporting was pass@5. feel like you'd need like 50+ runs to have reasonable confidence intervals, which somehow i don't see other people do, so i also didn't insist on it.
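one cheap option, sketched below under assumptions: compute pass@k per problem with the standard unbiased estimator, 1 - C(n-c, k)/C(n, k), then bootstrap over problems to get a percentile interval. the pass counts here are synthetic placeholders, the helper names are hypothetical, and this is not how the benchmark was actually scored. it also only captures variance from which problems are in the set, not run-to-run variance.

```python
import math
import random

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n runs of which c passed."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def bootstrap_ci(per_problem, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over problems for the mean per-problem score."""
    rng = random.Random(seed)
    m = len(per_problem)
    means = sorted(
        sum(rng.choices(per_problem, k=m)) / m for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic placeholder data (NOT real results): 50 problems, 10 runs each.
data_rng = random.Random(1)
n_runs = 10
pass_counts = [data_rng.randint(0, n_runs) for _ in range(50)]

scores = [pass_at_k(n_runs, c, k=5) for c in pass_counts]
mean = sum(scores) / len(scores)
lo, hi = bootstrap_ci(scores)
print(f"pass@5 = {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

resampling problems rather than individual runs keeps the 50 problems as the unit of analysis, which matches the concern about the small N.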

hoping to work with <prominent third party evals shop> to get this on their infra and evaluated along with whatever the industry standard is.
