
tedsanders • yesterday at 10:03 PM

Very cool! So glad to see people building and sharing evals that are better than SWE bench.

I'm curious - any particular reason you didn't put error bars on the graphs? Seems like it could be helpful when there are only 50 unique problems in the diamond set.
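To give a sense of scale for the error-bar question, here is a minimal sketch of a Wilson score interval for a pass rate measured on 50 problems. It assumes each problem is a single independent pass/fail trial, which ignores the per-problem rubric scoring mentioned in the reply, and the count of 25 passes is a made-up illustration, not a result from the eval.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical example: 25 of 50 problems passed.
low, high = wilson_interval(25, 50)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under those assumptions the interval is roughly 13 percentage points either side of a 50% score, so gaps between models smaller than that would be hard to distinguish on 50 problems alone.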

Replies

swyx • yesterday at 10:14 PM

*50 unique problems but 20-40 rubrics per problem (something I had to keep reminding people internally who were unimpressed with the N)

simple answer is our reporting was pass@5. I feel like you'd need 50+ runs per problem to get reasonable confidence intervals, which somehow I don't see other people do, so I also didn't insist on it.

hoping to work with <prominent third party evals shop> to get this on their infra and evaluated along with whatever the industry standard is.
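One way to get error bars on a pass@5 number, sketched under assumptions: estimate pass@k per problem from n runs with c passes using the commonly used unbiased combinatorial estimator, then bootstrap over problems for a percentile interval on the mean. The thread does not say this is how the eval was scored. The function names, the run counts, and the resample count are all hypothetical, and rubric-level scoring is not modeled.

```python
import math
import random


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n runs of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n runs contains no pass.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean, resampling problems."""
    rng = random.Random(seed)
    m = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=m)) / m for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Hypothetical data: 50 problems, 10 runs each, made-up pass counts.
rng = random.Random(1)
pass_counts = [rng.randint(0, 10) for _ in range(50)]
scores = [pass_at_k(10, c, 5) for c in pass_counts]
print(sum(scores) / len(scores), bootstrap_ci(scores))
```

Resampling at the problem level captures variation from which 50 problems were chosen. It does not capture run-to-run noise within a problem, which is where more runs per problem would help.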
