Hacker News

nikcub · yesterday at 10:54 PM

> I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score.

I assume this is because of model costs. Anthropic could either throw some credits their way (it would be worthwhile to dispel the 80 Reddit posts a day about degrading models and quantization), or OP could put up a donation / tip link.


Replies

simsla · yesterday at 11:56 PM

Probably, but with a small sample size like that, they should be taking the uncertainty into account; I wouldn't be surprised if a lot of this variation falls within expected noise.

E.g. binomial proportion confidence intervals.
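
To make that concrete, here's a minimal sketch of the kind of interval I mean, assuming a hypothetical single run of 300 tasks with a made-up 72% pass rate (the Wilson score interval is one standard choice for a binomial proportion):

    import math

    def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """95% Wilson score interval for a binomial proportion (e.g. a pass rate)."""
        p = successes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return center - half, center + half

    # Hypothetical numbers: 216 passes out of 300 tasks on a single run.
    lo, hi = wilson_interval(216, 300)
    print(f"pass rate 72.0%, 95% CI: {lo:.1%} to {hi:.1%}")
    # -> roughly 66.7% to 76.8%, so a swing of a few points between
    #    runs can sit comfortably inside a single run's interval.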

phist_mcgee · yesterday at 10:57 PM

Then you'd get people claiming that the benchmarks were 'paid for' by anthropic
