[SWE-bench co-author here] It seems like they run this test on a subset of 50 tasks, and that they only run the test once per day. So a lot of the movement in accuracy could be attributed to that. I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score. Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.
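To put rough numbers on the sample-size point, here is a minimal sketch of the sampling noise you'd expect. It assumes each task attempt is an independent pass/fail trial with the same pass probability, which is a simplification (real tasks differ in difficulty, and reruns on the same tasks are correlated). The function name and the 60% pass rate are hypothetical, chosen only for illustration.

```python
import math


def pass_rate_std_error(p: float, n_tasks: int, n_runs: int = 1) -> float:
    """Standard error of the mean pass rate.

    Assumption: every task attempt is an independent Bernoulli(p) trial,
    so averaging n_runs runs over n_tasks tasks gives n_tasks * n_runs trials.
    """
    return math.sqrt(p * (1.0 - p) / (n_tasks * n_runs))


if __name__ == "__main__":
    p = 0.6  # hypothetical pass rate, not a measured value
    for n_tasks, n_runs in [(50, 1), (300, 1), (300, 5), (300, 10)]:
        se = pass_rate_std_error(p, n_tasks, n_runs)
        # 1.96 standard errors is the usual normal-approximation 95% interval
        print(f"{n_tasks} tasks x {n_runs} runs: +/- {1.96 * se * 100:.1f} points")
```

Under these assumptions, a single 50-task run at a 60% pass rate is only good to about ±14 points, versus roughly ±5.5 points for a single 300-task run, so day-to-day swings of a few points on 50 tasks are well within noise.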
> I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score.
I assume this is because of model costs. Anthropic could either throw some credits their way (it would be worthwhile to dispel the 80 Reddit posts a day about degrading models and quantization) or OP could put up a donation / tip link.
The degradation may vary more within a single day than between days measured at the same time.
Hope you don't mind the unrelated question:
How do you pay for those SWE-bench runs?
I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.
Sorry what?
"You can't measure my Cloud Service's performance correctly if my servers are overloaded"?
"Oh, you just measured me at bad times each day. On only 50 different queries."
So, what does that mean? I have to pick specific times during the day for Claude to code better?
Does Claude Code have office hours basically?
According to Anthropic: "We never reduce model quality due to demand, time of day, or server load."
https://www.anthropic.com/engineering/a-postmortem-of-three-...
Still relevant over time.
Agreed, this benchmark would be much more useful if it were run multiple times a day. That could reveal degradation in line with load patterns.
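As a rough sketch of what that analysis could look like: bucket each run's score by the hour it was started and compare the per-hour averages. The function name and the `(datetime, score)` record format are assumptions for illustration, not how the benchmark actually stores its results, and the numbers below are made up.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mean_score_by_hour(runs):
    """Average score per hour of day.

    runs: iterable of (datetime, score) pairs (hypothetical record format).
    Returns a dict {hour: mean score}, ordered by hour.
    """
    buckets = defaultdict(list)
    for started_at, score in runs:
        buckets[started_at.hour].append(score)
    return {hour: mean(scores) for hour, scores in sorted(buckets.items())}


if __name__ == "__main__":
    # Made-up scores, only to show the shape of the data.
    runs = [
        (datetime(2025, 1, 1, 3, 0), 0.62),
        (datetime(2025, 1, 1, 15, 0), 0.58),
        (datetime(2025, 1, 2, 3, 0), 0.64),
        (datetime(2025, 1, 2, 15, 0), 0.56),
    ]
    for hour, avg in mean_score_by_hour(runs).items():
        print(f"{hour:02d}:00  {avg:.2f}")
```

A consistent gap between peak and off-peak hours that is larger than the run-to-run noise would be the signal to look for; a gap within the noise would not tell you much either way.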
> Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.
Are you suggesting result accuracy varies with server load?
"Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded"
Aha, so the models do degrade under load.
But degradation from servers being overloaded would be the type of degradation this SHOULD measure, no? Unless it's only intended to measure them quietly distilling models (which they claim not to do? idk for certain)