Hacker News

ofirpress · yesterday at 3:16 PM

[SWE-bench co-author here] It seems like they run this test on a subset of 50 tasks, and only once per day. So a lot of the movement in accuracy could just be sampling noise. I would run on 300 tasks, run the test suite 5 or 10 times per day, and average those scores. Lots of variance in the score can come from random stuff, like even Anthropic's servers being overloaded.
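
To put numbers on the sampling noise, here is a quick back-of-envelope sketch in Python. The 70% pass rate is a made-up illustrative figure, not a real score, and it treats tasks as independent coin flips (a simplification); it just shows how the binomial standard error shrinks going from 50 to 300 tasks:

    import math

    def stderr(p, n):
        # Standard error of an observed pass rate p over n independent tasks.
        return math.sqrt(p * (1 - p) / n)

    p = 0.70  # hypothetical "true" pass rate (illustrative assumption)
    for n in (50, 300):
        se = stderr(p, n)
        # Roughly 95% of observed scores land within ~2 standard errors of p.
        print(f"n={n}: SE = {se:.3f} (~ +/-{2 * se * 100:.1f} points)")

At 50 tasks that works out to swings of roughly +/-13 points at the 95% level, versus about +/-5 points at 300 tasks, even if the model never changes. Averaging 5 or 10 runs per day would further shrink the run-to-run component of the noise.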


Replies

Davidzheng · yesterday at 3:43 PM

But degradation from servers being overloaded is exactly the kind of degradation this SHOULD measure, no? Unless it's only intended to catch them quietly swapping in distilled models (which they claim not to do? idk for certain).

nikcub · yesterday at 10:54 PM

> I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score.

Assume this is because of model costs. Anthropic could either throw some credits their way (it would be worthwhile to dispel the 80 Reddit posts a day about degrading models and quantization), or the OP could throw up a donation/tip link.
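
For a sense of scale, a rough cost estimate; the per-task cost here is a placeholder assumption, not a real price:

    # All figures are placeholder assumptions, not actual prices.
    cost_per_task = 1.50    # USD per agentic task run (assumed)
    tasks = 300
    runs_per_day = 10

    daily = cost_per_task * tasks * runs_per_day
    print(f"~${daily:,.0f}/day, ~${daily * 30:,.0f}/month")  # ~$4,500/day

At anything like those figures, the fuller 300-task, multi-run setup gets expensive fast for an unsponsored side project.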

seunosewa · yesterday at 4:22 PM

The degradation may be more significant within a single day than from day to day at the same hour; a once-daily run at a fixed time would miss that pattern.

mohsen1 · yesterday at 3:18 PM

Hope you don't mind the unrelated question:

How do you pay for those SWE-bench runs?

I am trying to run a benchmark, but it is too expensive to do enough runs to get a fair comparison.

https://mafia-arena.com

rootnod3 · yesterday at 4:43 PM

Sorry what?

"You can't measure my Cloud Service's performance correctly if my servers are overloaded"?

"Oh, you just measured me at bad times each day. On only 50 different queries."

So, what does that mean? I have to pick specific times during the day for Claude to code better?

Does Claude Code have office hours, basically?

bhk · yesterday at 7:05 PM

According to Anthropic: "We never reduce model quality due to demand, time of day, or server load."

https://www.anthropic.com/engineering/a-postmortem-of-three-...

epolanski · yesterday at 3:50 PM

Still relevant over time.

cedws · yesterday at 3:38 PM

Agreed, this benchmark would be much more useful run multiple times a day. That could reveal degradation in line with load patterns.
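
As a minimal sketch of what that could look like (all data below is hypothetical): schedule several runs per day, then group scores by hour to see whether dips line up with peak-load hours.

    from collections import defaultdict

    # Hypothetical (hour_utc, score) log from several runs per day.
    runs = [(3, 0.72), (9, 0.70), (15, 0.64), (21, 0.66),
            (3, 0.71), (9, 0.69), (15, 0.63), (21, 0.67)]

    by_hour = defaultdict(list)
    for hour, score in runs:
        by_hour[hour].append(score)

    for hour in sorted(by_hour):
        scores = by_hour[hour]
        print(f"{hour:02d}:00 UTC: mean={sum(scores) / len(scores):.3f} (n={len(scores)})")

A consistent dip at the same hours across many days would be much stronger evidence of load-linked degradation than any single day's score.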

chrisjj · yesterday at 4:24 PM

> Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.

Are you suggesting result accuracy varies with server load?

dana321 · yesterday at 3:55 PM

"Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded"

Aha, so the models do degrade under load.