ofirpress • yesterday at 3:16 PM • 10 replies

[SWE-bench co-author here] It seems like they run this test on a subset of 50 tasks, and that they only run the test once per day. So a lot of the movement in accuracy could be attributed to that. I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score. Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.
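A rough sketch of why the sample size matters, under a loudly simplified assumption: every task attempt is treated as an independent coin flip with the same pass probability. The 0.6 pass rate below is a made-up illustrative number, not a figure from the benchmark, and `pass_rate_std_error` is a hypothetical helper. Repeated runs on the same tasks are correlated in practice, so the real gain from re-running is likely smaller than this suggests.

```python
import math


def pass_rate_std_error(p: float, n_tasks: int, n_runs: int = 1) -> float:
    """Standard error of an averaged pass rate.

    Assumes (unrealistically) that all n_tasks * n_runs attempts are
    independent Bernoulli trials with the same pass probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n_tasks <= 0 or n_runs <= 0:
        raise ValueError("n_tasks and n_runs must be positive")
    return math.sqrt(p * (1.0 - p) / (n_tasks * n_runs))


if __name__ == "__main__":
    p = 0.6  # hypothetical pass rate, for illustration only
    for n_tasks, n_runs in [(50, 1), (300, 5), (300, 10)]:
        se = pass_rate_std_error(p, n_tasks, n_runs)
        # 1.96 standard errors is the usual normal-approximation 95% half-width
        print(f"{n_tasks} tasks x {n_runs} runs: "
              f"std error {se:.4f}, 95% half-width {1.96 * se:.4f}")
```

Under these assumptions, a single run on 50 tasks has a 95% interval of roughly ±14 percentage points, while 300 tasks averaged over 5 runs gives roughly ±2.5 points, so day-to-day swings of several points on the smaller setup could be noise alone.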

Replies

Davidzheng • yesterday at 3:43 PM

But degradation from servers being overloaded would be the type of degradation this SHOULD measure, no? Unless it's only intended to measure them quietly distilling models (which they claim not to do? idk for certain)

nikcub • yesterday at 10:54 PM

> I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score.

I assume this is because of model costs. Anthropic could either throw some credits their way (it would be worthwhile to dispel the 80 Reddit posts a day about degrading models and quantization) or OP could put up a donation / tip link.

seunosewa • yesterday at 4:22 PM

The degradation may be more significant within the day than at the same time every day.

mohsen1 • yesterday at 3:18 PM

Hope you don't mind the unrelated question:

How do you pay for those SWE-bench runs?

I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.

https://mafia-arena.com

rootnod3 • yesterday at 4:43 PM

Sorry what?

"You can't measure my Cloud Service's performance correctly if my servers are overloaded"?

"Oh, you just measured me at bad times each day. On only 50 different queries."

So, what does that mean? I have to pick specific times during the day for Claude to code better?

Does Claude Code have office hours basically?

bhk • yesterday at 7:05 PM

According to Anthropic: "We never reduce model quality due to demand, time of day, or server load."

https://www.anthropic.com/engineering/a-postmortem-of-three-...

epolanski • yesterday at 3:50 PM

Still relevant over time.

cedws • yesterday at 3:38 PM

Agreed, this benchmark would be much more useful if it were run multiple times a day. That could reveal degradation in line with load patterns.

chrisjj • yesterday at 4:24 PM

> Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.

Are you suggesting result accuracy varies with server load?

dana321 • yesterday at 3:55 PM

"Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded"

Aha, so the models do degrade under load.
