How credible is this benchmark? Does it correlate with others' real-world experience?

zzleeper • yesterday at 5:25 PM • 8 replies

Replies

Given it was made by Cognition (the team behind the Devin flop), who now just have to wait until Claude and GPT-5 basically do all of the work for them: not very. When you read about it, the framework is highly subjective, which very quickly becomes a problem because it's based on heuristics that will probably change a lot with a better code model.

vanuatu • yesterday at 6:00 PM

i worked on one of the benchmarks typically found in new model releases

this benchmark looks very good on methodology. a cog researcher checking the data themselves is very high signal (not scalable, so don't take the benchmark as gospel, but directionally good)

Catloafdev • yesterday at 5:29 PM

It's a relatively new benchmark but from what I can tell it has serious cred behind it. I assume it will be picked up as part of the standard suite of CS-related benchmarks soon enough.

schipperai • yesterday at 6:29 PM

Cognition did well in documenting their approach [1].

TL;DR - they worked with OSS project maintainers to build tasks. They score models based on whether a PR is mergeable. All tasks are graded by a human researcher. SoTA models still have hill-climbing to do, which raises the bar and inspires confidence. I'd say it's legit.

[1]: https://x.com/cognition/status/2064061031912288715

CSMastermind • yesterday at 9:17 PM

DeepSWE is the benchmark you actually want to look out for. It's the only one that aligns with actual user-reported results from trying the models.

emp17344 • yesterday at 5:29 PM

Seems like it literally popped up yesterday with the express purpose of building hype for this release.

shimman • today at 3:07 AM

It's an unacademic benchmark by a failed VC startup clawing for relevancy.

piphf • today at 5:23 PM

[dead]
