
zzleeper yesterday at 5:25 PM

How credible is this benchmark? Does it correlate with others' real-world experience?


Replies

bfeynman yesterday at 5:59 PM

Given it was made by Cognition (the team behind the Devin flop), who now just have to wait until Claude and GPT-5 basically do all of the work for them - not very. When you read about it, the framework is highly subjective, which very quickly becomes a problem because it's based on heuristics that will probably change a lot with a better code model.

vanuatu yesterday at 6:00 PM

i worked on one of the benchmarks typically found in new model releases

this benchmark's methodology looks very good. a cog researcher checking the data themselves is very high signal (not scalable, so don't take the benchmark as gospel, but directionally good)

Catloafdev yesterday at 5:29 PM

It's a relatively new benchmark but from what I can tell it has serious cred behind it. I assume it will be picked up as part of the standard suite of CS-related benchmarks soon enough.

schipperai yesterday at 6:29 PM

Cognition did well in documenting their approach [1].

TL;DR - they worked with OSS project maintainers to build tasks. They score models based on whether a PR is mergeable. All tasks are graded by a human researcher. SoTA models still have hill-climbing to do, which raises the bar and inspires confidence. I'd say it's legit.

[1]: https://x.com/cognition/status/2064061031912288715

CSMastermind yesterday at 9:17 PM

DeepSWE is the benchmark you actually want to look out for. It's the only one that aligns with actual user-reported results from trying the models.

emp17344 yesterday at 5:29 PM

Seems like it literally popped up yesterday with the express purpose of building hype for this release.

shimman today at 3:07 AM

It's an unacademic benchmark by a failed VC startup clawing for relevancy.
