> The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository.
This seems like a really cool thing to benchmark! Technically it'd be possible to take GitHub repos that the AI orgs probably already have, cross-reference the code against the issues and regressions, and train/validate on that (rough sketch below).
The dataset would need to be way bigger to get close to the likes of SWE-bench: https://www.swebench.com/original.html
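Something like this, purely as a back-of-the-envelope sketch on my part (the repo name, token, and the "fixes #N" commit-message heuristic are all my assumptions, not anything from the benchmark), could pair closed bug issues with the commits that reference them via GitHub's REST API:

```python
import re
import requests

OWNER, REPO = "example-org", "example-repo"  # hypothetical repo
API = f"https://api.github.com/repos/{OWNER}/{REPO}"
HEADERS = {"Authorization": "Bearer <token>",  # placeholder token
           "Accept": "application/vnd.github+json"}

def fetch(path, params=None):
    """Fetch one page of results from the GitHub REST API."""
    resp = requests.get(f"{API}/{path}", headers=HEADERS, params=params or {})
    resp.raise_for_status()
    return resp.json()

# Closed bug-labeled issues are candidate regressions / bug reports.
# The issues endpoint also returns PRs, so filter those out.
issues = fetch("issues", {"state": "closed", "labels": "bug", "per_page": 100})
bug_numbers = {i["number"] for i in issues if "pull_request" not in i}

# Walk the commit history and link commits whose message references
# one of those issues (e.g. "fixes #123") back to that issue.
commits = fetch("commits", {"per_page": 100})
issue_ref = re.compile(r"(?:fix(?:es|ed)?|close[sd]?)\s+#(\d+)", re.IGNORECASE)

pairs = []
for commit in commits:
    message = commit["commit"]["message"]
    for num in issue_ref.findall(message):
        if int(num) in bug_numbers:
            pairs.append((commit["sha"], int(num)))

# Each (commit, issue) pair is a candidate training/eval example:
# the issue text as the task, the commit diff as the reference fix.
print(f"linked {len(pairs)} commit/issue pairs")
```

You'd still need pagination, deduplication, and a lot of filtering to turn this into anything usable, which is where the scale problem above comes in.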
"Vibe coded stuff gets hard to maintain and will end up buggy." Yeah, so make models that deal with that better, optimize for maintainability and consistency.
Cool to see Claude doing decently though!
> Cool to see Claude doing decently though!
The scales do seem to be tipped in its favor (cf. my other comment in this thread).