Hacker News

vanuatu today at 12:16 AM

This benchmark matches my experience with GPT. (I occasionally go back to Claude when I run into limits, and frequently run into forgotten requirements and reward hacking.)

I do have two questions / critiques:

- The verifier doesn't seem to check for code quality or maintainability, which I would posit is one of the major qualms with SOTA coding models, i.e. they lack code 'taste'. Of course this is a difficult problem to solve at scale, but I wanted to point it out nonetheless.

- This almost reads like a critique of SWE Bench Pro. Hopefully they fix the issues with that benchmark!


Replies

vanuatu today at 12:27 AM

Out of curiosity, I examined the worst task:

https://deepswe.datacurve.ai/data/trials/quill-shared-toolba...

It seems like GPT is failing here because of an environment issue connecting to Chromium, even though its local unit tests passed. All the models failed 4/4, and when I checked Opus, it had run into the same problem.

I checked some other tasks and they seemed legit, although in general the prompts seem somewhat contrived compared to what a typical user would ask their coding agent (such is the difficulty of benchmark construction).