Out of curiosity, I examined the worst task:

vanuatu • today at 12:27 AM

https://deepswe.datacurve.ai/data/trials/quill-shared-toolba...

It seems GPT is failing here because of an environment issue connecting to Chromium, even though its local unit tests passed. All the models failed 4/4, and when I checked Opus, it had run into the same problem.

I checked some other tasks and they seemed legit, although in general the prompts seem somewhat contrived compared with what a typical user would ask their coding agent (such is the difficulty of benchmark construction).
