jaen • today at 5:33 PM

Judging by the (lack of) inter-model variance alone, I don't think SWEBench-Pro does a very good job of representing model capability. Terminal-Bench seems more challenging and does a better job of separating the wheat from the chaff.

Also, *ops work, which in my experience can actually be more complicated than SWE, is obviously underrepresented there.
