logoalt Hacker News

papersailtoday at 1:00 PM1 replyview on HN

I'm not sure I would put too much weight on DeepSWE as a benchmark, given that GPT-5.4-mini ended up close to Opus 4.6 there.


Replies

DCKingtoday at 1:13 PM

Any benchmark is iffy and has weird results, but this is the best we got at the moment. Most people working with Opus and Kimi would likely tell you they're much further apart than the numbers that were quoted for Kimi K2.6, and DeepSWE seems to capture that gap better.

One major thing DeepSWE has going for it is that all other benchmarks (including those quoted by MoonshotAI on this page) don't: the other benchmarks that are completely gamed. The benchmark answers are public and part of each model's training data. This benchmark may still be iffy, but at least it's not gamed.

show 1 reply