papersail • today at 1:00 PM

I'm not sure I would put too much weight on DeepSWE as a benchmark, given that GPT-5.4-mini ended up close to Opus 4.6 there.

Replies

DCKing • today at 1:13 PM

Any benchmark is iffy and has weird results, but this is the best we've got at the moment. Most people working with Opus and Kimi would likely tell you they're much further apart than the numbers quoted for Kimi K2.6 suggest, and DeepSWE seems to capture that gap better.

One major thing DeepSWE has going for it that all the other benchmarks (including those quoted by MoonshotAI on this page) don't: the other benchmarks are completely gamed. Their answers are public and part of each model's training data. DeepSWE may still be iffy, but at least it's not gamed.
