apetresc | yesterday at 6:03 PM | 2 replies

Impressive that they publish and acknowledge the (tiny, but existent) drop in performance on SWE-Bench Verified between Opus 4.5 and 4.6. Obviously such a small drop in a single benchmark is not that meaningful, especially if it doesn't test the specific focus areas of this release (which seem to center on managing larger context).

But considering how SWE-Bench Verified seems to be the tech press' favourite benchmark to cite, it's surprising that they didn't try to confound the inevitable "Opus 4.6 Releases With Disappointing 0.1% DROP on SWE-Bench Verified" headlines.


Replies

epolanski | yesterday at 8:09 PM

From my limited testing, 4.6 is able to do deeper analysis of codebases and catches bugs and oddities better.

I had two different PRs with some odd edge case (thankfully caught by tests). 4.5 kept running in circles, creating test files and running `node -e` or `python3` scripts all over, and couldn't make progress.

In both cases 4.6 thought and thought for around 10 minutes and found a two-line fix for a very complex, hard-to-catch regression in the data flow, without having to test anything, just by thinking.

SubiculumCode | yesterday at 6:08 PM

Isn't SWE-Bench Verified pretty saturated by now?
