SWE-bench scores well in the narrow task of making tests pass, which means models get good at exactl...

devnotes77 • today at 12:03 AM • 1 reply • view on HN

SWE-bench scores well in the narrow task of making tests pass, which means models get good at exactly that. Real codebases have style constraints, architecture choices, and maintainability concerns that don't show up in any test suite. Not surprised at all that the PRs wouldn't get merged; you'd expect that from an eval that can't measure what reviewers actually care about.

Replies

hrmtst93837 • today at 7:47 AM

Chasing test-passing code is basically an invitation for models to learn all sorts of ugly workarounds and accidental patterns that humans would never tolerate for long. If you optimize only for "does it make CI go green" you'll eventually get code that's impossible to reason about and a codebase that accumulates landmines but the metrics sure look pretty for a while.

alt Hacker News

Replies