SWE-bench scores well in the narrow task of making tests pass, which means models get good at exactly that. Real codebases have style constraints, architecture choices, and maintainability concerns that don't show up in any test suite. Not surprised at all that the PRs wouldn't get merged; you'd expect that from an eval that can't measure what reviewers actually care about.
Chasing test-passing code is basically an invitation for models to learn all sorts of ugly workarounds and accidental patterns that humans would never tolerate for long. If you optimize only for "does it make CI go green" you'll eventually get code that's impossible to reason about and a codebase that accumulates landmines but the metrics sure look pretty for a while.