Yeah this matches what we've seen too. The biggest gains we got weren't from switching models, it was from investing in better context, giving the agent a well structured spec, relevant code samples from the repo, and explicit constraints upfront. Without that, even the best models will happily produce working but unmaintainable code. Feels like the whole SWE-bench framing misses this, passing tests is the easy part, fitting into an existing codebase's patterns and conventions is where it actually gets hard.