Passing tests doesn’t mean you have a working codebase. Benchmarks that rely on a fixed test suite create a real optimization problem: agents (and even humans) learn to satisfy the tests rather than preserve the deeper properties that make a system maintainable. Left to generate its own tests, an AI tends to write cases it finds easy to satisfy, not cases that adhere to the business logic.
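A minimal sketch of the failure mode, using a hypothetical `apply_discount` function (the names and the business rule are invented for illustration): the generated test encodes what the buggy implementation does, not what the business rule says.

```python
# Business rule (hypothetical): orders OVER $100 get a 10% discount.
def apply_discount(total):
    # Bug: boundary uses >= 100, so an order of exactly $100 is
    # discounted, contradicting the stated rule.
    return total * 0.9 if total >= 100 else total

# A test written to satisfy the implementation passes happily --
# it asserts the buggy boundary behavior instead of the rule:
def test_apply_discount():
    assert apply_discount(100) == 90.0  # "green", but wrong per the rule
    assert apply_discount(50) == 50
```

The suite is green, yet the business-logic violation survives, which is exactly the optimization target a fixed test suite rewards.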
We see this firsthand at Prismor with auto-generated security fixes. Even with the best LLMs, validating fixes is the real bottleneck: our pipeline struggles to exceed 70% on an internal golden dataset (which is itself somewhat biased).
Many patches technically fix the vulnerability but introduce semantic regressions or architectural drift. Passing tests is a weak signal, and proving a fix is truly safe to merge is much harder.
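To make "fixes the vulnerability but regresses semantics" concrete, here is a hedged, invented example (the `sanitize` patch is hypothetical, not Prismor's actual pipeline): an auto-generated fix strips quotes to neutralize SQL injection, the security test passes, and legitimate data is silently corrupted.

```python
import re

# Hypothetical auto-generated patch: block injection by stripping
# quote characters from user input before it reaches the query.
def sanitize(name):
    return re.sub(r"['\";]", "", name)

# The vulnerability-focused test passes -- the payload is neutralized:
assert "'" not in sanitize("x' OR '1'='1")

# But legitimate input is mangled, a semantic regression that no
# injection-focused test will ever catch:
assert sanitize("O'Brien") == "OBrien"  # real user data corrupted
```

Both assertions hold, so a test-based validator calls this patch safe; only a reviewer checking the preserved behavior, not just the blocked exploit, would reject it.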