I think it is important to find more rigorous things to test than the general sentiment of the people using the tools, if only because the more benchmarks we have, the more we can improve models without regressions. METR is asking a really interesting question here: "are models improving at making one-shot PRs?" Judging by the pass rates of different versions of Claude Sonnet, the answer seems to be yes, but more slowly than benchmarks suggest. A reasonable response is "you're not supposed to use them by making one-shot PRs", but then ideally we would need some kind of standardized test of models' ability to incorporate feedback and evolve PRs.