SWE's results were actually very close, but they were presented with a poor, marketing-style visualization. I know this isn't a research paper, but I expect more from Anthropic.
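They should have plotted the error rate instead of the pass rate. When pass rates are high, a small gap between them becomes a large relative gap between the error rates, so the chart would get the same visual appeal without cheating.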
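To make the arithmetic concrete, here is a rough sketch in Python. The two pass rates are made-up numbers for illustration, not the actual benchmark results, and the function names are my own.

```python
def error_rate(pass_rate: float) -> float:
    """Error rate is simply the complement of the pass rate."""
    return 1.0 - pass_rate


def relative_change(old: float, new: float) -> float:
    """Change from old to new, expressed as a fraction of old."""
    return (new - old) / old


# Hypothetical pass rates, chosen only to illustrate the point.
pass_a = 0.90
pass_b = 0.95

# Relative improvement when the metric is the pass rate.
pass_gain = relative_change(pass_a, pass_b)

# Relative reduction when the metric is the error rate.
error_cut = -relative_change(error_rate(pass_a), error_rate(pass_b))

print(f"pass rate improvement: {pass_gain:.1%}")
print(f"error rate reduction:  {error_cut:.1%}")
```

With these assumed numbers the pass rate improves by only a few percent, while the error rate is cut roughly in half, so a bar chart of error rates with an axis starting at zero would show a clear gap honestly.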