Hacker News

chaosprint, yesterday at 7:22 PM

The SWE results were actually very close, but they used a poor marketing visualization. I know this isn't a research paper, but I expect more from Anthropic.


Replies

flakiness, yesterday at 10:35 PM

They should've used the error rate instead of the pass rate. Then it would have the same visual appeal without cheating.
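To make the arithmetic behind that concrete, here is a rough sketch with made-up numbers. The 90% and 95% pass rates are hypothetical and are not the figures from the post, and `relative_change` is just a helper named for this example. The idea is that the same absolute gap looks small as a relative change in pass rate but large as a relative change in error rate, with no truncated axis needed.

```python
def relative_change(old: float, new: float) -> float:
    """Change from old to new, expressed as a fraction of old."""
    return (new - old) / old

# Hypothetical pass rates in percent, chosen only for illustration.
pass_old, pass_new = 90.0, 95.0

# Error rate is simply the complement of the pass rate.
err_old, err_new = 100.0 - pass_old, 100.0 - pass_new

pass_change = relative_change(pass_old, pass_new)  # relative gain in pass rate
err_change = relative_change(err_old, err_new)     # relative drop in error rate

print(f"pass rate:  {pass_old:.0f}% -> {pass_new:.0f}%  ({pass_change:+.1%})")
print(f"error rate: {err_old:.0f}% -> {err_new:.0f}%  ({err_change:+.1%})")
```

With these assumed numbers, the pass rate rises by about 5.6% in relative terms while the error rate is cut in half, so a chart of error rates starting at zero would show a large, honest difference.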