yaodub • yesterday at 5:38 PM

SWE-Bench measures single tasks in isolation. In a real loop the model usually loses track of what I was trying to do long before code quality becomes the issue.
