Hacker News

collinwilkins, today at 4:54 PM

at this point it seems like every new model scores within a few points of the others on SWE-bench. the actual differentiator is how well a model handles multi-step tool use without losing the plot halfway through, and how well it works with an existing stack