Hacker News

viraptor, yesterday at 7:13 PM

Has there been any announcement of a new programming benchmark? SWE already looks close to saturation. At this point it may be more interesting to look at which types of issues consistently fail or succeed across model families.