Hacker News

fuddle, yesterday at 7:49 PM (1 reply)

I don't think SWE-bench Verified is an ideal benchmark, since the solutions are in the models' training data.


Replies

joshuahedlund, yesterday at 8:33 PM

I would love for the SWE-bench Verified maintainers to put out a set of fresh but comparable problems and see how the top-performing models do, as a test for overfitting.