
stingraycharles, today at 9:48 AM

“no harness at all” might be an issue, though: these types of benchmarks are often gamed, and models then score well on them without actually being better models.