languid-photic, yesterday at 11:44 PM

makes sense! we wrote something yesterday about the weaknesses of test-based evals like swe-bench [1]

they are definitely useful, but they miss things that are hard to encode in tests, like spec/intent alignment, scope creep, adherence to codebase patterns, and team preferences (risk tolerance, etc.)

and those factors are really important, which means test-based evals should be treated as weak, directional priors rather than as definitive measures of real-world usefulness

[1] https://voratiq.com/blog/test-evals-are-not-enough/