Hacker News

smokel yesterday at 7:25 PM

I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.


Replies

blibble yesterday at 8:50 PM

It ceases to be a useful benchmark of general ability once you post it publicly for them to train against.