smokel • yesterday at 7:25 PM • 1 reply

I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.

Replies

blibble • yesterday at 8:50 PM

it ceases to be a useful benchmark of general ability when you post it publicly for them to train against
