anentropic • today at 10:11 AM

> 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability

I don't know anything about Terminal-Bench, but on the face of it, a 51% score on a test metric doesn't sound like it would guarantee 'unwavering stability' on sophisticated long-horizon tasks.
