logoalt Hacker News

anentropictoday at 10:11 AM0 repliesview on HN

> 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability

I don't know anything about TerminalBench, but on the face of it a 51% score on a test metric doesn't sound like it would guarantee 'unwavering stability' on sophisticated long-horizon tasks