Hacker News

zsoltkacsandi, yesterday at 9:39 PM

No one talked about determinism. The first time, it was able to do the task; the second time, it was not. And it's not as if the implementation details had changed in between.


Replies

baq, yesterday at 9:53 PM

This isn’t how you should be benchmarking models. You should give it the same task n times and see how often it succeeds and/or how long it takes to be successful (see also the 50% time horizon metric by METR).
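For illustration, a rough sketch of what "same task n times" could look like in Python. Everything here is an assumption on my part, not anyone's actual harness: `task` is a hypothetical callable that runs the model once and returns True on success, and `benchmark` and `wilson_interval` are made-up helper names. The Wilson score interval is just one reasonable way to show how uncertain a success rate is at small n. This is not the METR time horizon metric, only the repeated-trials part.

```python
import math
import random
import statistics
import time
from typing import Callable, Optional, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def benchmark(task: Callable[[], bool], n: int = 20) -> dict:
    """Run the same task n times; report success rate and time per successful run.

    `task` is a hypothetical stand-in for "ask the model to do the job once
    and check whether the result is correct".
    """
    successes = 0
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        ok = task()
        elapsed = time.perf_counter() - start
        if ok:
            successes += 1
            durations.append(elapsed)
    low, high = wilson_interval(successes, n)
    median: Optional[float] = statistics.median(durations) if durations else None
    return {
        "runs": n,
        "successes": successes,
        "success_rate": successes / n if n else 0.0,
        "ci95": (low, high),
        "median_seconds_when_successful": median,
    }


if __name__ == "__main__":
    # Toy stand-in for a nondeterministic model: succeeds about 70% of the time.
    rng = random.Random(0)
    print(benchmark(lambda: rng.random() < 0.7, n=50))
```

With something like this, one success followed by one failure is simply 1 out of 2, and the interval makes it obvious how little that says either way.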
