zsoltkacsandi • yesterday at 9:39 PM

No one has talked about determinism. The first time, it was able to do a task; the second time, it was not. And it’s not as if the implementation details had changed in between.

Replies

baq • yesterday at 9:53 PM

This isn’t how you should be benchmarking models. You should give the model the same task n times and measure how often it succeeds and/or how long it takes to succeed (see also the 50% time horizon metric by METR).
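
A rough sketch of the "same task n times" idea, in case it helps. Everything here is an assumption for illustration: `run_trials` and `make_flaky_attempt` are hypothetical names, and the "model" is simulated with a seeded random number generator rather than a real API call. It only reports a success rate and the median time among successful runs; it does not compute METR's time horizon metric.

```python
import random
import statistics
from typing import Callable, Dict, Optional, Tuple

# One attempt returns (succeeded, seconds_taken). Hypothetical interface.
Attempt = Callable[[], Tuple[bool, float]]


def run_trials(attempt: Attempt, n: int) -> Dict[str, Optional[float]]:
    """Run the same task n times and summarize the outcomes."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    results = [attempt() for _ in range(n)]
    # Keep timings only for the runs that succeeded.
    success_times = [seconds for ok, seconds in results if ok]
    return {
        "n": n,
        "successes": len(success_times),
        "success_rate": len(success_times) / n,
        "median_seconds_to_success": (
            statistics.median(success_times) if success_times else None
        ),
    }


def make_flaky_attempt(p_success: float, seed: int) -> Attempt:
    """Stand-in for a nondeterministic model: succeeds with probability p_success."""
    rng = random.Random(seed)

    def attempt() -> Tuple[bool, float]:
        ok = rng.random() < p_success
        seconds = rng.uniform(1.0, 5.0)  # simulated wall-clock time
        return ok, seconds

    return attempt


if __name__ == "__main__":
    summary = run_trials(make_flaky_attempt(p_success=0.6, seed=0), n=20)
    print(summary)
```

With a real model you would replace `make_flaky_attempt` with a function that runs the task once and checks the result. A single pass followed by a single failure then becomes two samples out of n instead of a verdict.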
