falcor84 • yesterday at 3:36 PM

They focus on minimizing the number of moves and don't allow any harness whatsoever, putting the bar extremely high. The current top verified contender (Claude Opus 4.6) is at only 0.45%. But with how new it is, I expect a lot of improvement in the next generation of models.

Replies

threepts • yesterday at 4:00 PM

Optimal for judging actual reasoning ability rather than an LLM's ability to regurgitate knowledge from a necropost on HN/Reddit/Twitter from 2018.
