The only real-world task benchmark I know of is Scale Labs RLI

dakolli • yesterday at 9:46 PM

https://labs.scale.com/leaderboard/rli

It's clear to me these models are useless on any real-world task: a 4% pass rate on $20-30/hr Upwork tasks. This whole trend of agentic engineering is a giant money grab.
