The only real-world task benchmark I know of is Scale Labs RLI:
https://labs.scale.com/leaderboard/rli
It's clear to me these models are useless on any real-world task: a 4% pass rate on $20-30/hr Upwork tasks. This whole trend of agentic engineering is a giant money grab.