Hacker News

UltraSane, today at 6:13 AM

It isn't arbitrary. They want to measure the capability of the general LLM.


Replies

fc417fc802, today at 8:56 AM

So if I say "I want to measure your capability as a mechanic" but then also "to ensure an accurate score, you're forbidden to use any tools", how are you, the human mechanic, planning to diagnose and fix the engine problem without wrenches, jack stands, and the like? It makes no sense.

That said, their harness isn't generic. It includes a ridiculously detailed prompt for how to play this specific game. Forbidding tool use is arbitrary and, above all, pointless hoop jumping, but that doesn't make the linked "achievement" any less fraudulent.
