> this uses a harness
This seems like an arbitrary restriction. Tool-use requires a harness, and their whitepaper never defines exactly what counts as valid.
It isn't arbitrary. They want to measure the capability of the general LLM.
Right, fair, but look at the prompt. For the purpose of testing general intelligence, this seems kind of pointless.