“no harness at all” might be an issue, though, as these types of benchmarks are often gamed, and then models score great on them without actually being better models.