>hopefully changes the way benchmarking is done.
Yeah, the path forward is simple: check whether the submissions actually solve the problem. If they contain exploits, the entire result gets discarded.
In human multiple-choice tests, negative marking is sometimes used to discourage guessing. It feels like an exploit should cancel out several correct solutions.
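A rough sketch of what negative marking could look like for a benchmark scorer. The penalty ratio and the label names are purely illustrative assumptions, not any real benchmark's scheme:

```python
# Hypothetical negative-marking scorer: each detected exploit
# cancels out EXPLOIT_PENALTY correct solutions.
EXPLOIT_PENALTY = 5  # assumed ratio; would need tuning per benchmark

def score(results):
    """results: list of labels, each 'correct', 'wrong', or 'exploit'."""
    correct = results.count("correct")
    exploits = results.count("exploit")
    # Floor at zero so a model can't go negative overall.
    return max(0, correct - EXPLOIT_PENALTY * exploits)

print(score(["correct"] * 10 + ["exploit"]))  # 10 - 5*1 = 5
```

The point is just that one caught exploit should cost more than one missed problem, so gaming the harness is strictly worse than leaving the task unsolved.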
Also, fuzz your benchmarks
solution is simple:
if bug { dont }
/s
Could it really be that not only do we vibeslop all apps nowadays, but we also don't bother to check how the AI solved a benchmark it claims to have solved?