Hacker News

operatingthetan yesterday at 7:55 PM (4 replies)

>hopefully changes the way benchmarking is done.

Yeah the path forward is simple: check if the solutions actually contain solutions. If they contain exploits then that entire result is discarded.


Replies

siva7 yesterday at 8:07 PM

Could it really be that we not only vibeslop all apps nowadays, but also don't even care to check how the AI solved a benchmark it claimed to have solved?

ZeroGravitas yesterday at 8:24 PM

In multiple-choice tests for humans, negative marking is sometimes used to discourage guessing. It feels like each exploit should cancel out several correct solutions.
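
A rough sketch of what that kind of scoring could look like, assuming each result already carries a flag saying whether the solution was judged to be an exploit (how that gets detected is out of scope here). The names `Result`, `score`, and `exploit_penalty`, and the default penalty of 3, are made up for illustration and not taken from any real benchmark harness:

```python
from dataclasses import dataclass


@dataclass
class Result:
    task_id: str
    passed: bool     # the task's checks reported success
    exploited: bool  # hypothetical flag: solution was judged to game the checks


def score(results, exploit_penalty=3.0):
    """Sum +1 per honest pass and subtract exploit_penalty per exploit.

    An exploited result never earns credit, even if its checks passed.
    """
    total = 0.0
    for r in results:
        if r.exploited:
            total -= exploit_penalty
        elif r.passed:
            total += 1.0
    return total


results = [
    Result("a", passed=True, exploited=False),
    Result("b", passed=True, exploited=False),
    Result("c", passed=True, exploited=True),
    Result("d", passed=False, exploited=False),
]
print(score(results))
```

With a penalty of 3, the single exploit here outweighs both honest passes. Setting `exploit_penalty=0` gives the milder "just discard the exploited result" rule instead.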

Leynos yesterday at 8:02 PM

Also, fuzz your benchmarks

Aperocky yesterday at 11:49 PM

solution is simple:

if bug { dont }

/s