Every AI lab trains on the test set. That is a big part of why we see benchmark scores climb from 1% to 30% after a few model iterations.
Models themselves definitely aren't getting better.