My issue with AGI benchmarks is that you can never tell whether you're measuring actual capability or just how much the training data overlapped with the test set.