The full dataset is here: https://huggingface.co/datasets/AI-MO/aimo-validation-aime. To benchmark on it, you can use the eval script I have in optillm: https://github.com/codelion/optillm/blob/main/scripts/eval_a...
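
If you just want to poke at the data before running the optillm script, here is a rough sketch of loading it and scoring responses. This is not the optillm eval script. The `train` split and the `problem`/`answer` column names are my assumptions about the dataset layout, and `extract_final_integer`, `accuracy`, and `load_aime` are hypothetical helpers made up for illustration.

```python
import re
from typing import List, Optional


def extract_final_integer(text: str) -> Optional[int]:
    """Hypothetical helper: take the last integer in a response as the answer."""
    matches = re.findall(r"\d+", text)
    return int(matches[-1]) if matches else None


def accuracy(responses: List[str], references: List[str]) -> float:
    """Fraction of responses whose last integer equals the reference answer."""
    if not responses:
        return 0.0
    correct = sum(
        extract_final_integer(resp) == int(ref)
        for resp, ref in zip(responses, references)
    )
    return correct / len(responses)


def load_aime():
    """Load the dataset from the Hugging Face Hub (needs `pip install datasets`)."""
    from datasets import load_dataset

    # Assumption: the data lives in a single "train" split.
    return load_dataset("AI-MO/aimo-validation-aime", split="train")


if __name__ == "__main__":
    ds = load_aime()
    # Assumption: each row has "problem" and "answer" fields.
    print(len(ds), "problems")
    print(ds[0]["problem"][:200])
    print("reference answer:", ds[0]["answer"])
```

Taking the last integer in the response is a crude way to pull out the answer, so treat any numbers from this as a sanity check only and use the actual eval script for real benchmarking.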