OpenAI created a benchmark for this: https://openai.com/index/paperbench/
Still has data contamination though.