For those curious about a few of the metrics: besides $/token, tokens/s, latency, and context size, they use the results from:
MMLU-Pro (Reasoning & Knowledge)
GPQA Diamond (Scientific Reasoning)
Humanity's Last Exam (Reasoning & Knowledge)
LiveCodeBench (Coding)
SciCode (Coding)
HumanEval (Coding)
MATH-500 (Quantitative Reasoning)
AIME 2024 (Competition Math)
Chatbot Arena (selectively used)
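For a sense of how per-benchmark results like these might roll up into a single comparison number, here is a minimal sketch. The scores and the unweighted-mean aggregation are purely hypothetical; the site's actual weighting and normalization are not stated here.

```python
# Hypothetical illustration: combining per-benchmark scores into one index.
# Scores below are made up; real aggregation may weight/normalize differently.
benchmarks = {
    "MMLU-Pro": 0.78,
    "GPQA Diamond": 0.61,
    "Humanity's Last Exam": 0.12,
    "LiveCodeBench": 0.55,
    "SciCode": 0.34,
    "HumanEval": 0.92,
    "MATH-500": 0.88,
    "AIME 2024": 0.40,
}

# Unweighted mean, assuming all scores are already on a 0-1 scale.
index = sum(benchmarks.values()) / len(benchmarks)
print(f"aggregate index: {index:.3f}")
```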
> Humanity's Last Exam (Reasoning & Knowledge)
An article yesterday claimed that ~30% of the chemistry/biology questions on HLE were wrong, misleading, or highly contested in the scientific literature.