For those curious about a few of the metrics: besides $/token, tokens/s, latency, and context size, they use the results from:
MMLU-Pro (Reasoning & Knowledge)
GPQA Diamond (Scientific Reasoning)
Humanity's Last Exam (Reasoning & Knowledge)
LiveCodeBench (Coding)
SciCode (Coding)
HumanEval (Coding)
MATH-500 (Quantitative Reasoning)
AIME 2024 (Competition Math)
Chatbot Arena (selectively used)
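For a sense of how per-benchmark results like these might roll up into a single comparison number, here is a minimal sketch. The scores and the unweighted-mean aggregation are purely hypothetical; the site's actual weighting and normalization are not stated here.

```python
# Hypothetical illustration: combining per-benchmark scores into one index.
# Scores below are made up; real aggregation may weight/normalize differently.
benchmarks = {
    "MMLU-Pro": 0.78,
    "GPQA Diamond": 0.61,
    "Humanity's Last Exam": 0.12,
    "LiveCodeBench": 0.55,
    "SciCode": 0.34,
    "HumanEval": 0.92,
    "MATH-500": 0.88,
    "AIME 2024": 0.40,
}

# Unweighted mean, assuming all scores are already on a 0-1 scale.
index = sum(benchmarks.values()) / len(benchmarks)
print(f"aggregate index: {index:.3f}")
```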
> Humanity's Last Exam (Reasoning & Knowledge)
An article yesterday claimed that ~30% of the chemistry/biology questions on HLE were wrong, misleading, or highly contested in the scientific literature.