Hacker News

LLM leaderboard – Comparing models from OpenAI, Google, DeepSeek and others

64 points · by bookofjoe · 08/01/2025 · 39 comments

Comments

witnessme · 08/01/2025

Surprised to find that Grok 3 Mini is so economical and ranks higher than the equivalent GPT models. I run most of my agents on GPT-4.1 mini; I might switch now.

molticrystal · 08/01/2025

For those curious about the metrics: besides $/token, tokens/s, latency, and context size, they use the results from:

    MMLU-Pro (Reasoning & Knowledge)  
    GPQA Diamond (Scientific Reasoning)  
    Humanity's Last Exam (Reasoning & Knowledge)  
    LiveCodeBench (Coding)  
    SciCode (Coding)  
    HumanEval (Coding)  
    MATH-500 (Quantitative Reasoning)  
    AIME 2024 (Competition Math)  
    Chatbot Arena  (selectively used)
loehnsberg · 08/01/2025

Interesting to learn that o4-mini-high has the highest intelligence/$ score here, on par with o3-pro, which is twice as expensive and slower.

globular-toast · 08/01/2025

Whenever you present a table with sorting ability, you might as well make the first click sort ascending or descending according to whichever makes the most sense for that column. For example, I'm highly unlikely to be interested in which model has the smallest context window, but it always takes two clicks to find which one has the largest.

Sorting null values first isn't very useful either.
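The behavior this commenter is asking for can be sketched in a few lines. This is a hypothetical helper, not code from the leaderboard site; the column and field names are illustrative.

```python
def sort_models(rows, column, big_first=True):
    """Sort rows of model data so the first click shows the most useful
    order: columns where larger is more interesting (context window,
    intelligence score) start descending, and None values always go last."""
    present = [r for r in rows if r.get(column) is not None]
    missing = [r for r in rows if r.get(column) is None]
    return sorted(present, key=lambda r: r[column], reverse=big_first) + missing
```

Splitting out the rows with missing values before sorting is what keeps nulls at the bottom regardless of direction, addressing the second complaint as well.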

pogue · 08/01/2025

Look at that bar graph comparing the price of every model to Claude Opus.

It's a shame, because it's so good for coding.

https://artificialanalysis.ai/models/claude-4-opus-thinking/...

__mharrison__ · 08/01/2025

Here's my plot based on Aider benchmarks:

https://www.linkedin.com/posts/panela_important-plot-for-fol...

energy123 · 08/01/2025

You can consider the o3/o4-mini price to be half that due to flex processing. Flex gives the benefits of the batch API without the downside of waiting for a response. It's not marketed that way, but that is my experience. With 20% cache hits I'm averaging around $0.80 per million input tokens and $4 per million output tokens.
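The blended input price the commenter reports follows from simple weighting. The specific per-million prices below are assumptions for illustration, not published OpenAI rates:

```python
def blended_input_price(base_per_m, cached_per_m, cache_hit_rate):
    """Average input price per million tokens, given the fraction of
    tokens served from the prompt cache at the discounted rate."""
    return (1 - cache_hit_rate) * base_per_m + cache_hit_rate * cached_per_m

# Illustrative numbers: flex-tier input at $1.00/M, cached input at
# $0.25/M, 20% cache hits -> roughly the ~$0.80/M figure reported above.
print(blended_input_price(1.00, 0.25, 0.20))  # 0.85
```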

Garlef · 08/01/2025

Is there an option to filter the list based on the measurements, i.e., "context window > X, intelligence > Y, price < Z"? I'd love that.

It seems the only filter options available are unrelated to the measured metrics.

(I might have missed this, since the UI is a bit cluttered.)
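The threshold filter being requested is straightforward to express. A minimal sketch, with hypothetical field names standing in for the leaderboard's actual columns:

```python
def filter_models(models, min_context=0, min_intelligence=0.0,
                  max_price=float("inf")):
    """Keep only models that meet every threshold: context window > X,
    intelligence > Y, price < Z. Field names are illustrative."""
    return [
        m for m in models
        if m["context"] >= min_context
        and m["intelligence"] >= min_intelligence
        and m["price_per_m"] <= max_price
    ]
```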

dang · 08/01/2025

Related:

Benchmarks and comparison of LLM AI models and API hosting providers - https://news.ycombinator.com/item?id=39014985 - Jan 2024 (70 comments)

l5870uoo9y · 08/01/2025

It is interesting that it ranks `GPT-4.1 mini` higher than `GPT-4.1` (the latter costing five times more).

pinoy420 · 08/01/2025

[dead]

bboygravity · 08/01/2025

[flagged]

cc-d · 08/01/2025

How about adding a freedom measurement in those columns?

LeoPanthera · 08/01/2025

Is there an index for judging how much a model distorts the truth in order to comply with a political agenda?
