Hacker News

khurdula · yesterday at 6:12 PM

Yeah, we selected the models that are most commonly integrated into developer workflows and used for structured output. Those models tend to be in the low-to-mid cost range, with no or low reasoning.

The setup was kept consistent across all models for the benchmark, and Opus and 3.1 Pro would typically be overkill and expensive even with reasoning off.

Good point though, will add that to the blog too :)

Also, the benchmark is open source, so anyone can run a model on it and create a PR; the leaderboard is dynamic and will pick that up automatically.


Replies

staticshock · yesterday at 9:26 PM

The value of such a benchmark, to me, would be "what is peak performance?", not just "what is mid-tier performance?" Also, possibly, "what's the per-dollar performance?" Time and money permitting, I'd really want to see your benchmark extended to the large reasoning models.

stared · yesterday at 7:54 PM

Then the way to go is to use a Pareto frontier, e.g. https://quesma.com/benchmarks/binaryaudit/#cost
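For the "no point is strictly better on both cost and score" idea, the frontier is easy to compute: sort by cost ascending and keep each model that beats the best score seen so far. A minimal sketch (the model names and numbers below are made up for illustration, not real benchmark data):

```python
def pareto_frontier(models):
    """Keep models not dominated on (lower cost, higher score).

    `models` is a list of (name, cost, score) tuples.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by cost ascending, score descending, so ties on cost
    # keep the better-scoring model.
    for name, cost, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, cost, score))
            best_score = score
    return frontier

# Hypothetical (name, $/1M tokens, benchmark score) entries:
models = [
    ("cheap", 0.10, 60.0),
    ("mid", 0.40, 72.0),
    ("pricey-weak", 0.50, 65.0),  # dominated by "mid": costs more, scores less
    ("flagship", 3.00, 80.0),
]

print([name for name, _, _ in pareto_frontier(models)])
# → ['cheap', 'mid', 'flagship']
```

Plotting score against cost and drawing the line through these frontier points gives exactly the kind of chart the linked benchmark page shows.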

If you want to avoid Opus 4.7, then why include GPT-5.4 (unless with a disclaimer that it's on a low reasoning setting, or after checking that on medium its price is comparable to Haiku/Flash)?

Also, it's usually good to look at the newest models. Gemini 2.5 Flash is quite dated; Gemini 3.1 Flash Lite is the new one (https://openrouter.ai/google/gemini-3.1-flash-lite-preview).