https://arena.ai/leaderboard - I’ve found this company is a pretty good ranker - not sure their exact methodology but during day to day programming with Claude / gpt models I’ve felt qualitatively what they report
No way is Muse Spark generally better than offerings from Google and OpenAI. I actually find arena to be amongst the most useless indicators
I'm finding it a little hard to believe that GPT 5.5 is in 11th place for webdev, outranked by models like Kimi, Qwen, and Z.ai. I'm not saying it's not true (I have noticed GPT being less smart in recent weeks), but this is very different from my expectation.
On paper it's one of the best because it's meant to be blind comparison of your own prompts. However if you are someone who geeks hard on one or a few models, you learn their "personality" and can recognize them in a blind test.
Have you seen https://deepswe.datacurve.ai/blog? This is the closest to a vibe check i’ve felt even with the open models.
If you don't know their methodology, or anything about it why do you think its a good ranker?
Also check mine[0], basically random private tests/questions and an ok-ish methodology, testing mostly for general intelligence than coding-specific tasks.
I built it for myself, to test which models to use via OpenRouter for my n8n agents. Currently actually still using gpt-5.3-codex for many things, as its pricing is really good in production (due to how their token caching works).
Gemini models still have the best intelligence (when asked any questions, most likely to get it right), but in production they still have many failure modes[1].
[0]: https://aibenchy.com
[1]: https://news.ycombinator.com/item?id=48230368