I tried adding GPT 5.5 Pro to a vulnerability scanning benchmark I made (https://swelljoe.com/post/will-it-mythos/), and it blew through the $100 budget limit halfway through. DeepSeek V4 Pro cost about a dollar for the whole benchmark. GPT Pro cost an average of $22 per case (a case could be 1-5 files with a recent known vulnerability, usually just a single file and a prompt along the lines of "does this file have any vulnerabilities").
GPT 5.5 Pro found the bug in two of the four cases it got to before blowing its budget. Maybe it would have been the best of the bunch with infinite budget, but Opus 4.8, DeepSeek V4 Pro, and MiMo 2.5 Pro found four of the nine bugs. Opus was an order of magnitude cheaper than GPT 5.5 Pro (and something like 30% cheaper than GPT 5.5); DeepSeek and MiMo were two orders of magnitude cheaper, at roughly a dime per case.
GPT Pro also chews a lot, and for a long time, relatively speaking.
I can't come up with a use case where I can rationally spend ~31 times what Opus costs to use GPT 5.5 Pro, and I won't be doing any more benchmarking with it.
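For anyone who wants to check the arithmetic, here is a rough sketch using only the figures quoted above. The Opus per-case cost is derived from the "~31 times" ratio rather than measured, and the variable names are just illustrative.

```python
# Figures quoted in the comment; everything else is derived from them.
BUDGET = 100.0            # budget limit in dollars
GPT_PRO_PER_CASE = 22.0   # average cost per case for GPT 5.5 Pro
TOTAL_CASES = 9           # size of the benchmark
DEEPSEEK_TOTAL = 1.0      # "about a dollar for the whole benchmark"
OPUS_RATIO = 31           # "~31 times what Opus costs"


def cases_within_budget(budget: float, cost_per_case: float) -> int:
    """Number of whole cases that fit in the budget at the average cost."""
    return int(budget // cost_per_case)


completed = cases_within_budget(BUDGET, GPT_PRO_PER_CASE)
full_run_cost = GPT_PRO_PER_CASE * TOTAL_CASES
deepseek_per_case = DEEPSEEK_TOTAL / TOTAL_CASES
gpt_vs_deepseek = GPT_PRO_PER_CASE / deepseek_per_case
opus_per_case = GPT_PRO_PER_CASE / OPUS_RATIO  # derived estimate, not a measurement

print(f"cases completed within budget: {completed}")
print(f"projected full GPT 5.5 Pro run: ${full_run_cost:.0f}")
print(f"DeepSeek per case: ${deepseek_per_case:.2f}")
print(f"GPT 5.5 Pro vs DeepSeek: ~{gpt_vs_deepseek:.0f}x")
print(f"implied Opus per case: ${opus_per_case:.2f}")
```

At the average rate, four cases fit in the budget, a full nine-case run would land near $200, and the gap to DeepSeek comes out at roughly two hundred times.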
Given how much token costs are becoming an issue people talk about, the fact that there are models that cost dramatically less than the big American providers is going to be an issue for Anthropic and OpenAI. I'm happy to pay a premium (within reason) for the best model for interactive coding. But for API use, having the model repeat itself, compare against other models, have models judge other models' work, etc. is not time-consuming for a human; it's just a matter of implementing the harnesses and framework for proving correctness. There, I can't come up with a reason to spend ten or two hundred times as much as DeepSeek.
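The "have the model repeat itself" idea could look something like the sketch below: ask the same question several times and take the majority answer. This is an assumption about one possible harness, not the one used in the benchmark. `ask` stands in for whatever API call you use, and `fake_model` is a hypothetical stub so the example runs without a network.

```python
import itertools
from collections import Counter
from typing import Callable


def majority_verdict(ask: Callable[[str], str], prompt: str, runs: int = 5) -> tuple[str, float]:
    """Ask the same question `runs` times; return the most common answer
    and the fraction of runs that agreed with it."""
    answers = [ask(prompt) for _ in range(runs)]
    verdict, count = Counter(answers).most_common(1)[0]
    return verdict, count / runs


# Hypothetical stand-in for a cheap model that is not fully consistent.
_canned = itertools.cycle(["vulnerable", "clean", "vulnerable"])


def fake_model(prompt: str) -> str:
    return next(_canned)


verdict, agreement = majority_verdict(
    fake_model, "does this file have any vulnerabilities", runs=5
)
print(verdict, agreement)
```

With a model at a dime per case, five runs per case still costs a small fraction of a single GPT 5.5 Pro run, which is the trade the comment is pointing at.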
Can you include GPT 5.5 non-pro (extra high thinking, I guess) in your comparison? GPT Pro is the "I am willing to torch cash for a sooometimes slightly better result" option, not the one people are actually expected to use daily. That's probably part of the reason it's not in Codex.
Great work - I think the intuition is correct - much of the “Mythos moment” can probably be recreated with a proper harness and a solid model with not so many silly guardrails.
And nice to see the cheap models doing so well.
I have been saying that, based on multiple tests of mine, you can use Claude Code with DS4 Pro or Flash (you just swap API keys) at more or less equivalent performance, and people keep screaming that "it's not SOTA".
I don't know whether models are overfitted to benchmarks and people take them at face value, but I spend less on the DS4 API than I would on the $100 Claude Code subscription, and I code every day. So far I'm quite happy with the results.
You might be interested in this:
> With $3.88 & 690,003,591 tokens and 5 hours, Deepseek Pro & Flash combined, managed to reverse engineer Teamspeak's Licensing System for 3.13.8 (latest of post)
https://www.reddit.com/r/DeepSeek/comments/1txcfrh/with_388_...