I'm skeptical about how much of a coding advance it will be.
Seems odd that their announcement has zero coding benchmarks, with the closest related thing being terminal bench.
Maybe I'll know once I try it? Honestly, for small functions or methods, I don't think there's a huge difference between models. But the larger the code gets, the more noticeable the difference seems to be.
Personally, I think this kind of coding experience varies from person to person.
Sadly, with all the labs benchmaxxing, I feel like you just have to try the model for a while to really evaluate how good it is, especially for each individual use case.
>zero coding benchmarks
"What gets measured gets managed"
They claim extreme performance on ExploitBench, which Mythos was touted as being incredible at. https://x.com/OpenAI/status/2070555278576439306
Tracking model performance on Artificial Analysis makes me think these models are constantly optimized/tuned in some way or another. GPT 5.5 was scoring in the mid-60s when it was first released; now it's almost 10 points higher.