Hacker News

HarHarVeryFunny yesterday at 5:57 PM | 5 replies

I'm skeptical about how much of a coding advance it will be.

Seems odd that their announcement has zero coding benchmarks, with the closest related thing being Terminal-Bench.


Replies

hereme888 yesterday at 6:31 PM

Tracking model performance on Artificial Analysis makes me think these models are constantly optimized/tuned in one way or another. GPT 5.5 was scoring in the mid-60s when it was first released; now it's almost 10 points higher.

jdw64 yesterday at 6:05 PM

Maybe I'll know once I try it? Honestly, for small functions or methods, I don't think there's a huge difference between models. But the larger the code gets, the more noticeable the difference seems to be.

Personally, I think this kind of coding experience varies from person to person.

vanuatu yesterday at 6:08 PM

Sadly, with all the labs benchmaxxing, I feel like you just have to try the model for a while to really evaluate how good it is, especially for each individual use case.

MangoCoffee yesterday at 7:46 PM

>zero coding benchmarks

"What gets measured gets managed"

artursapek yesterday at 6:02 PM

They claim extreme performance on ExploitBench, which Mythos was touted as being incredible at. https://x.com/OpenAI/status/2070555278576439306
