On the new FrontierCode [1] benchmark (i.e., graded from an OSS maintainer's perspective of "would I merge this code?"):
- Opus 4.7 xhigh: 5.2%
- Opus 4.8 xhigh: 13.4%
- Fable 5 xhigh: 29.3%
Seems like a huge jump.
How credible is this benchmark? Does it correlate with others' real-world experience?
Jump in chart form: https://x.com/swyx/status/2064414823748886591/photo/1
Bummer! When can I finally and confidently get slopcode into Zig?
I am shocked at the low scores from previous models. Maybe I just have low code standards, but I've generally been vibe coding since 4.6.
That blog post really makes it look like it's graded by an LLM's estimate of an OSS maintainer's review. I see three issues:
1. That estimate could easily be wrong.
2. That estimate is, of course, usable in RL training. This isn't an inherently bad thing, and this is more or less what has improved coding models so much lately. But it does mean that other companies could and surely will do this sort of training, and Anthropic probably did too.
3. OSS maintainers are far from perfect, and there's an unfortunate uncanny valley-like effect in which a coding model can produce code that is just convincing enough to pass review even though it's actually totally wrong. I don't know whether this is a specific issue here.