jkelleyrtp • yesterday at 5:10 PM • 7 replies

On the new FrontierCode [1] benchmark (i.e., graded from an OSS maintainer's perspective of "would I merge this code?"):

- Opus 4.7 xhigh: 5.2%

- Opus 4.8 xhigh: 13.4%

- Fable 5 xhigh: 29.3%

Seems like a huge jump.

[1] https://cognition.ai/blog/frontier-code
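
For a sense of scale, here is a quick sketch of the relative jumps between the three scores quoted above. The dictionary only restates those numbers; `relative_jump` and the variable names are made up for illustration and are not from the benchmark itself.

```python
# Scores exactly as quoted in the post above (percentages).
scores = {
    "Opus 4.7 xhigh": 5.2,
    "Opus 4.8 xhigh": 13.4,
    "Fable 5 xhigh": 29.3,
}


def relative_jump(old: float, new: float) -> float:
    """Return the new score as a multiple of the old one."""
    return new / old


# Compare each model with the one listed just before it.
names = list(scores)
for prev, curr in zip(names, names[1:]):
    diff = scores[curr] - scores[prev]
    ratio = relative_jump(scores[prev], scores[curr])
    print(f"{prev} -> {curr}: +{diff:.1f} points, {ratio:.2f}x")
```

Each step is a bit more than a doubling, which is why the absolute gap widens even though the multiplier stays roughly the same.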

Replies

amluto • yesterday at 5:39 PM

That blog post really makes it look like it's graded by an LLM's estimate of an OSS maintainer's review. I see three issues:

1. That estimate could easily be wrong.

2. That estimate is, of course, usable in RL training. This isn't an inherently bad thing, and this is more or less what has improved coding models so much lately. But it does mean that other companies could and surely will do this sort of training, and Anthropic probably did too.

3. OSS maintainers are far from perfect, and there's an unfortunate uncanny valley-like effect in which a coding model can produce code that is just convincing enough to pass review even though it's actually totally wrong. I don't know whether this is a specific issue here.

zzleeper • yesterday at 5:25 PM

How credible is this benchmark? Does it correlate with others' real-world experience?

swyx • yesterday at 6:53 PM

The jump in chart form: https://x.com/swyx/status/2064414823748886591/photo/1

OtomotO • yesterday at 7:00 PM

Bummer! When can I finally and confidently get slopcode into Zig?

DonsDiscountGas • yesterday at 9:54 PM

I am shocked at the low scores from previous models. Maybe I just have low code standards, but I've generally been vibe coding since 4.6.
