Hacker News

jkelleyrtp yesterday at 5:10 PM | 7 replies

On the new FrontierCode [1] benchmark (i.e., graded from an OSS maintainer's perspective of "would I merge this code?"):

- Opus 4.7 xhigh: 5.2%

- Opus 4.8 xhigh: 13.4%

- Fable 5 xhigh: 29.3%

Seems like a huge jump.

[1] https://cognition.ai/blog/frontier-code


Replies

amluto yesterday at 5:39 PM

That blog post really makes it look like it's graded from an LLM's estimation of an OSS maintainer's review. I see three issues:

1. That estimate could easily be wrong.

2. That estimate is, of course, usable in RL training. This isn't an inherently bad thing, and this is more or less what has improved coding models so much lately. But it does mean that other companies could and surely will do this sort of training, and Anthropic probably did too.

3. OSS maintainers are far from perfect, and there's an unfortunate uncanny valley-like effect in which a coding model can produce code that is just convincing enough to pass review even though it's actually totally wrong. I don't know whether this is a specific issue here.

zzleeper yesterday at 5:25 PM

How credible is this benchmark? Does it correlate with others' real-world experience?

OtomotO yesterday at 7:00 PM

Bummer! When can I finally and confidently get slopcode into Zig?

DonsDiscountGas yesterday at 9:54 PM

I am shocked at the low scores from previous models. Maybe I just have low code standards, but I've generally been vibe coding since 4.6.

hydra-f yesterday at 5:17 PM

Yes, and the price reflects that.

m3kw9 yesterday at 5:32 PM

FrontierCode is likely paid for by Anthropic.
