vessenes • yesterday at 10:11 PM • 1 reply

This looks great. Well reasoned, tons of work put into eval, thanks for building it.

It strikes me as kind of wild that good evals can drive tens to hundreds of millions of dollars of compute deployment; there’s something new and collaborative and competitive about the eval / frontier model race that’s quite interesting.

In this case “shorter, actually mergeable patches that open source maintainers would accept” feels like a great thing to deliver to the world.

I didn’t deep dive into good and bad patches, but I wonder if swyx or others on the team have predictions on saturation. Both when, and how useful will it be? That is, do you guys think this test is broad enough as written to get better behavior out of models, and if there is saturation on this test, will we see generalized better patch / coding behavior?

Replies

swyx • yesterday at 10:20 PM

thanks - credit to silas, eric, ben, and team for the depth of the evals, and the rest of the research team for doing the transcript reading parties lol

by nature of being based on open source, frontiercode public will saturate very very quickly. frontiercode main will be >80% in less than a year. hopefully diamond will last a bit longer. we can do annual refreshes, but that's not my strategy for staying relevant. what i'm more excited to get funding for is a private held-out version of frontiercode based on repros of real enterprise customer problems. in an ideal agent lab (https://latent.space/p/agent-labs) you meticulously build up this domain understanding, and that is essentially why both model labs and serious customers come to you.
