Hacker News

stephc_int13 yesterday at 9:41 PM

This is a nice benchmark IMO. I would be curious to see how competitors and improved models would compare.


Replies

NitpickLawyer yesterday at 9:55 PM

And how long will it take before an open model recreates this? The "vibe" consensus before "thinking" models really took off was that open was ~6mo behind SotA. With the massive RL improvements, over the past 6 months I've thought the gap was actually increasing. This will be a nice little verifiable test going forward.