logoalt Hacker News

S1M0N38-hntoday at 12:49 AM0 repliesview on HN

Hi, BalatroBench creator here. Yeah, Google models perform well (I guess the long context + world knowledge capabilities). Opus 4.6 looks good on preliminary results (on par with Gemini 3 Pro). I'll add more models and report soon. Tbh, I didn't expect LLMs to start winning runs. I guess I have to move to harder stakes (e.g. red stake).