S1M0N38-hn • today at 12:49 AM • 0 replies

Hi, BalatroBench creator here. Yeah, Google models perform well (I guess thanks to their long context + world knowledge capabilities). Opus 4.6 looks good in preliminary results (on par with Gemini 3 Pro). I'll add more models and report back soon. Tbh, I didn't expect LLMs to start winning runs. I guess I'll have to move to harder stakes (e.g. Red Stake).
