[flagged] | alt Hacker News

behnamoh • yesterday at 6:35 PM • 2 replies

[flagged]

Replies

I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.

quinnjh • yesterday at 7:26 PM

the field is advancing so fast it's hard to do real science, as there will be a new SOTA by the time you're ready to publish results. i think this is a combination of that and people having a laugh.

Would you mind sharing which benchmarks you think are useful measures for multimodal reasoning?
