Hacker News

behnamoh yesterday at 6:35 PM

[flagged]


Replies

smokel yesterday at 7:25 PM

I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.

quinnjh yesterday at 7:26 PM

The field is advancing so fast that it's hard to do real science, as there will be a new SOTA by the time you're ready to publish results. I think this is a combination of that and people having a laugh.

Would you mind sharing which benchmarks you think are useful measures for multimodal reasoning?
