euphetar • today at 12:37 AM

I wouldn't call it a benchmark since it's just one sample. They do highlight a real problem, though. Computer use is immature right now and far behind language agents.

Try playing Fruit Ninja via text and LLM tool calls, though.