Hacker News

threepts, yesterday at 4:00 PM

Optimal for judging actual reasoning ability rather than an LLM's ability to regurgitate knowledge from a necropost on HN/Reddit/Twitter from 2018.


Replies

jjmarr, yesterday at 9:03 PM

I'm making an LLM agent that can play DS games. The biggest blocker is clicking on the right spot to move things around in space rather than reasoning abilities.

Arc AGI seems to test that as well. Every game is a rectangular grid to make it as easy as possible, yet the AIs still fail.

I'm fairly certain the way forward isn't through agents directly interfacing with UIs but through agents using scripts and other tools to interact with the interface. That's why harnesses are so critical to performance on tasks like this.

I would like a version of Arc AGI that tests the agent's ability to dynamically create these harnesses.
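
To make the "scripts instead of raw clicks" idea concrete, here is a minimal sketch, assuming a rectangular grid drawn at a known pixel origin with a fixed cell size. The agent names a cell by row and column, and a helper turns that into the pixel at the cell's center. `GridSpec` and `cell_center` are hypothetical names for illustration, not part of Arc AGI or any emulator API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridSpec:
    """Hypothetical description of an on-screen grid (all values in pixels)."""
    origin_x: int  # left edge of column 0
    origin_y: int  # top edge of row 0
    cell_w: int
    cell_h: int
    rows: int
    cols: int


def cell_center(grid: GridSpec, row: int, col: int) -> tuple[int, int]:
    """Return the (x, y) pixel at the center of a cell; reject out-of-range cells."""
    if not (0 <= row < grid.rows and 0 <= col < grid.cols):
        raise ValueError(
            f"cell ({row}, {col}) is outside a {grid.rows}x{grid.cols} grid"
        )
    x = grid.origin_x + col * grid.cell_w + grid.cell_w // 2
    y = grid.origin_y + row * grid.cell_h + grid.cell_h // 2
    return x, y
```

With a tool like this, the model only has to decide which cell it wants. The pixel arithmetic, and the error when it asks for a cell that does not exist, live in the harness.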

knollimar, yesterday at 4:12 PM

A small harness that stores text files and manages context could be useful. Otherwise you lose all ability to measure that skill (and that's important because it represents real-world use cases on large code bases).
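
A rough sketch of what such a harness might look like, assuming notes are kept as plain `.txt` files in one directory and the context budget is counted in characters rather than tokens. `NotesHarness`, `max_chars`, and `build_context` are hypothetical names, not taken from any existing benchmark or library.

```python
from pathlib import Path


class NotesHarness:
    """Hypothetical harness: persists notes as text files, builds a bounded context."""

    def __init__(self, root: Path, max_chars: int = 2000) -> None:
        self.root = root
        self.max_chars = max_chars  # assumed budget, in characters
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, name: str, text: str) -> None:
        """Store (or overwrite) a note under <root>/<name>.txt."""
        (self.root / f"{name}.txt").write_text(text, encoding="utf-8")

    def read(self, name: str) -> str:
        """Load a previously stored note."""
        return (self.root / f"{name}.txt").read_text(encoding="utf-8")

    def build_context(self, names: list[str]) -> str:
        """Concatenate notes in order, stopping before the budget is exceeded."""
        parts: list[str] = []
        used = 0
        for name in names:
            chunk = f"## {name}\n{self.read(name)}\n"
            if used + len(chunk) > self.max_chars:
                break
            parts.append(chunk)
            used += len(chunk)
        return "".join(parts)
```

The point of keeping it this small is that the agent still has to decide what to write down and what to load back in, which is the skill being measured.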