Hacker News

chasd00 · 10/11/2024 · 3 replies

What happens if you sit down and invent a logic game that is brand new and has never been documented anywhere before, then ask an LLM to solve it? That, to a layman like me, seems like a good way to measure reasoning in AI.


Replies

Analemma_ · 10/11/2024

You can do this, but at that point what are you really benchmarking? If you invent a de novo logic puzzle and give it to 100 people on the street, most of them won't be able to solve it either. If your aim is to prove "LLMs can't really think like humans can!", this won't accomplish that.

jprete · 10/11/2024

I think the problem is inventing new structures for logic games. Ideally the shape of the problem would be different from any existing puzzle, and that's hard. If a person can look at it and say "oh, that's just the sheep-wolf-cabbage/liar-and-truthteller/etc. problem with extra features", then it's not an ideal test, because it can be pattern-matched.

layer8 · 10/11/2024

This is being done, but the difficulties are: (1) How do you assess that it is really brand-new and not just a slight variation of an existing one? (2) Once you publish it, it stops being brand-new, so its lifetime is limited and you can’t build a longer-term reproducible test out of it.
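
As a rough illustration of one way around difficulty (2), here is a minimal Python sketch of a "publish the template, regenerate the instance" harness: only the puzzle structure is fixed, each seed produces a fresh uniquely solvable instance, and the model's answer is graded against the known assignment. The people/token framing, the exclusion-clue format, and the `ask_llm` placeholder are all hypothetical choices for the sketch, not anything from the thread or from a specific benchmark.

```python
import itertools
import random

PEOPLE = ["Ari", "Bo", "Cy", "Dee"]
COLORS = ["red", "blue", "green", "gold"]


def count_solutions(clues):
    """Brute-force count of assignments consistent with the exclusion clues."""
    n = 0
    for perm in itertools.permutations(COLORS):
        candidate = dict(zip(PEOPLE, perm))
        if all(candidate[person] != color for person, color in clues):
            n += 1
    return n


def make_instance(seed):
    """Generate a fresh, uniquely solvable instance of a toy assignment puzzle.

    Clues are random "X does not own the Y token" exclusions, added until
    brute force confirms exactly one assignment satisfies them, so each seed
    yields a new instance with a known answer.
    """
    rng = random.Random(seed)
    answer = dict(zip(PEOPLE, rng.sample(COLORS, len(COLORS))))
    clues = []
    while count_solutions(clues) != 1:
        person = rng.choice(PEOPLE)
        color = rng.choice([c for c in COLORS if c != answer[person]])
        if (person, color) not in clues:
            clues.append((person, color))
    prompt = (
        "Four people (Ari, Bo, Cy, Dee) each own exactly one distinct token "
        "(red, blue, green, gold).\n"
        + "\n".join(f"- {p} does not own the {c} token." for p, c in clues)
        + "\nWho owns which token? Answer with one 'Name: color' line per person."
    )
    return prompt, answer


def grade(model_output, answer):
    """Check a model's 'Name: color' lines against the known assignment."""
    parsed = {}
    for line in model_output.splitlines():
        if ":" in line:
            name, color = (part.strip() for part in line.split(":", 1))
            parsed[name] = color.lower()
    return parsed == answer


if __name__ == "__main__":
    prompt, answer = make_instance(seed=42)
    print(prompt)
    # ask_llm(prompt) is a placeholder for whatever model API you use;
    # grade(ask_llm(prompt), answer) would score the response.
```

Regenerating instances per seed keeps the harness reproducible even after the template is public, though the underlying structure can still be pattern-matched, which is jprete's point above.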