You can do this, but at that point what are you really benchmarking? If you invent a de novo logic puzzle and give it to 100 people on the street, most of them won't be able to solve it either. If your aim is to prove "LLMs can't really think like humans can!", a test that most humans would also fail won't accomplish that.