Analyzing frontier LLM performance on my favorite daily puzzle game (https://www.nicksypteras.com/blog/cbs-benchmark.html). The next step is to assess how well the LLMs can create their own new, logically satisfiable puzzles in the same style. Then I'll have them battle it out, with one model creating a puzzle and the other attempting to solve it!
Thanks for sharing! I want to add some sort of agentic "helper" to my new puzzle website [1], and I picked up some tips from your post and code.
Have you given any thought to how to create the puzzles? Do you think it'd be possible to create them using LLMs?
[1]: https://www.puzzleship.com