It would be more interesting to make it build a chess engine and compare it against Stockfish. The chess engine should be a standalone no-dependencies C/C++ program that fits in NNN lines of code.
Comparing against Stockfish isn't fair. That's comparing against enormous amounts of compute spent experimenting with strategies, training neural nets, etc.
It will lose so badly there will be no point in the comparison.
Besides, you could compare models (and harnesses) directly against each other.
oh that is super interesting. ty for the idea!
My back-of-the-envelope guess would be that 99% of LLMs given the task to build a chess engine would probably just end up implementing a flavor of negamax and calling it a day.
https://en.wikipedia.org/wiki/Negamax
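For anyone who hasn't seen it, the core of negamax fits in a few lines. Here's a rough sketch in C on a toy game I picked as a stand-in (a pile of stones, take 1 to 3 per turn, whoever takes the last stone wins), not chess. The function name and the game are my own assumptions for illustration; a real engine would need move generation, an evaluation function, and a depth limit on top, and would usually add alpha-beta pruning.

```c
#include <assert.h>

/* Toy game (assumed for illustration, not chess): a pile of stones,
   each move removes 1 to 3, and whoever takes the last stone wins. */

/* Returns +1 if the side to move wins with best play, -1 if it loses. */
int negamax(int stones)
{
    assert(stones >= 0);

    /* No stones left: the opponent took the last one, so we lost. */
    if (stones == 0)
        return -1;

    int best = -1;
    for (int take = 1; take <= 3 && take <= stones; take++) {
        /* The child's score is from the opponent's point of view,
           so negate it to get ours. This is the whole trick. */
        int score = -negamax(stones - take);
        if (score > best)
            best = score;
    }
    return best;
}
```

The single negation replaces the separate "max" and "min" branches of plain minimax, which is probably why it's the default thing to reach for. This version searches the full game tree with no pruning or caching, so it only makes sense for small piles.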