I was recently trying to build an AI assistant to help with various chess things, which took the shape of an MCP server: https://github.com/shelajev/mcp-stockfish
It builds as a Docker image that bundles stockfish and maia (maiachess.com) together with different weights, so it can simulate lower-rated players.
It was a fun exercise. I tried a bunch of local models with this MCP server, which isn't particularly optimized but also doesn't seem that bad, and the results were quite disappointing: they would often invent chess-related reasoning and mess up their answers, even when you'd expect them to rely on the tools and the true evaluations they provide.
It was also fun to say things like: fetch a random game by username 'X' from lichess, analyze it, and find positions that make good puzzles for a player rated N,
and see it figure out the algorithm of tool calls:
- fetch the game
- feed the moves to stockfish
- find the moves where the evaluation changed sharply
- feed those positions to maia at a strength around N, and to stockfish
- if the two disagree, it's probably a good puzzle.
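Roughly, that pipeline could also be scripted directly. Here is a minimal sketch, assuming python-chess, a local stockfish binary, and maia weights run through lc0; the lichess export parameters, thresholds, and engine limits are my guesses, not anything taken from mcp-stockfish:

    import io

    import chess
    import chess.engine
    import chess.pgn
    import requests

    SHARP_SWING_CP = 150  # centipawn swing that counts as "the evaluation changed sharply"

    def fetch_latest_game_pgn(username: str) -> str:
        # Lichess can export a user's games as PGN; max=1 grabs the most recent one
        # (a real version would pick a random game instead).
        r = requests.get(
            f"https://lichess.org/api/games/user/{username}",
            params={"max": 1, "moves": True},
            headers={"Accept": "application/x-chess-pgn"},
            timeout=30,
        )
        r.raise_for_status()
        return r.text

    def find_puzzle_candidates(pgn_text: str, stockfish_path: str, maia_cmd: list) -> list:
        game = chess.pgn.read_game(io.StringIO(pgn_text))
        stockfish = chess.engine.SimpleEngine.popen_uci(stockfish_path)
        maia = chess.engine.SimpleEngine.popen_uci(maia_cmd)  # e.g. lc0 with maia weights

        candidates = []
        board = game.board()
        prev_cp = None
        for move in game.mainline_moves():
            info = stockfish.analyse(board, chess.engine.Limit(depth=18))
            cp = info["score"].white().score(mate_score=10000)
            if prev_cp is not None and abs(cp - prev_cp) >= SHARP_SWING_CP:
                # The evaluation swung sharply: ask the human-like engine and stockfish
                # what they would play here; if they disagree, keep it as a puzzle candidate.
                maia_move = maia.play(board, chess.engine.Limit(nodes=1)).move
                best_move = stockfish.play(board, chess.engine.Limit(depth=18)).move
                if maia_move != best_move:
                    candidates.append((board.fen(), best_move, maia_move))
            prev_cp = cp
            board.push(move)

        stockfish.quit()
        maia.quit()
        return candidates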
I don't think I ever got a setup like that working, even with managed cloud models: various small issues like timeouts on the MCP calls, general unreliability, etc. Then I lost interest and abandoned the idea.
I should try again after seeing this thread
It is shockingly difficult to use LLMs in chess study. I don't need one to be a better (or worse) Stockfish; an LLM should be great at taking a FEN or multiple lines from Stockfish via MCP or a tool call and explaining why positions are evaluated the way they are, what the typical plans in the position are (drawing on pretraining knowledge of a vast archive of games), and how a human should go about studying these positions.
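The tool-call half of that is straightforward. A minimal sketch, assuming python-chess and a local stockfish binary (the depth and multipv settings are arbitrary), that pulls a few principal variations for a FEN in a form you could paste into a prompt:

    import chess
    import chess.engine

    def engine_lines_for_prompt(fen: str, stockfish_path: str = "stockfish", multipv: int = 3) -> str:
        board = chess.Board(fen)
        engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
        # With multipv set, analyse() returns one info dict per principal variation.
        infos = engine.analyse(board, chess.engine.Limit(depth=20), multipv=multipv)
        engine.quit()

        lines = []
        for i, info in enumerate(infos, start=1):
            cp = info["score"].pov(board.turn).score(mate_score=100000)
            pv = board.variation_san(info["pv"])
            lines.append(f"Line {i}: {cp:+d} cp for the side to move, {pv}")
        # This text, plus the FEN, is what gets handed to the model to explain.
        return f"FEN: {fen}\n" + "\n".join(lines)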
I suspect that the large amount of chess pretraining data is not well synchronized with the positions, because in books and articles the text is typically accompanied by diagrams of the positions, NOT FENs / PGNs. So the text the model trains on is decoupled from any explicit representation of the position.
Regarding your tool-call pipeline with stockfish/maia: I made a tool like this for myself called Blunder Sniper. It starts from positions I'm likely to reach given my openings, recursively calls the lichess DB, and finds the first point in each chain (following the top 80% of played moves) where opponents in my rating range blunder as the most common move.
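A hedged reconstruction of that idea, not the actual Blunder Sniper code: walk the lichess opening explorer (the explorer.lichess.ovh endpoint and its parameters are from memory) from a starting position, follow the moves that make up the top 80% of games, and stop a chain the first time the opponents' most common reply loses more than some threshold according to a quick Stockfish check:

    import chess
    import chess.engine
    import requests

    EXPLORER_URL = "https://explorer.lichess.ovh/lichess"
    BLUNDER_CP = 200    # evaluation swing that counts as a blunder
    COVERAGE = 0.80     # follow moves until ~80% of games in the position are covered
    MAX_PLIES = 12      # give up on a chain after this many plies

    def explorer_moves(fen: str, ratings: str = "1600,1800", speeds: str = "blitz,rapid") -> list:
        r = requests.get(
            EXPLORER_URL,
            params={"variant": "standard", "fen": fen, "ratings": ratings, "speeds": speeds},
            timeout=30,
        )
        r.raise_for_status()
        # Each move comes back with per-move game counts; assumed sorted by popularity.
        return r.json().get("moves", [])

    def eval_cp(engine, board: chess.Board) -> int:
        info = engine.analyse(board, chess.engine.Limit(depth=14))
        return info["score"].pov(board.turn).score(mate_score=100000)

    def snipe(board: chess.Board, engine, my_color: bool, depth: int = 0, found=None) -> list:
        if found is None:
            found = []
        if depth >= MAX_PLIES:
            return found
        moves = explorer_moves(board.fen())
        if not moves:
            return found

        if board.turn != my_color:
            # Opponent to move: is their most common reply a blunder?
            top = moves[0]
            before = eval_cp(engine, board)             # from the opponent's perspective
            board.push(chess.Move.from_uci(top["uci"]))
            after = -eval_cp(engine, board)             # flip back to the opponent's perspective
            board.pop()
            if before - after >= BLUNDER_CP:
                found.append((board.fen(), top["san"]))
                return found                            # first blunder in this chain: stop here

        # Otherwise keep walking the moves that make up the top 80% of games.
        total = sum(m["white"] + m["draws"] + m["black"] for m in moves) or 1
        covered = 0.0
        for m in moves:
            if covered >= COVERAGE:
                break
            covered += (m["white"] + m["draws"] + m["black"]) / total
            board.push(chess.Move.from_uci(m["uci"]))
            snipe(board, engine, my_color, depth + 1, found)
            board.pop()
        return found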
It was a fun alternative to the engine-based preparation that many strong players use, which is something like Nibbler + lc0 with high contempt values to find higher-variance lines rather than game-theory-optimal ones.
Some day I'll expand on the gpt-chess articles [0] that I found super interesting, fine-tune models... well, I keep telling myself that, anyway...
[0]: https://dynomight.net/more-chess/