That’s true. I’m trying to figure out a better testing environment with a feedback loop.
I did try letting the models iterate on the bot code based on a summary of an end-of-game ‘report’, but that showed only marginal improvements over zero-shot.