If you're interested in not reinventing the sandbox for LLMs, consider Judge0: https://judge0.com/
I have absolutely no relation to the project except for the fact that I went to the same Uni as the creator.
I'm using Judge0 for a LeetCode clone I'm working on. Never thought of using it in the context of LLMs, though.
That one looks pretty good - it's been around since 2016, so I'm surprised I haven't encountered it before.
It's not quite right for what I'm after because you can't just "pip install" it on multiple platforms.