How do you test it across different workloads and are you running it in a datacenter or cloud provider?
I forgot to mention it, but the other major problem I underestimated was giving AIs that call each other permission to potentially spend lots of money, in ways I had no good way to monitor and didn't want to actively watch. So I wanted to set budgets and have them passed down to children, and realized that meant building a pretty complicated billing/scheduling system, with a way to keep the part that holds all the permissions and money safe from the AI doing AI stuff on its own, plus setting up NAT, firewalls, and all this other stuff.
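To make the "budgets passed to children" idea concrete, here is a minimal sketch in Python. It is not the actual system: the `Budget` class, its field names, and the rule that a child's allowance is reserved out of the parent's remaining balance are all assumptions made for illustration. Amounts are integer cents to avoid float rounding.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a spend or a child allocation exceeds what is left."""


@dataclass
class Budget:
    # Hypothetical structure: one Budget per agent/process in the call tree.
    limit_cents: int
    spent_cents: int = 0
    reserved_cents: int = 0  # total handed down to children so far

    @property
    def remaining(self) -> int:
        # What this holder can still spend itself or hand to new children.
        return self.limit_cents - self.spent_cents - self.reserved_cents

    def spend(self, cents: int) -> None:
        # Refuse the spend up front instead of discovering the overrun later.
        if cents > self.remaining:
            raise BudgetExceeded(f"need {cents}, only {self.remaining} left")
        self.spent_cents += cents

    def child(self, limit_cents: int) -> "Budget":
        # A child's allowance is carved out of the parent's remaining balance,
        # so the whole tree can never exceed the root limit.
        if limit_cents > self.remaining:
            raise BudgetExceeded(f"cannot reserve {limit_cents}, only {self.remaining} left")
        self.reserved_cents += limit_cents
        return Budget(limit_cents=limit_cents)


root = Budget(limit_cents=1000)
worker = root.child(400)
worker.spend(150)
```

The point of reserving at allocation time is that a child can retry or loop as much as it likes and still never cost more than the slice it was given. In a real setup the check would have to live in the trusted part that holds the credentials, not in code the agents can modify.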
If every child can loop back up to its parent, and everything can run stuff from the Internet, make expensive resource decisions, and get restarted if it fails, then it might never converge on being done, or it might get infected, or just mess up and spend a lot of money. I ask about the testing matrix/driver you're using because that's where I realized there was a lot of work and cost involved in getting that part working well enough to run real workloads.
I have a 'node/container' abstraction at the infra/engine layer which is essentially either a cloud VM or a local podman container. The engine/infra layer can spin up more of these as needed. I have a relatively beefy dedicated machine for working with AI, which is where I do most of the testing.
I aggressively try to keep costs down, so the workflow DSL I have supports configurable limits, which can be set along the $, token, or time dimension, at the task, workflow, and agent/node levels, with some sane defaults. I have a pipeline which keeps LLM API pricing data up-to-date, and I use AI to estimate total costs before runs and manually approve those.
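As a rough illustration of limits along the $, token, and time dimensions at several levels, here is a sketch in Python rather than the DSL itself. Everything in it is hypothetical: the `Limits` class, the `effective_limits` helper, the default values, and the precedence rule (task overrides workflow, which overrides agent/node, which overrides defaults) are assumptions, not a description of the real implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Limits:
    # None means "not set at this level, inherit from a broader one".
    max_usd: Optional[float] = None
    max_tokens: Optional[int] = None
    max_seconds: Optional[float] = None

    def merged_over(self, fallback: "Limits") -> "Limits":
        # Fields set here win; unset fields fall back to the broader scope.
        def pick(own, other):
            return own if own is not None else other
        return Limits(
            max_usd=pick(self.max_usd, fallback.max_usd),
            max_tokens=pick(self.max_tokens, fallback.max_tokens),
            max_seconds=pick(self.max_seconds, fallback.max_seconds),
        )

    def violations(self, usd: float, tokens: int, seconds: float) -> List[str]:
        # Names of the dimensions whose limit has been exceeded.
        out = []
        if self.max_usd is not None and usd > self.max_usd:
            out.append("usd")
        if self.max_tokens is not None and tokens > self.max_tokens:
            out.append("tokens")
        if self.max_seconds is not None and seconds > self.max_seconds:
            out.append("seconds")
        return out


# Made-up default values, only for the example.
DEFAULTS = Limits(max_usd=5.0, max_tokens=200_000, max_seconds=600.0)


def effective_limits(task: Limits, workflow: Limits, agent: Limits) -> Limits:
    # Assumed precedence: the most specific level wins.
    return task.merged_over(workflow).merged_over(agent).merged_over(DEFAULTS)


eff = effective_limits(
    task=Limits(max_usd=1.0),
    workflow=Limits(max_tokens=50_000),
    agent=Limits(),
)
exceeded = eff.violations(usd=0.5, tokens=60_000, seconds=10.0)
```

Here the task only sets a dollar cap, the workflow only sets a token cap, and the time cap comes from the defaults, so a run that has used 60,000 tokens trips the token limit alone.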