Generally, when I want to run something with that much parallelism, I just write a small Go program instead and let Go's runtime handle the scheduling. It works remarkably well, and there's no execve() overhead either.
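A minimal sketch of that pattern, for illustration only: a fixed pool of goroutines pulling work from a channel, with the Go runtime scheduling them across cores. The `processAll` helper and the squaring workload are hypothetical stand-ins, not anything from forkrun or from a real job.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// processAll (hypothetical helper) applies fn to every input using a fixed
// pool of worker goroutines and returns the results in input order.
func processAll(inputs []int, workers int, fn func(int) int) []int {
	if workers < 1 {
		workers = 1 // avoid blocking forever when no worker would drain the channel
	}
	results := make([]int, len(inputs))
	jobs := make(chan int) // carries indexes into inputs

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is handed to exactly one worker, so no lock is needed.
				results[i] = fn(inputs[i])
			}
		}()
	}

	for i := range inputs {
		jobs <- i
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()
	return results
}

func main() {
	inputs := []int{1, 2, 3, 4, 5, 6, 7, 8}
	// Placeholder workload: square each number in-process, with no execve() per item.
	squares := processAll(inputs, runtime.NumCPU(), func(x int) int { return x * x })
	fmt.Println(squares)
}
```

Everything runs inside one process, so the per-item cost is a channel send rather than a fork and exec. Nothing here pins goroutines to cores or NUMA nodes.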
AFAIK, the Go runtime is pretty NUMA-oblivious. The mcache helps a bit with locality of small allocations, but otherwise you aren't going to get the same benefits (though I absolutely hear you about avoiding execve overhead).
dang and u did all that without a 10 year journey
So, there are a few reasons why forkrun might work better than this, depending on the situation:
1. If what you want to run is built to be called from a shell (including multi-step shell functions) and not Go. This is the main appeal of forkrun in my opinion: extreme performance without needing to rewrite anything.
2. If you are running on NUMA hardware. Forkrun deals with NUMA hardware remarkably well: it distributes work between nodes almost perfectly, with almost zero cross-node traffic.