I don't think there are any fundamental bottlenecks here. There's more scheduling overhead when you have a hundred processes on a single core than if you have a hundred processes on one hundred cores.
The bottlenecks are pretty much hardware-related - thermal, power, memory and other I/O. Because of this, you presumably never get true "288 core" performance out of it - as in, it's not going to mine Bitcoin 288× as fast as a single core. Instead, you get less context-switching overhead with 288 tasks that each need to do stuff intermittently, which is how most hardware ends up being used anyway.
> I don't think there are any fundamental bottlenecks here.
Your memory only has so much bandwidth, and now it's shared by even more cores.
Maybe there are no fundamental bottlenecks, but it's easy to accidentally write software that doesn't scale as linearly as it should - e.g. if there's suddenly more lock contention than you were expecting, or, in a more extreme case, if you have something that's O(n^2) in time or space, where n is the core count.
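To make the lock-contention point concrete, here's a minimal Python sketch (function names are mine, and CPython's GIL makes it an imperfect model of true multicore scaling): one version funnels every update through a single shared lock, the other keeps per-thread state and merges at the end. The first serializes on the lock as thread count grows; the second doesn't.

```python
import threading
import time

def contended(n_threads, iters=50_000):
    # One global lock: every increment serializes behind it,
    # so adding threads adds contention, not throughput.
    lock = threading.Lock()
    counter = [0]

    def work():
        for _ in range(iters):
            with lock:
                counter[0] += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0], time.perf_counter() - t0

def sharded(n_threads, iters=50_000):
    # Per-thread accumulators merged once at the end: no shared
    # lock on the hot path, so threads don't fight each other.
    counters = [0] * n_threads

    def work(i):
        local = 0
        for _ in range(iters):
            local += 1
        counters[i] = local

    threads = [threading.Thread(target=work, args=(i,))
               for i in range(n_threads)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counters), time.perf_counter() - t0
```

Both produce the same final count; the difference shows up in how the elapsed time grows with the thread count. On a 288-core part that difference is exactly the kind of accidental non-linearity the parent comment is describing.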