A customer of mine has two projects. One runs on their own hardware with Django + Celery. The other runs on AWS EC2 with Django alone.
In the first one we use Celery to run jobs that can last anywhere from a few seconds to several minutes. In the other one we create a new VM, have it run the job, and have it destroy itself when the job finishes. Communication goes through a shared database and SQS queues.
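To give an idea of the VM side, here is a minimal sketch of a worker that picks up one job from SQS, runs it, and then powers the machine off. This is not our actual code: the function names, the JSON message format, and the use of `shutdown` for the self-destroy step are all assumptions for illustration. The `sqs` argument is expected to behave like `boto3.client("sqs")`.

```python
import json
import subprocess


def _power_off():
    # Assumption: powering off is enough to destroy the VM, e.g. because the
    # instance was launched with shutdown behavior set to "terminate".
    subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)


def run_one_job(sqs, queue_url, handler, shutdown=None):
    """Receive at most one job message, run it, then power the VM off.

    `handler` is a hypothetical callable that takes the decoded job payload.
    `shutdown` can be overridden, which is handy in development.
    """
    try:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling
        )
        for msg in resp.get("Messages", []):
            handler(json.loads(msg["Body"]))
            # Delete only after the handler succeeded, so a crashed job
            # becomes visible on the queue again after the visibility timeout.
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
            )
    finally:
        # The VM goes away whether the job succeeded or failed.
        (shutdown or _power_off)()
```

In production this would be called with something like `run_one_job(boto3.client("sqs"), queue_url, handler)`. There is no long-lived worker process to get stuck, which is probably a big part of why this setup has been quiet for us.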
We have periodic problems with Celery: workers losing their connection to RabbitMQ, Celery itself getting stuck, and gevent issues that may be caused by C libraries, though we can't be sure (we use prefork for some workers but not for everything).
We had no problems with EC2 VMs. By the way, we use VirtualBox to simulate EC2 locally: a Python class encapsulates the API to start the VMs and does it with boto3 in production and with VBoxManage in development.
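Roughly, the launcher class looks like the sketch below. The class names, constructor arguments, and the clone-then-start approach for VirtualBox are made up for illustration, not the real code. The boto3 and VBoxManage calls themselves are standard ones.

```python
import subprocess
from abc import ABC, abstractmethod


class VMLauncher(ABC):
    """Common interface: start a VM for a job and return its identifier."""

    @abstractmethod
    def start(self, job_id: str) -> str:
        ...


class EC2Launcher(VMLauncher):
    def __init__(self, image_id, instance_type="t3.small"):
        self.image_id = image_id
        self.instance_type = instance_type

    def start(self, job_id):
        import boto3  # imported lazily so development machines don't need it

        ec2 = boto3.client("ec2")
        resp = ec2.run_instances(
            ImageId=self.image_id,
            InstanceType=self.instance_type,
            MinCount=1,
            MaxCount=1,
            # Assumption: the VM self-destroys by shutting itself down.
            InstanceInitiatedShutdownBehavior="terminate",
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "job_id", "Value": job_id}],
            }],
        )
        return resp["Instances"][0]["InstanceId"]


class VirtualBoxLauncher(VMLauncher):
    def __init__(self, base_vm, run=subprocess.run):
        self.base_vm = base_vm
        self._run = run  # injectable so it can be faked in tests

    def start(self, job_id):
        name = f"job-{job_id}"
        # Clone a prepared base VM, then boot the clone without a GUI.
        self._run(["VBoxManage", "clonevm", self.base_vm,
                   "--name", name, "--register"], check=True)
        self._run(["VBoxManage", "startvm", name,
                   "--type", "headless"], check=True)
        return name
```

The rest of the application only ever sees `start(job_id)`, so switching between the two is a matter of which class gets instantiated from settings.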
What I don't understand is this: it's always Linux, amd64, RabbitMQ, yet my other customer, who uses Rails and Sidekiq, has no problems, and they run many more jobs. There is something in Celery's concurrency stack that is too fragile.