I've had to repeatedly tell our AWS account reps that we're not even a little interested in the Trainium or Inferentia instances unless they have a provably reliable track record of working with the standard libraries we have to use, like Transformers and PyTorch.
I know they claim they work, but that's only on their happy path, with their very specific AMIs and the nightmare that is the Neuron SDK. Try to do any real work with them using your own dependencies and things tend to fall apart immediately.
It was just in the past couple of years that it really became worthwhile to use TPUs if you're on GCP, and that's only because of the huge investment on Google's part in software support. I'm not going to sink hours and hours into beta testing AWS's software just to use their chips.
IMO, once you get off the core services, AWS is full of beta services. S3, Dynamo, Lambda, ECS, etc. are all solid. But a lot of their other services have some big rough patches.