More than by the downtime I am much more surprised by the actual uptime. Hard to imagine how difficult this must be, given the speed of growth.
On the other hand, the status page is blaming the authentication system, which one would think is not a frontier-class problem.
Would have thought that compared to training the serving part is pretty easy. Less of a “everything needs to come together at once” and more just move demand to a working cluster if one bombs & have some spare capacity
Truly! As someone who's worked with HPC and GPUs in a scientific research context, trying to get a service like this to work reliably is a different ballgame to your usual webapp stack...