Is there any indication these errors are related to Anthropic-written code as opposed to operational issues from the fastest-growing infra buildout ever?
Layer-wise, the app is pretty far removed from request routing to GPU pools.
I'm not sure that's really an Anthropic problem you're pointing to vs. a problem that their infra layer handles (Amazon, Google, whatever hyperscaler). I.e., they might be scaling quickly, but they are running on top of established infrastructure.
This is almost certainly a software issue, though. Even if it's due to scaling, they still built a system that failed catastrophically rather than degrading gracefully.