Hello, I'm one of the engineers who worked on the incident. We have mitigated the incident as of 14:43 PT / 22:43 UTC. Sorry for the trouble.
Any chance you guys could do write-ups on these incidents similar to how CloudFlare does? For all the heat some people give them, I trust CloudFlare more with my websites than a lot of other companies because of their dedication to transparency.
The one time you desperately need to ask Claude and it isn’t working…
Can you divulge more on the issue?
Only curious as a developer and DevOps engineer. It's all quite interesting where and how things go wrong, especially with large deployments like Anthropic's.
Hope you have a good rest of your weekend
Thank you for your service.
it's still down get back to work
Also an engineer on this incident. This was a network routing misconfiguration - an overlapping route advertisement caused traffic to some of our inference backends to be blackholed. Detection took longer than we’d like (about 75 minutes from impact to identification), and some of our normal mitigation paths didn’t work as expected during the incident.
The bad route has been removed and service is restored. We’re doing a full review internally with a focus on synthetic monitoring and better visibility into high-impact infrastructure changes to catch these faster in the future.
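For anyone curious what "overlapping route advertisement" means in practice: routers pick the most-specific matching prefix (longest-prefix match), so a mistakenly advertised more-specific route can silently capture and drop traffic that a broader, correct route would have delivered. Here's a minimal sketch of that failure mode — all prefixes, next-hop names, and the dict-based "routing table" are made up for illustration, not anything from Anthropic's actual network:

```python
# Illustration of longest-prefix-match routing and how an overlapping,
# more-specific route with no valid next hop blackholes traffic.
# Prefixes and next hops are hypothetical.
import ipaddress

def lookup(table, dst):
    """Return the next hop for dst using longest-prefix match."""
    addr = ipaddress.ip_address(dst)
    matches = [(prefix, nh) for prefix, nh in table.items() if addr in prefix]
    if not matches:
        return None  # no route at all
    # The most-specific (longest) matching prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

table = {
    ipaddress.ip_network("10.0.0.0/8"): "backend-gw",  # intended route
}
print(lookup(table, "10.1.2.3"))  # -> backend-gw

# A misconfigured, more-specific overlapping advertisement appears:
table[ipaddress.ip_network("10.1.0.0/16")] = None  # no valid next hop
print(lookup(table, "10.1.2.3"))  # -> None: traffic is blackholed
```

The broader /8 route is still present and correct the whole time, which is part of why this class of failure can be slow to detect: from the control plane's point of view nothing looks "down", traffic to the overlapped range just disappears.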