Hacker News

palcu last Sunday at 10:56 PM | 7 replies

Hello, I'm one of the engineers who worked on the incident. We have mitigated the incident as of 14:43 PT / 22:43 UTC. Sorry for the trouble.


Replies

l1n last Sunday at 11:45 PM

Also an engineer on this incident. This was a network routing misconfiguration - an overlapping route advertisement caused traffic to some of our inference backends to be blackholed. Detection took longer than we’d like (about 75 minutes from impact to identification), and some of our normal mitigation paths didn’t work as expected during the incident.

The bad route has been removed and service is restored. We’re doing a full review internally with a focus on synthetic monitoring and better visibility into high-impact infrastructure changes to catch these faster in the future.
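
For readers unfamiliar with how an overlapping route advertisement can blackhole traffic, here is a minimal sketch of one common mechanism, longest-prefix matching. The comment above does not say which mechanism applied in this incident, so this is an illustration only. The prefixes, the next-hop names, and the `lookup` helper are all hypothetical.

```python
import ipaddress

# Hypothetical routing table of (prefix, next hop) pairs. Values are illustrative only.
ROUTES = [
    (ipaddress.ip_network("10.0.0.0/16"), "backend-pool"),
]


def lookup(routes, address):
    """Return the next hop of the most specific (longest) matching prefix, or None."""
    addr = ipaddress.ip_address(address)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # Routers prefer the longest matching prefix, so a narrower route wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Before the bad advertisement, the address resolves to the healthy pool.
before = lookup(ROUTES, "10.0.5.7")

# An overlapping, more specific route is advertised by mistake (hypothetical).
bad_routes = ROUTES + [(ipaddress.ip_network("10.0.4.0/22"), "blackhole")]

# Addresses inside the narrower prefix now follow the bad route...
after = lookup(bad_routes, "10.0.5.7")

# ...while addresses outside it are unaffected, which can make the fault look partial.
unaffected = lookup(bad_routes, "10.0.9.1")

print(before, after, unaffected)
```

In this sketch only the addresses covered by the narrower prefix are affected, which matches the "some of our inference backends" wording and is one reason such faults can be slow to detect without synthetic probes aimed at each backend.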

giancarlostoro yesterday at 12:59 AM

Any chance you guys could do write-ups on these incidents, similar to how Cloudflare does? For all the heat some people give them, I trust Cloudflare more with my websites than a lot of other companies because of their dedication to transparency.

nickpeterson last Sunday at 11:07 PM

The one time you desperately need to ask Claude and it isn’t working…

dan_wood last Sunday at 10:58 PM

Can you divulge more on the issue?

Just curious as a developer and DevOps engineer. It's quite interesting where and how things go wrong, especially with large deployments like Anthropic's.

dgellow last Sunday at 11:38 PM

Hope you have a good rest of your weekend

Chance-Device last Sunday at 11:05 PM

Thank you for your service.

g-mork last Sunday at 11:38 PM

it's still down get back to work