Hello, I'm one of the engineers who worked on the incident. We have mitigated the incident as of 14:43 PT / 22:43 UTC. Sorry for the trouble.
Any chance you guys could do write-ups on these incidents similar to how CloudFlare does? For all the heat some people give them, I trust CloudFlare more with my websites than a lot of other companies because of their dedication to transparency.
The one time you desperately need to ask Claude and it isn’t working…
Can you divulge more on the issue?
Only curious as a developer and DevOps engineer. It's all quite interesting where and how things go wrong, especially with large deployments like Anthropic's.
Hope you have a good rest of your weekend
Thank you for your service.
it's still down get back to work
Also an engineer on this incident. This was a network routing misconfiguration - an overlapping route advertisement caused traffic to some of our inference backends to be blackholed. Detection took longer than we’d like (about 75 minutes from impact to identification), and some of our normal mitigation paths didn’t work as expected during the incident.
The bad route has been removed and service is restored. We’re doing a full review internally with a focus on synthetic monitoring and better visibility into high-impact infrastructure changes to catch these faster in the future.
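For anyone curious what "overlapping route advertisement" means in practice: routers pick the most-specific matching prefix (longest-prefix match), so a mistakenly advertised more-specific route can silently capture and drop traffic that a broader, correct route would have delivered. Here's a minimal sketch of that failure mode — all prefixes, next-hop names, and the dict-based "routing table" are made up for illustration, not anything from Anthropic's actual network:

```python
# Illustration of longest-prefix-match routing and how an overlapping,
# more-specific route with no valid next hop blackholes traffic.
# Prefixes and next hops are hypothetical.
import ipaddress

def lookup(table, dst):
    """Return the next hop for dst using longest-prefix match."""
    addr = ipaddress.ip_address(dst)
    matches = [(prefix, nh) for prefix, nh in table.items() if addr in prefix]
    if not matches:
        return None  # no route at all
    # The most-specific (longest) matching prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

table = {
    ipaddress.ip_network("10.0.0.0/8"): "backend-gw",  # intended route
}
print(lookup(table, "10.1.2.3"))  # -> backend-gw

# A misconfigured, more-specific overlapping advertisement appears:
table[ipaddress.ip_network("10.1.0.0/16")] = None  # no valid next hop
print(lookup(table, "10.1.2.3"))  # -> None: traffic is blackholed
```

The broader /8 route is still present and correct the whole time, which is part of why this class of failure can be slow to detect: from the control plane's point of view nothing looks "down", traffic to the overlapped range just disappears.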