Hacker News

ndneighbor yesterday at 7:26 PM

(Angelo from Railway here)

Heard. To be transparent, the delay on ack is usually us trying to determine and correlate the issue. We have a post-mortem going out, but we note that the first report was in our system 10 minutes before it was acked; during that time the platform team was trying to see which layer the impact was at.

That said, this is maybe concern #1 of the support team: we want the delta between a report and the customer outage being detected to be as small as possible. The way it usually works is that the platform alarms and pages go first, and then the platform engineer will page a support engineer to run communications.

Usually the priority is to have the platform engineer focus on triaging the issue and then offload the workload to our support team so that we can accurately state what is going on. We have a new comms clustering system that is rolling out so that if we get 5 reports with similar content, it pages the support team as well. (We will roll this out after we have communicated with affected customers first.)
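
For illustration only, the "5 similar reports pages support" rule could be sketched roughly like this. This is not Railway's implementation: the `ReportClusterer` class, the `page` callback, the 0.6 similarity cutoff, and the use of `difflib` for text similarity are all assumptions made for the example; only the threshold of 5 comes from the description above.

```python
from difflib import SequenceMatcher

PAGE_THRESHOLD = 5       # from the comment above: 5 similar reports
SIMILARITY_CUTOFF = 0.6  # assumed value, would need tuning on real reports


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similar(a, b, cutoff=SIMILARITY_CUTOFF):
    """Treat two reports as similar if their character-level ratio is high."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= cutoff


class ReportClusterer:
    """Hypothetical greedy clusterer: pages once per cluster at the threshold."""

    def __init__(self, threshold=PAGE_THRESHOLD, page=print):
        self.threshold = threshold
        self.page = page      # stand-in for a real paging integration
        self.clusters = []    # each cluster is a list of report texts
        self.paged = set()    # indices of clusters that have already paged

    def add(self, report):
        """Add a report; return True if this report triggered a page."""
        for i, cluster in enumerate(self.clusters):
            # Compare against the first report in the cluster only.
            if similar(cluster[0], report):
                cluster.append(report)
                break
        else:
            self.clusters.append([report])
            i = len(self.clusters) - 1

        if len(self.clusters[i]) >= self.threshold and i not in self.paged:
            self.paged.add(i)
            self.page(f"{len(self.clusters[i])} similar reports: {self.clusters[i][0]!r}")
            return True
        return False
```

Comparing each new report only against the first report in a cluster keeps the sketch short; a production system would more likely use a time window and a sturdier similarity measure.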


Replies

iJohnDoe yesterday at 8:02 PM

Thanks for the reply. Understood.

In situations like this, please dedicate at least one team member to respond as quickly as possible to the Railway Help Station posts. That's where your customers are going for communication and support.