Affected by the outage since about 6:15 AM PT this morning. We're still down as of 9:00 AM PT.
Our existing containers were in a failure state and are now in a partial failure state. Containers are running, but the underlying storage/database is offline.
Many questions on their forum are similar to our situation. People are wondering whether they should restart their containers to get things working again, whether doing anything risks losing data, or whether they should just give everything more time.
I'm glad Railway updated their status page, but more details need to be posted so everyone knows what to do now.
Everyone has outages; it's the way of life and technology. Communication with your customers always makes it less painful, and people remember good communication, not the outage. Railway, let's start hearing more communication. The forum is having problems as well. Thanks.
(Angelo from Railway here)
Heard. Being transparent, usually the delay on ack is us trying to determine and correlate the issue. We have a post mortem going out, but we note that the first report was in our system 10 minutes before it was acked, during which time the platform team was trying to see which layer the impact was at.
That said, this is maybe concern #1 of the support team: we want the delta between a report and the customer outage being detected to be as small as possible. The way it usually works is that the platform alarms and pages go first, and then the platform engineer pages a support eng. to run communications.
Usually the priority is to have the platform engineer focus on triaging the issue and then offload the workload to our support team so that we can accurately state what is going on. We have a new comms clustering system that is rolling out so that if we get 5 reports with similar content, it pages up to the support team as well. (We will roll this out after we have communicated with affected customers first.)
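For illustration only, here is a minimal sketch of how a clustering rule like the one described above might work. It is not Railway's implementation. Only the threshold of 5 similar reports comes from the comment; the similarity measure (word-overlap Jaccard), the 0.5 cutoff, and every name in the code are assumptions made for the example.

```python
import re
from dataclasses import dataclass, field

PAGE_THRESHOLD = 5       # from the comment: page after 5 similar reports
SIMILARITY_CUTOFF = 0.5  # assumption: minimum word overlap to count as "similar"


def tokens(text):
    """Lowercase a report and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of words two reports have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


@dataclass
class Cluster:
    seed: set                      # words of the first report in the cluster
    reports: list = field(default_factory=list)
    paged: bool = False            # ensures each cluster pages only once


class ReportClusterer:
    """Hypothetical helper: groups similar reports and pages support once."""

    def __init__(self, page):
        self.clusters = []
        self.page = page           # callback that receives the clustered reports

    def add(self, report):
        words = tokens(report)
        # Attach the report to the first cluster whose seed is similar enough.
        for cluster in self.clusters:
            if jaccard(words, cluster.seed) >= SIMILARITY_CUTOFF:
                break
        else:
            cluster = Cluster(seed=words)
            self.clusters.append(cluster)
        cluster.reports.append(report)
        # Page the support team the first time the cluster hits the threshold.
        if len(cluster.reports) >= PAGE_THRESHOLD and not cluster.paged:
            cluster.paged = True
            self.page(list(cluster.reports))
        return cluster
```

In use, `ReportClusterer(page=notify_support)` would be fed each incoming report through `add()`, where `notify_support` stands in for whatever paging hook exists. Comparing against only the first report in each cluster keeps the sketch short; a real system would likely use a more robust similarity measure and a time window.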