Between 08:59 and 13:10 UTC on 1st August 2022, customers experienced an outage of our app. This was triggered by the failure of hardware provisioned by our cloud provider (AWS). We are working to recover issue event data that was not processed during this time, so we expect no overall data loss.
At around 08:59 UTC, the EBS (Elastic Block Store) storage for the app’s NAT server node failed. This caused the node to stop processing traffic within the app’s private network.
With the NAT server node not functioning, outbound network calls from all app nodes began to fail. This resulted in the app’s core functionality failing.
At 09:08 UTC our monitoring system notified us of UI availability degradation, and alarms for other app components followed in the next few minutes. The NAT server node quickly became the suspect as it was failing to respond to manual commands. Initial diagnosis of the fault was impeded because AWS status checks for the NAT server node reported it as “healthy”, and continued to do so until 11:08 UTC.
First, we attempted to terminate and restart the NAT server node in order to trigger provisioning of new hardware. To our surprise, this failed because AWS did not have enough capacity for the instance type we were using. After liaising with multiple AWS support engineers, we created a new NAT server node using a more widely available instance type.
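The fallback described above was carried out by hand. As a rough illustration only, it could be automated along the following lines. This is a minimal sketch, not our actual tooling: `launch_with_fallback`, `InsufficientCapacity` and the `launch` callable are hypothetical names, and the instance type labels are placeholders. In practice `launch` would wrap the cloud provider's API and translate its capacity error (EC2 reports this with the error code `InsufficientInstanceCapacity`) into the exception used here.

```python
from typing import Callable, Sequence, Tuple


class InsufficientCapacity(Exception):
    """Hypothetical stand-in for a provider's "no capacity for this type" error."""


def launch_with_fallback(
    launch: Callable[[str], str],
    instance_types: Sequence[str],
) -> Tuple[str, str]:
    """Try each instance type in order of preference.

    Returns (instance_type, instance_id) for the first type that launches.
    Raises InsufficientCapacity if none of the candidates can be launched.
    """
    for instance_type in instance_types:
        try:
            return instance_type, launch(instance_type)
        except InsufficientCapacity:
            # No capacity for this type: move on to the next candidate.
            continue
    raise InsufficientCapacity(
        "no capacity for any of: " + ", ".join(instance_types)
    )


def fake_launch(instance_type: str) -> str:
    """Placeholder launcher for the demo: the preferred type has no capacity."""
    if instance_type == "preferred-type":
        raise InsufficientCapacity(instance_type)
    return "node-" + instance_type


chosen = launch_with_fallback(fake_launch, ["preferred-type", "fallback-type"])
```

Keeping the launcher as an injected callable means the ordering logic can be exercised without touching a real cloud account, and the list of candidate instance types can be agreed on ahead of an incident.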