System outage
Incident Report for The Plugin People
Postmortem

Summary

Between 08:59 and 13:10 UTC on 1st August 2022, customers experienced an outage of our app. The outage was triggered by the failure of hardware provisioned by our cloud provider (AWS). We are recovering the issue event data that was not processed during this time, so we expect no data loss overall.

Leadup

At around 08:59 UTC, the EBS (Elastic Block Store) storage for the app’s NAT server node failed. As a result, the node stopped processing traffic within the app’s private network.

Fault

With the NAT server node not functioning, outbound network calls from all app nodes began to fail. This resulted in the app’s core functionality failing:

  • app user interface and processing of inbound email were unavailable until the outage was resolved
  • issue events were not processed during the outage; recovery and processing of this event data is under way

Detection

At 09:08 UTC our monitoring system notified us of UI availability degradation, and alarms for other app components followed within minutes. The NAT server node quickly became the suspect, as it was failing to respond to manual commands. Initial diagnosis of the fault was impeded by the fact that AWS status checks for the NAT server node showed it as “healthy”, and continued to do so until 11:08 UTC.
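
The gap described above, where the provider's status checks report “healthy” while traffic is not actually flowing, can be narrowed by probing the outbound path directly from a node in the private network. The following is a minimal sketch of such a probe, not a description of our monitoring system; the function names and the choice of targets are assumptions made for illustration.

```python
import socket


def outbound_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def nat_path_healthy(targets, timeout=3.0, connect=outbound_reachable):
    """Treat the outbound path as healthy if at least one external target is reachable.

    `targets` is a list of (host, port) pairs outside the private network.
    `connect` is injectable so the logic can be exercised without network access.
    """
    return any(connect(host, port, timeout) for host, port in targets)
```

Checking several independent targets means that a single unreachable external host is not mistaken for a failure of the NAT path itself.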

Mitigation and resolution

First, we attempted to terminate and restart the NAT server node in order to trigger provisioning of new hardware. To our surprise, this failed because AWS did not have enough capacity for the instance type used. After liaising with multiple AWS support engineers, we created a new NAT server node with a more readily available instance type.
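
The manual fallback to a different instance type can also be expressed as an ordered retry. The sketch below is illustrative only and is not the procedure used during the incident: `launch` is a hypothetical callable that would wrap the provider's API call, and `InsufficientCapacity` is a hypothetical stand-in for the provider's capacity error (for EC2, we believe the corresponding error code is `InsufficientInstanceCapacity`).

```python
class InsufficientCapacity(Exception):
    """Hypothetical stand-in for a provider's 'no capacity for this type' error."""


def launch_with_fallback(launch, instance_types):
    """Try each instance type in order of preference.

    Returns (instance_type, result) for the first type that launches.
    Raises InsufficientCapacity if every candidate type is out of capacity.
    Any other exception from `launch` is not caught and propagates.
    """
    failed = []
    for instance_type in instance_types:
        try:
            return instance_type, launch(instance_type)
        except InsufficientCapacity:
            failed.append(instance_type)
    raise InsufficientCapacity("no capacity for any of: " + ", ".join(failed))
```

Only the capacity error triggers a fallback, so configuration or permission errors still surface immediately instead of being retried on another type.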

Lessons learnt

  • Issue event data recovery plans need to be improved to reduce recovery time - we are improving our systems so that this data can be processed as soon as possible, both now and in future
  • Before the incident, we had identified the NAT server node as a point of failure and work had begun to remove the need for it entirely
  • While cloud computing capacity can seem limitless from a consumer perspective, the reality is that even the biggest cloud providers can run out of hardware capacity - always have a plan B for the unexpected
Posted Aug 02, 2022 - 20:08 UTC

Resolved
This incident has been resolved for a while now. Further post-mortem write-up and information to come.
Posted Aug 01, 2022 - 15:20 UTC
Monitoring
AWS engineers have fixed the issues on their side and have completed configuration changes. A post-mortem will be written.
Posted Aug 01, 2022 - 13:07 UTC
Update
We are currently communicating with our cloud infrastructure provider (AWS) as the problem appears to stem from their side.
Posted Aug 01, 2022 - 10:28 UTC
Update
We are continuing to investigate this issue.
Posted Aug 01, 2022 - 09:17 UTC
Investigating
We are currently investigating this issue.
Posted Aug 01, 2022 - 09:15 UTC
This incident affected: JEMH Cloud (Inbound mail processing, Outbound notification processing, User Interface).