Service Outage detected
Incident Report for The Plugin People
Resolved
Larger app nodes appear to have chewed through the webhook backlog, and mail is flowing again.
Posted Jan 31, 2020 - 17:28 UTC
Monitoring
That was gnarly! We believe stability has now been restored: the en-masse webhook updates from Jira have subsided and we're back to business as usual (BAU). No code changes caused this, and none were required to fix it. We've increased the database size again to provide more capacity in similar situations in future, and work is underway to further improve resilience. Again, apologies to those affected; we believe no data was lost. We continue to monitor.
Posted Jan 31, 2020 - 17:22 UTC
Update
We believe the problems are caused by database connection exhaustion, driven by repeated high-volume 'missed' webhooks from Jira Cloud. Every time a new node comes up, it's buried by events. We have some options to increase capacity, which we will apply soon.
Posted Jan 31, 2020 - 07:16 UTC
Update
Mail is being sent and received intermittently and slowly. UI access is erratic. The root cause is still elusive; we will continue in 8 hours (must sleep now).
Posted Jan 30, 2020 - 23:14 UTC
Update
Sorry for the long-running incident. We haven't made any code or infrastructure changes that would cause this problem; it's data-driven. We are working to mitigate the problem and will post updates when we have them.
Posted Jan 30, 2020 - 20:26 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 30, 2020 - 15:44 UTC
Investigating
We are currently investigating this issue.
Posted Jan 30, 2020 - 13:48 UTC
This incident affected: JEMH Cloud (Inbound mail processing, Outbound notification processing).