Email processing delays

Incident Report for The Plugin People

Postmortem

What happened

At 02:16 UTC on 8th January 2026 a backlog of unprocessed issue event webhooks began to build up in the US/Global region (customers using the EU data residency region were unaffected). When office hours started at 08:30 UTC, we reviewed the alert notices, began investigating, and saw increasing delays in inbound email processing. At 10:11 UTC we posted a statuspage report to advise customers of expected processing delays.

We found that response times for requests from the app to the Jira REST API were longer than usual, sometimes exceeding 30 seconds, which caused background tasks to overrun their scheduled execution times.

As per-tenant background tasks overran, the scheduling of new work slowed down, triggering excessive database activity that added to overall database load and further impacted other operations.

As the database became overloaded with queries from each compute node’s task scheduler relating to acquiring and releasing task scheduling table locks, compute node health checks began to fail, as they depend on the node’s task scheduler being responsive. Once health checks started to fail, compute nodes were forcibly replaced with new instances.

The initialization of a new compute node was dependent on its task scheduler starting up successfully. Because each node’s scheduler instance could not acquire a table lock on the shared database, the node would fail to complete initialization, be deemed unhealthy, and be terminated and replaced. This created a loop of background task compute node termination and initialization.
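To illustrate the failure mode described above (a minimal sketch, not JEMHC's actual code; the function and state names are hypothetical): when node startup blocks indefinitely on a shared lock, contention becomes a restart loop, whereas bounding the acquisition with a timeout lets a node come up in a degraded state instead of being terminated.

```python
import threading

def initialize_node(shared_lock: threading.Lock, timeout: float = 5.0) -> str:
    """Sketch of node startup that bounds lock acquisition.

    A threading.Lock stands in for the shared database table lock.
    If it cannot be acquired within `timeout`, the node starts in a
    degraded state rather than failing initialization outright.
    """
    if shared_lock.acquire(timeout=timeout):
        try:
            return "healthy"   # scheduler started normally
        finally:
            shared_lock.release()
    return "degraded"          # lock contended: defer scheduling, stay alive

# Simulate the incident: another node holds the lock indefinitely.
lock = threading.Lock()
lock.acquire()                 # held by a "stuck" scheduler elsewhere
print(initialize_node(lock, timeout=0.1))  # prints "degraded", not a crash
```

In this sketch the health check would then report "degraded" rather than failing, breaking the terminate-and-recreate loop.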

At 17:00 UTC the underlying source of the problem cleared without any code or infrastructure changes on our side, and job nodes began to work through the backlog rapidly; processing was back to normal by 18:00 UTC for most customers. A few higher-volume customers had larger backlogs, which took a little longer to clear.

Root cause

We did not identify a single specific API endpoint at fault; rather, responses from Atlassian servers in general were slower. No infrastructure or code changes were made by us before the incident occurred, so it is most likely that calls blocking within Atlassian instances caused job processing to lock up.

Impact assessment

Our team assessed the impact of the disruption and determined that:

  • There was no data loss: issue events continued to be received and pre-processed as normal.
  • There were variable levels of delay in processing inbound mail and outbound notifications. Hosts with small queued data volumes saw little or no impact; hosts with larger volumes of changes were more affected, as their traffic took longer to return to real-time processing.

What actions we are taking

Slow responses from external APIs should not impact the running of background tasks. To address this, we are changing how we schedule tasks such as mail retrieval and event processing (for outbound mail); that work is in progress and should be complete in the next week or two. We are also fine-tuning how the app makes HTTP requests, both in terms of timeouts and connection pool management. The reliability of the app is our top priority and we will redouble our efforts to meet our customers' expectations in this matter.

Posted Jan 14, 2026 - 17:04 UTC

Resolved

The high-volume event backlog cleared overnight. We are progressing a replacement of the scheduling framework to help in similar future scenarios.
Posted Jan 09, 2026 - 06:49 UTC

Monitoring

The majority of customers are now at real-time processing; only customers with large volumes have backlogs that continue to be worked through.
Posted Jan 08, 2026 - 19:30 UTC

Investigating

We are continuing to investigate this issue.
Posted Jan 08, 2026 - 19:28 UTC

Monitoring

Job scheduling is back to normal and the backlog is being processed apace; we continue to monitor.
Posted Jan 08, 2026 - 19:15 UTC

Update

We are continuing to investigate this issue.
Posted Jan 08, 2026 - 19:14 UTC

Update

We have paused processing on some high-volume hosts; as a result, jobs are now running, inbound mail is likely up to date, and the backlog of events is being worked through.
Posted Jan 08, 2026 - 15:53 UTC

Update

For clarity this affects US region instances only, EU (Frankfurt) is unaffected.
Posted Jan 08, 2026 - 13:19 UTC

Update

The team continues to investigate. We see authorization calls to Atlassian hosts taking longer than expected, and this delay is impacting the job scheduling responsible for processing inbound mail and sending notifications - there don't appear to be any Atlassian statuspage problems right now. Backlogged events/mail will be processed, but not in real time. We will provide an update within 2 hours.
Posted Jan 08, 2026 - 12:32 UTC

Investigating

We are currently investigating this issue.
Posted Jan 08, 2026 - 10:11 UTC
This incident affected: Enterprise Mail Handler for Jira Cloud (JEMHC) (Inbound mail processing, Outbound notification processing).