What happened
At 02:16 UTC on 8 January 2026, a backlog of unprocessed issue event webhooks began to build up in the US/Global region (customers in the EU data residency region were unaffected). When office hours started at 08:30 UTC, we reviewed the alert notices, began investigating, and saw increasing delays in inbound email processing. At 10:11 UTC we posted a statuspage report to advise customers of expected processing delays.
We found that response times for requests from the app to the Jira REST API were longer than usual, sometimes exceeding 30 seconds, which caused background tasks to overrun their scheduled execution times.
As per-tenant background tasks overran, the scheduling of new work slowed, triggering excessive database activity that added to overall database load and further impacted other operations.
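The pile-up mechanism can be illustrated with a minimal sketch (the names and data shapes here are ours for illustration, not the app's actual scheduler): a scheduler tick that dispatches due tasks but skips any whose previous run is still in flight. Without such a guard, an overrunning task is dispatched again on every tick, multiplying traffic against the shared scheduling table.

```python
def run_due_tasks(tasks, now, running):
    """Dispatch tasks whose next_run time has passed, skipping any that
    are still running from a previous tick.

    `tasks`   -- list of dicts with "id", "next_run", and "interval" keys
    `now`     -- current time (same units as next_run/interval)
    `running` -- set of task ids whose previous execution has not finished

    A skipped task keeps its old next_run, so it is picked up on a later
    tick once its previous run completes, rather than piling up.
    """
    dispatched = []
    for task in tasks:
        if task["next_run"] <= now and task["id"] not in running:
            dispatched.append(task["id"])
            task["next_run"] = now + task["interval"]
    return dispatched
```

For example, with a mail-retrieval task still in flight, only the event-processing task is dispatched; the mail task is neither re-dispatched nor rescheduled until its running instance finishes.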
As the database became overloaded with queries from each compute node's task scheduler acquiring and releasing locks on the task scheduling table, compute node health checks began to fail, as they depend on the node's task scheduler being responsive. Once health checks started to fail, compute nodes were forcibly replaced with new instances.
Initialization of a new compute node depended on its task scheduler starting up successfully. Because each node's scheduler could not acquire a lock on the shared database table, the node failed to complete initialization, was deemed unhealthy, and was terminated and replaced. This created a loop of background task compute node termination and initialization.
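The restart loop described above can be sketched as a small simulation (function names and attempt counts are hypothetical, chosen only to show the shape of the failure): node startup succeeds only if the scheduler can take the shared lock, and every failed startup triggers another replacement.

```python
def initialize_node(acquire_lock, max_attempts=3):
    """Simplified node startup: healthy only if the scheduler can take
    the shared scheduling-table lock within a few attempts."""
    for attempt in range(1, max_attempts + 1):
        if acquire_lock():
            return ("healthy", attempt)
    return ("terminated", max_attempts)  # orchestrator replaces the node


def replacement_loop(lock_free_after, max_replacements=10):
    """Replace terminated nodes until one initializes successfully.

    `lock_free_after` models the external condition clearing: the lock
    becomes acquirable from that generation onward. While it is held
    elsewhere, every freshly created node is terminated in turn.
    Returns the generation that finally came up healthy, or None.
    """
    for generation in range(max_replacements):
        status, _ = initialize_node(lambda: generation >= lock_free_after)
        if status == "healthy":
            return generation
    return None
```

In this model, `replacement_loop(3)` churns through three doomed node generations before the fourth initializes, mirroring how the loop only broke once the underlying lock contention cleared.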
At 17:00 UTC the underlying source of the problem cleared without any code or infrastructure changes on our part. Job nodes began to work through the backlog rapidly, and processing was back to normal by 18:00 UTC for most customers. A few higher-volume customers had larger backlogs that took a little longer to clear.
Root cause
We did not identify a single specific API endpoint at fault; rather, responses from Atlassian servers were slower in general. No infrastructure or code changes were made by us before the incident occurred, so it is most likely that calls blocking within Atlassian instances caused job processing to lock up.
Impact assessment
Our team assessed the impact of the disruption and determined that:
What actions we are taking
Slow responses from external APIs should not affect the running of background tasks. To address this, we are changing how we schedule tasks such as mail retrieval and event processing (for outbound mail). That work is in progress and should be complete in the next week or two. We are also fine-tuning how the app makes HTTP requests, both in terms of timeouts and connection pool management. The reliability of the app is our top priority and we will redouble our efforts to meet our customers' expectations in this matter.
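One way to bound the effect of a slow external call, sketched here purely as an illustration (the function name is ours, not the app's actual code), is to enforce a hard per-request deadline so a stalled HTTP call cannot hold up a scheduler tick:

```python
import concurrent.futures

def call_with_deadline(fn, deadline_s, fallback=None):
    """Run a blocking call (e.g. an HTTP request) with a hard deadline.

    If `fn` has not returned within `deadline_s` seconds, give up and
    return `fallback` so the caller can move on. Note the worker thread
    keeps running until `fn` itself returns, so `fn` should still set
    its own socket-level connect and read timeouts.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        pool.shutdown(wait=False)
```

A deadline like this complements, rather than replaces, socket-level timeouts and a bounded connection pool: the deadline protects the scheduler, while the socket timeouts and pool limits stop slow requests from accumulating underneath it.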