SEV-0 · public access

API severely degraded twice in one day: a minor database version upgrade introduced a latent failover bug that triggered under a rare multi-node stall condition, and the rollback intended to fix it interacted with a recent config change to cause a second distinct outage

Stripe · Source

Started: Jul 10, 2019
Duration: 6h 12m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: All Stripe API users globally; a substantial majority of API requests failed during both degradation windows (16:35–17:02 UTC and 21:14–22:47 UTC)
Services: payments-api, database, stripe-dashboard

background degradation · cascading failure · config drift · configuration fix · database · dependency version mismatch · full outage · rollback

Summary

On July 10, 2019, the Stripe API experienced two separate periods of severe degradation. The first, lasting 27 minutes, was caused by a latent bug introduced three months earlier in a minor database version upgrade: under a rare condition where multiple nodes stall simultaneously, the new version's failover protocol could not elect a primary, leaving an entire shard unable to accept writes. Because this shard underpinned a wide range of API operations, compute resources were rapidly exhausted across the API. After restarting the cluster to force a new election, Stripe recovered and then rolled back the database version as a precaution. That rollback triggered the second outage: the rolled-back version interacted unexpectedly with a recent configuration change to the production shards, causing CPU starvation on all affected shards. The second outage lasted 93 minutes and required engineers to identify the config interaction, apply a corrected configuration, and restart the cluster again.

Impact

A substantial majority of Stripe API requests failed during two windows: 16:35–17:02 UTC (27 minutes) and 21:14–22:47 UTC (93 minutes). Businesses relying on Stripe for real-time payment processing experienced failed transactions during both windows. After the event, Stripe emailed customers who had five or more failed POST requests.
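As a quick arithmetic check on the figures above, the sketch below recomputes both window lengths from the stated UTC times. The helper name `minutes_between` is illustrative only. It also shows that the 6h 12m duration in the header is the span from the start of the first window to the end of the second, while the degraded time inside that span adds up to two hours.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two UTC timestamps given in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

# Degradation windows as stated in this write-up (UTC).
windows = [
    ("2019-07-10 16:35", "2019-07-10 17:02"),
    ("2019-07-10 21:14", "2019-07-10 22:47"),
]

durations = [minutes_between(s, e) for s, e in windows]
degraded_minutes = sum(durations)

# Span from the start of the first window to the end of the second.
span_minutes = minutes_between(windows[0][0], windows[-1][1])

print(durations)         # [27, 93]
print(degraded_minutes)  # 120
print(divmod(span_minutes, 60))  # (6, 12)
```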

Root cause

Three months before the incident, Stripe upgraded its database cluster to a new minor version that introduced a subtle bug in the failover election protocol, one that manifests only when multiple nodes stall simultaneously. Four days before the incident, two nodes in one critical shard stalled for undetermined reasons; these nodes stopped reporting replication lag but continued passing active health checks, masking the degraded state. On July 10, the original primary for this shard failed, triggering a leader election, but the stalled nodes prevented the cluster from completing it, leaving the shard without a writable primary. Because the shard supported widespread application writes, including core API paths, its unavailability cascaded into compute starvation across the entire API. After restarting the cluster to force a successful election (the first-period remediation), engineers rolled back the database version to eliminate the election bug. That rollback interacted with a recently introduced configuration change on the production shards, producing CPU starvation across all affected shards and causing the second, distinct outage.
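The masking condition described above (a node that passes active health checks but has gone silent on replication lag) can be expressed as a simple staleness check. The following is a minimal sketch, not Stripe's monitoring: the `NodeStatus` fields, the `silently_stalled` function, and the 60-second threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class NodeStatus:
    # Hypothetical shape for illustration; real monitoring data will differ.
    name: str
    passes_health_check: bool
    last_lag_report_ts: Optional[float]  # Unix seconds; None if never reported

def silently_stalled(nodes: List[NodeStatus], now: float,
                     max_silence_s: float = 60.0) -> List[str]:
    """Names of nodes that look healthy but have stopped reporting replication lag."""
    flagged = []
    for node in nodes:
        silent = (node.last_lag_report_ts is None
                  or now - node.last_lag_report_ts > max_silence_s)
        # A node that fails its health check is already visible to operators;
        # the dangerous case is "healthy" plus "silent".
        if node.passes_health_check and silent:
            flagged.append(node.name)
    return flagged

now = 1_000_000.0
nodes = [
    NodeStatus("node-a", True, now - 5),              # reporting normally
    NodeStatus("node-b", True, now - 4 * 24 * 3600),  # silent for four days
    NodeStatus("node-c", True, None),                 # never reported
    NodeStatus("node-d", False, now - 600),           # caught by the health check
]
stalled = silently_stalled(nodes, now)
print(stalled)  # ['node-b', 'node-c']
```

The design point is that the absence of a metric is treated as a signal in its own right, instead of being read as "nothing to report".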

Resolution

First period: Engineers restarted all nodes in the affected database cluster at 17:00 UTC, restoring a successful leader election. The API fully recovered by 17:02 UTC.

Second period: Engineers identified that the rolled-back database version was interacting with a recent configuration change, applied the corrected production configuration, restarted the cluster's nodes at 22:34 UTC, and verified full recovery by 22:47 UTC.

Timeline

Three months earlier · MITIG
Stripe upgraded database clusters to a new minor version, performing thorough QA testing and a phased production rollout from less-critical to more-critical clusters. The new version introduced a latent election protocol bug that only manifests when multiple nodes stall simultaneously.
database
Four days earlier · MITIG
Two nodes in a critical, widely used database shard stalled for undetermined reasons. The nodes stopped emitting replication lag metrics but continued passing active health checks, masking the degraded cluster state.
database
16:35 · DETECT
The primary node for the affected database shard failed. The cluster attempted leader election but could not complete it due to the presence of the two stalled nodes interacting with the new version's election protocol. The shard became unable to accept writes.
database
16:35 · DETECT
Applications writing to the shard began timing out. Compute resources across the API were starved as write operations queued and retried, cascading into severe API degradation across all endpoints.
payments-api
16:36 · DETECT
Automated monitoring detected the failed election and paged the on-call team. Incident response began within two minutes.
database
16:50 · INVEST
Engineers determined that the cluster was unable to elect a primary and traced this novel, complex failure mode to the interaction between the stalled nodes and the election protocol bug.
database
17:00 · MITIG
Engineers restarted all nodes in the database cluster, forcing a successful primary election.
database
17:02 · RESOLV
The Stripe API fully recovered from the first degradation period.
payments-api
20:13 · INVEST
During the root cause investigation, engineers identified a likely code path in the new database version's election protocol responsible for the bug.
database
20:42 · MITIG
As a precautionary remediation, Stripe rolled back the affected cluster to the previous stable database version. The rollback was deployed within four minutes.
database
21:14 · DETECT
Automated alerts fired: multiple shards in the cluster became unavailable, including the shard from the first incident. Symptoms appeared identical to the first outage. The API started returning errors, beginning the second degradation period.
database
21:26 · INVEST
Engineers identified that the second outage had a different root cause: the rolled-back database version was interacting with a recently introduced configuration change on the production shards, causing CPU starvation. Resource contention slowed the rollout of the necessary configuration fix.
database
22:34 · MITIG
Engineers successfully rolled out the corrected production configuration and restarted the affected cluster nodes.
database
22:47 · RESOLV
The Stripe API fully recovered after engineers verified cluster health and ramped traffic back up, prioritizing user-initiated API requests.
payments-api

Attribution

Stripe

By Database Infrastructure

Published Jul 10, 2019

Lessons

A phased rollout that proceeds without triggering a failure mode provides false confidence; rare trigger conditions — such as simultaneous node stalls — may not manifest during any reasonable staging period.
Health checks that confirm a node is reachable are not equivalent to health checks that confirm a node is functioning correctly; a stalled node that has stopped reporting replication lag should not pass active health checks.
When a service has many dependents and goes latent rather than failing hard, the cascading impact on shared compute resources can be far worse than a clean hard failure.
A rollback in a distributed system is not a true revert to a prior state; the rolled-back software runs against a production environment that has changed since the previous version was in service, and interactions with recent configuration changes can create entirely new failure modes.
When an incident recurs with the same symptoms, the natural assumption is that the same problem has returned; the second outage demonstrated that identical symptoms can have a different root cause, so mitigation playbooks should include explicit checks that confirm the cause rather than relying on pattern-matching alone.
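The lesson about a dependency that goes latent and drains shared compute can be illustrated with a bulkhead: a cap on how many requests may wait on any one shard at a time, so a stuck shard cannot occupy every worker. This is a minimal sketch under assumed names (`ShardBulkhead`, `try_acquire`, `release`) and an arbitrary limit; it does not describe how Stripe's API workers are built.

```python
import threading
from collections import defaultdict

class ShardBulkhead:
    """Caps concurrent in-flight operations per shard and fails fast when full."""

    def __init__(self, max_in_flight_per_shard: int):
        self._limit = max_in_flight_per_shard
        self._lock = threading.Lock()
        self._in_flight = defaultdict(int)  # shard name -> current in-flight count

    def try_acquire(self, shard: str) -> bool:
        # Returns False immediately instead of queueing behind a stuck shard.
        with self._lock:
            if self._in_flight[shard] >= self._limit:
                return False
            self._in_flight[shard] += 1
            return True

    def release(self, shard: str) -> None:
        with self._lock:
            if self._in_flight[shard] > 0:
                self._in_flight[shard] -= 1

bulkhead = ShardBulkhead(max_in_flight_per_shard=2)

# Two requests are already waiting on a shard that cannot accept writes;
# the third is rejected up front rather than tying up another worker.
stuck = [bulkhead.try_acquire("shard-1") for _ in range(3)]

# Requests for a different shard are unaffected.
other = bulkhead.try_acquire("shard-2")

print(stuck)  # [True, True, False]
print(other)  # True
```

A caller would pair each successful `try_acquire` with a `release` in a `finally` block, and return an error to the client when `try_acquire` returns False.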

Action items

Add monitoring that alerts when any database node stops emitting replication lag metrics, even if the node continues to respond as healthy to active checks.
Implement a circuit breaker that fires when a shard enters a state consistent with an election failure before the primary has officially failed.
Introduce additional fault isolation between individual database shards and the API worker pool to prevent a single shard's unavailability from exhausting shared compute resources.
Add circuit-breaking on repeated failed operations targeting specific database clusters, limiting the retry amplification that accelerates resource starvation.
Introduce tooling safeguards and rollback verification procedures that check for known configuration-version interactions before applying a database version rollback under incident conditions.
Work with the database software maintainers to produce a fix for the underlying election protocol bug that manifests when multiple nodes stall simultaneously.
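The circuit-breaking item in the list above can be sketched as a per-cluster breaker that stops sending operations after a run of failures and lets traffic retry after a cooldown. The class name, thresholds, and injectable clock are assumptions for illustration, and the recovery step is simplified: once the cooldown elapses, calls are allowed again until the next failure reopens the breaker.

```python
import time

class ClusterCircuitBreaker:
    """Tracks consecutive failures per database cluster and blocks calls while open."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = {}   # cluster -> consecutive failure count
        self._opened_at = {}  # cluster -> time the breaker last opened

    def allow(self, cluster: str) -> bool:
        opened = self._opened_at.get(cluster)
        if opened is None:
            return True
        # Once the cooldown has elapsed, let calls through again as a trial.
        return self._clock() - opened >= self.cooldown_s

    def record_success(self, cluster: str) -> None:
        self._failures.pop(cluster, None)
        self._opened_at.pop(cluster, None)

    def record_failure(self, cluster: str) -> None:
        count = self._failures.get(cluster, 0) + 1
        self._failures[cluster] = count
        if count >= self.failure_threshold:
            # Open (or re-open) the breaker and restart the cooldown.
            self._opened_at[cluster] = self._clock()

# A fake clock keeps the example deterministic.
fake_now = [0.0]
breaker = ClusterCircuitBreaker(failure_threshold=3, cooldown_s=30.0,
                                clock=lambda: fake_now[0])

for _ in range(3):
    breaker.record_failure("cluster-a")

blocked = breaker.allow("cluster-a")     # breaker is open: no more retries
unaffected = breaker.allow("cluster-b")  # other clusters keep serving

fake_now[0] = 31.0
retry_allowed = breaker.allow("cluster-a")  # cooldown elapsed: trial call allowed

print(blocked, unaffected, retry_allowed)  # False True True
```

Keeping the state keyed by cluster is the part that matters for this incident: failures against one cluster stop generating retries without affecting calls to healthy clusters.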