API severely degraded twice in one day: a minor database version upgrade introduced a latent failover bug that triggered under a rare multi-node stall condition, and the rollback intended to fix it interacted with a recent config change to cause a second distinct outage
Stripe · Source
- Started
- Jul 10, 2019
- Duration
- 6h 12m
- Users affected
- Not disclosed
- Revenue impact
- Not disclosed
- Blast radius
- All Stripe API users globally; a substantial majority of API requests failed during both degradation windows (16:35–17:02 UTC and 21:14–22:47 UTC)
- Services
- payments-api, database, stripe-dashboard
Join the waitlist
Aftermath helps you ship structured post-mortems like this one for your own incidents. Encore keeps narrative, timeline, lessons, and action items in one place so the document stays useful after the incident is closed. Join the waitlist on the homepage when you want that workflow for your organization.
Summary
On July 10, 2019, the Stripe API experienced two separate periods of severe degradation. The first, lasting 27 minutes, was caused by a latent bug introduced three months earlier in a minor database version upgrade: under a rare condition where multiple nodes stall simultaneously, the new version's failover protocol could not elect a primary, leaving an entire shard unable to accept writes. Because this shard underpinned a wide range of API operations, compute resources were rapidly exhausted across the API. After restarting the cluster to force a new election, Stripe recovered and then rolled back the database version as a precaution. That rollback triggered the second outage: the rolled-back version interacted unexpectedly with a recent configuration change to the production shards, causing CPU starvation on all affected shards. The second outage lasted 93 minutes and required engineers to identify the config interaction, apply a corrected configuration, and restart the cluster again.
Impact
A substantial majority of Stripe API requests failed during two windows: 16:35–17:02 UTC (27 minutes) and 21:14–22:47 UTC (93 minutes). Businesses relying on Stripe for real-time payment processing experienced failed transactions during both windows. Stripe notified customers with five or more failed POST requests by email after the event.
Root cause
Three months before the incident, Stripe upgraded their database cluster to a new minor version that introduced a subtle bug in the failover election protocol, detectable only when multiple nodes stall simultaneously. Four days before the incident, two nodes in one critical shard stalled for undetermined reasons; these nodes stopped reporting replication lag but continued passing active health checks, masking the degraded state. On July 10, the original primary for this shard failed, triggering a leader election — but the stalled nodes prevented the cluster from completing the election, leaving the shard without a writable primary. Because the shard supported widespread application writes including core API paths, the unavailability cascaded into compute starvation across the entire API. After engineering restarted the cluster to restore election (first-period remediation), they rolled back the database version to eliminate the election bug. That rollback interacted with a recently-introduced configuration change on the production shards, producing CPU starvation across all affected shards and causing the second, distinct outage.
Resolution
First period: Engineers restarted all nodes in the affected database cluster at 17:00 UTC, restoring a successful leader election. The API fully recovered by 17:02 UTC. Second period: Engineers identified that the rolled-back database version was interacting with a recent configuration change, applied the corrected production configuration, restarted the cluster's nodes at 22:34 UTC, and verified full recovery by 22:47 UTC.
Lessons
- A phased rollout that proceeds without triggering a failure mode provides false confidence; rare trigger conditions — such as simultaneous node stalls — may not manifest during any reasonable staging period.
- Health checks that confirm a node is reachable are not equivalent to health checks that confirm a node is functioning correctly; stalled nodes with degraded replication lag should not pass active health checks.
- When a service has many dependents and goes latent rather than failing hard, the cascading impact on shared compute resources can be far worse than a clean hard failure.
- A rollback in a distributed system is not a true revert to a prior state; the rolled-back software runs against a production environment that has changed since the previous version was in service, and interactions with recent configuration changes can create entirely new failure modes.
- When an incident recurs with the same symptoms, the natural assumption is recurrence of the same problem; the second outage demonstrated that identical symptoms can have a different root cause, and mitigation playbooks should include explicit checks rather than just pattern-matching.
Action items
- Add monitoring that alerts when any database node stops emitting replication lag metrics, even if the node continues to respond as healthy to active checks.
- Implement a circuit breaker that fires when a shard enters a state consistent with an election failure before the primary has officially failed.
- Introduce additional fault isolation between individual database shards and the API worker pool to prevent a single shard's unavailability from exhausting shared compute resources.
- Add circuit-breaking on repeated failed operations targeting specific database clusters, limiting the retry amplification that accelerates resource starvation.
- Introduce tooling safeguards and rollback verification procedures that check for known configuration-version interactions before applying a database version rollback under incident conditions.
- Work with the database software maintainers to produce a fix for the underlying election protocol bug that manifests when multiple nodes stall simultaneously.