SEV-1 · public access

API degraded for 90 minutes after automated tooling misread an index modification as two separate operations, causing premature index deletion

Stripe

Started: Oct 8, 2015
Duration: 1h 39m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: Hundreds of thousands of Stripe merchants globally; approximately two-thirds of all API operations failed
Services: payments-api, stripe-dashboard, checkout, database
configuration error · database · hotfix · human error · manual action · missing index · partial outage

Summary

On October 8, 2015, an application developer submitted a request to modify an existing database index in order to improve API performance. Stripe's internal schema-management library misinterpreted the modification as two separate operations — adding a new index and deleting the old one — rather than a single in-place update. A database operator processing the change queue executed the deletion first, removing a critical index from all replicas simultaneously. The missing index caused a set of API endpoints to slow down and time out, and the resulting worker-pool starvation cascaded into a broad API outage lasting roughly 90 minutes. Recovery required rebuilding the index and deploying a temporary code patch to bypass the missing index.

Impact

Roughly two-thirds of all Stripe API requests failed or timed out for approximately 90 minutes. The Stripe Dashboard was also unavailable. Businesses relying on Stripe for real-time payment processing experienced failed checkouts and blocked transactions during the window.

Root cause

An application developer modified an existing database index description rather than adding or removing one. Stripe's schema-management tooling contained a bug that misread an in-place modification as two distinct operations (creation of a replacement index and deletion of the original) and queued them separately. The database operator had no mechanism to detect that the two tickets were interdependent, and processed the deletion ticket before the new index existed. The old index was dropped from all replicas simultaneously with no canary step, immediately degrading queries that depended on it. Worker-pool exhaustion then turned the localized query slowdown into a broad API outage.
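
The difference between the two interpretations of the change can be illustrated with a minimal Python sketch. It is not Stripe's schema-management library; `IndexSpec`, `naive_diff`, and `intent_preserving_diff` are hypothetical names, and the sketch assumes each index carries a stable logical name that survives a modification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexSpec:
    name: str      # hypothetical stable logical name for the index
    fields: tuple  # ordered field names the index covers


def naive_diff(old, new):
    """Compare specs as opaque values.

    A modified spec equals nothing in the old set, so it shows up as one
    creation and one unrelated drop, with no ordering recorded between them.
    """
    ops = [("create", spec) for spec in new if spec not in old]
    ops += [("drop", spec) for spec in old if spec not in new]
    return ops


def intent_preserving_diff(old, new):
    """Match specs by name so a modification stays one operation."""
    old_by_name = {spec.name: spec for spec in old}
    new_by_name = {spec.name: spec for spec in new}
    ops = []
    for name, spec in new_by_name.items():
        previous = old_by_name.get(name)
        if previous is None:
            ops.append(("create", spec))
        elif previous != spec:
            # One unit of work: whoever executes it must build the new
            # index before retiring the previous one.
            ops.append(("replace", previous, spec))
    for name, spec in old_by_name.items():
        if name not in new_by_name:
            ops.append(("drop", spec))
    return ops
```

With one index whose field list changes, `naive_diff` returns two independent operations while `intent_preserving_diff` returns a single `replace`, which leaves no separate deletion for an operator to pick up early.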

Resolution

Engineers started rebuilding the deleted index in the background and, in parallel, developed a code patch that redirected affected API endpoints away from the missing index, allowing the API to operate in a slightly degraded state. The index rebuild completed and normal operation was restored by 01:45 UTC.

Lessons

  • Schema management tooling that generates database operations must preserve the semantic intent of the original change; misrepresenting a modification as a delete-plus-create is a correctness failure that can produce severe operational consequences.
  • Database changes that have ordering dependencies should be encoded as a single atomic unit of work, not as separate queue entries that an operator must sequence correctly.
  • Dropping a critical index from all replicas simultaneously — rather than performing a canary change on one replica first — eliminates the ability to detect impact before it is global.
  • API worker pools represent a shared resource; a slow query on one endpoint can starve all endpoints, so isolation between endpoint classes is an important resilience property.
  • When operators process changes from a queue, the tooling should surface dependency relationships explicitly; relying on operators to infer ordering from context is fragile.
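
The ordering lessons above can be sketched as a change queue that refuses to run a ticket while its prerequisites are unfinished. This is only an illustration under stated assumptions: `ChangeTicket`, `ChangeQueue`, and the `depends_on` field are hypothetical and do not describe Stripe's actual tooling.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeTicket:
    ticket_id: str
    action: str  # hypothetical label, e.g. "create_index" or "drop_index"
    depends_on: set = field(default_factory=set)  # ids that must finish first


class ChangeQueue:
    def __init__(self):
        self._tickets = {}
        self._done = set()

    def submit(self, ticket):
        self._tickets[ticket.ticket_id] = ticket

    def blockers(self, ticket_id):
        """Return the unfinished prerequisites so the operator sees them."""
        return sorted(self._tickets[ticket_id].depends_on - self._done)

    def execute(self, ticket_id):
        pending = self.blockers(ticket_id)
        if pending:
            # The queue enforces the ordering; the operator does not have
            # to infer it from context.
            raise RuntimeError(f"{ticket_id} is blocked by {pending}")
        self._done.add(ticket_id)
        return self._tickets[ticket_id].action
```

In this sketch a drop ticket that lists the matching create ticket in `depends_on` cannot be executed first; `execute` raises and `blockers` names the ticket that has to finish.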

Action items

  • Fix the schema management library so that a modification to an existing index description is encoded as a single operation, not as an independent deletion and creation.
  • Add dependency tracking to the database change queue so that linked tickets are surfaced together and cannot be processed out of order.
  • Implement canary index operations that apply changes to one replica first and hold for a health check before propagating to the full cluster.
  • Review all API endpoints for worker-pool isolation so that slow queries on one code path cannot exhaust shared resources and cascade to unrelated endpoints.
  • Instrument query plans in production monitoring so that the absence or degradation of a critical index triggers an alert before user impact is measurable.
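
The canary item above could take roughly the following shape. This is a minimal sketch, not a description of Stripe's implementation: `canary_rollout`, `apply_change`, and `is_healthy` are hypothetical names, and the health check is assumed to be a callable that reports whether the canary replica still serves queries acceptably after the change.

```python
def canary_rollout(replicas, apply_change, is_healthy):
    """Apply a change to one replica, check it, then apply it to the rest.

    Raises RuntimeError if the canary fails its health check. In that case
    the remaining replicas are never touched.
    """
    if not replicas:
        return []
    canary, rest = replicas[0], list(replicas[1:])

    apply_change(canary)
    if not is_healthy(canary):
        # Stop here: only the canary has changed, so the impact stays local.
        raise RuntimeError(f"canary {canary!r} failed its health check")

    for replica in rest:
        apply_change(replica)
    return [canary, *rest]
```

The design choice is to halt rather than attempt an automatic revert: restoring a dropped index means rebuilding it, so the sketch leaves that decision to the operator while keeping every other replica unchanged.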