SEV-1 · public access

API degraded for 90 minutes after automated tooling misread an index modification as two separate operations, causing premature index deletion

Stripe · Source

Started: Oct 8, 2015
Duration: 1h 39m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: Hundreds of thousands of Stripe merchants globally; approximately two-thirds of all API operations failed
Services: payments-api, stripe-dashboard, checkout, database

configuration error · database · hotfix · human error · manual action · missing index · partial outage

Summary

On October 8, 2015, an application developer submitted a request to modify an existing database index in order to improve API performance. Stripe's internal schema-management library misinterpreted the modification as two separate operations — adding a new index and deleting the old one — rather than a single in-place update. A database operator processing the change queue executed the deletion first, removing a critical index from all replicas simultaneously. The missing index caused a set of API endpoints to slow down and time out, and the resulting worker-pool starvation cascaded into a broad API outage lasting roughly 90 minutes. Recovery required rebuilding the index and deploying a temporary code patch to bypass the missing index.

Impact

Roughly two-thirds of all Stripe API requests failed or timed out for approximately 90 minutes. The Stripe Dashboard was also unavailable. Businesses relying on Stripe for real-time payment processing experienced failed checkouts and blocked transactions during the window.

Root cause

An application developer modified an existing database index description rather than adding or removing one. Stripe's schema-management tooling contained a bug that misread an in-place modification as two distinct operations — creation of a replacement index and deletion of the original — and queued them separately. The database operator had no mechanism to detect that the two tickets were interdependent, and processed the deletion ticket before the new index existed. The old index was dropped from all replicas simultaneously with no canary step, immediately degrading queries that depended on it. Worker-pool exhaustion then cascaded the localized query slowdown into a near-total API outage.
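The tooling bug described above can be pictured as a diff step that lost the link between the old and new index definitions. The following is a minimal sketch, not Stripe's actual library: `CreateIndex`, `DropIndex`, `ReplaceIndex`, and `diff_indexes` are hypothetical names, and index definitions are simplified to tuples of key names. It shows a diff that emits one combined operation when an existing index changes, so the drop can never be queued apart from the create.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class CreateIndex:
    name: str
    keys: Tuple[str, ...]


@dataclass(frozen=True)
class DropIndex:
    name: str


@dataclass(frozen=True)
class ReplaceIndex:
    """One unit of work: build the new index, verify it, then drop the old one."""
    name: str
    old_keys: Tuple[str, ...]
    new_keys: Tuple[str, ...]


Operation = Union[CreateIndex, DropIndex, ReplaceIndex]


def diff_indexes(current: Dict[str, Tuple[str, ...]],
                 desired: Dict[str, Tuple[str, ...]]) -> List[Operation]:
    """Compare current and desired index definitions (hypothetical format)."""
    ops: List[Operation] = []
    for name in sorted(current.keys() | desired.keys()):
        old, new = current.get(name), desired.get(name)
        if old is None:
            ops.append(CreateIndex(name, new))
        elif new is None:
            ops.append(DropIndex(name))
        elif old != new:
            # A changed definition stays a single operation, never a
            # separate create plus an independent drop.
            ops.append(ReplaceIndex(name, old, new))
    return ops
```

With this shape, whatever executes a `ReplaceIndex` owns the ordering of its internal steps, so an operator never sees a standalone deletion ticket for an index that is still in use.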

Resolution

Engineers rebuilt the deleted index in the background while simultaneously developing a code patch that redirected affected API endpoints away from the missing index, allowing the API to operate in a slightly degraded state. The index rebuild completed and normal operation was restored by 01:45 UTC.

Timeline

21:30 · MITIG
An application developer submitted a request to modify an existing database index as part of API performance work. The schema management library recorded this as two separate change tickets: create a new index, delete the old one.
database
00:06 · DETECT
A database operator processing the open change queue saw the original index flagged as no longer needed and followed standard removal procedure, deleting it from all replicas simultaneously. The replacement index did not yet exist.
database
00:06 · DETECT
Requests to a set of API endpoints began slowing down and timing out after the missing index caused full collection scans. Worker-pool starvation propagated the slowdown to all API endpoints.
payments-api
00:08 · DETECT
The on-call engineer was paged and acknowledged the alert within two minutes.
payments-api
00:10 · INVEST
Engineers correlated the API degradation to the recent index deletion.
database
00:17 · MITIG
The response team began rebuilding the deleted index in the background.
database
00:24 · INVEST
Engineers estimated the index rebuild would take over an hour and began pursuing parallel recovery paths, including a code-level fix to bypass the missing index.
payments-api
01:30 · MITIG
A code patch was deployed that modified affected API endpoints to avoid the missing index, allowing requests to succeed in a slightly degraded state. API availability began recovering.
payments-api
01:45 · RESOLV
The index rebuild completed. All services returned to normal operation with full query performance restored.
payments-api
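
The timeline shows slow queries on a few endpoints starving the shared worker pool until every endpoint was affected. One common way to contain that kind of cascade is a per-class concurrency cap, often called a bulkhead. The sketch below is only an illustration under assumed names (`Bulkhead`, `BulkheadFull`, and the endpoint class labels are hypothetical) and does not describe how Stripe's API workers are actually organized.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when an endpoint class has used up its share of workers."""


class Bulkhead:
    def __init__(self, limits):
        # limits maps an endpoint class name to its maximum concurrent requests.
        self._slots = {
            name: threading.BoundedSemaphore(n) for name, n in limits.items()
        }

    @contextmanager
    def slot(self, endpoint_class):
        sem = self._slots[endpoint_class]
        # Fail fast instead of queueing behind slow requests.
        if not sem.acquire(blocking=False):
            raise BulkheadFull(endpoint_class)
        try:
            yield
        finally:
            sem.release()
```

A request handler would wrap its work in `with bulkhead.slot("reports"):` and return an error when `BulkheadFull` is raised. Requests in other classes keep their own slots, so a slow query path can exhaust only its own allowance.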

Attribution

Stripe

By Infrastructure / Database Operations

Published Oct 8, 2015


Lessons

Schema management tooling that generates database operations must preserve the semantic intent of the original change; misrepresenting a modification as a delete-plus-create is a correctness failure that can produce severe operational consequences.
Database changes that have ordering dependencies should be encoded as a single atomic unit of work, not as separate queue entries that an operator must sequence correctly.
Dropping a critical index from all replicas simultaneously — rather than performing a canary change on one replica first — eliminates the ability to detect impact before it is global.
API worker pools represent a shared resource; a slow query on one endpoint can starve all endpoints, so isolation between endpoint classes is an important resilience property.
When operators process changes from a queue, the tooling should surface dependency relationships explicitly; relying on operators to infer ordering from context is fragile.
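
The lessons about ordering dependencies can be made concrete with a queue that refuses to release a ticket until everything it depends on has been completed. This is a minimal sketch under assumed names (`Ticket`, `ChangeQueue`, and the ticket ids are hypothetical), not a description of Stripe's change queue.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Ticket:
    id: str
    action: str
    depends_on: Set[str] = field(default_factory=set)


class ChangeQueue:
    def __init__(self, tickets: List[Ticket]):
        self._tickets: Dict[str, Ticket] = {t.id: t for t in tickets}
        self._done: Set[str] = set()

    def ready(self) -> List[str]:
        """Ids of open tickets whose dependencies have all been completed."""
        return sorted(
            t.id for t in self._tickets.values()
            if t.id not in self._done and t.depends_on <= self._done
        )

    def complete(self, ticket_id: str) -> None:
        ticket = self._tickets[ticket_id]
        missing = ticket.depends_on - self._done
        if missing:
            # Surface the dependency instead of relying on operator context.
            raise RuntimeError(f"{ticket_id} is blocked by {sorted(missing)}")
        self._done.add(ticket_id)
```

If the drop ticket declares a dependency on the create ticket, `ready()` shows only the create ticket at first, and an attempt to complete the drop early raises an error that names the blocking ticket.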

Action items

Fix the schema management library so that a modification to an existing index description is encoded as a single operation, not as an independent deletion and creation.
Add dependency tracking to the database change queue so that linked tickets are surfaced together and cannot be processed out of order.
Implement canary index operations that apply changes to one replica first and hold for a health check before propagating to the full cluster.
Review all API endpoints for worker-pool isolation so that slow queries on one code path cannot exhaust shared resources and cascade to unrelated endpoints.
Instrument query plans in production monitoring so that the absence or degradation of a critical index triggers an alert before user impact is measurable.
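
The canary action item above could look roughly like the following. This is a sketch under stated assumptions: `canary_rollout`, `apply_change`, and `is_healthy` are hypothetical names, and the hold period before the health check is left to the caller's `is_healthy` callback rather than modeled here.

```python
from typing import Callable, List, Sequence


def canary_rollout(replicas: Sequence[str],
                   apply_change: Callable[[str], None],
                   is_healthy: Callable[[str], bool]) -> List[str]:
    """Apply a change to one replica, check health, then continue to the rest.

    Returns the replicas that were changed. Stops after the canary if the
    health check fails, leaving the remaining replicas untouched.
    """
    if not replicas:
        return []
    canary, rest = replicas[0], replicas[1:]
    apply_change(canary)
    changed = [canary]
    if not is_healthy(canary):
        return changed  # halt here; the caller decides whether to roll back
    for replica in rest:
        apply_change(replica)
        changed.append(replica)
    return changed
```

Under this scheme a harmful index drop would reach a single replica, and the failed health check would stop it before the rest of the cluster was touched.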