SEV-1 · public access

API fully unavailable for 44 minutes after failures in an internal event queueing system cascaded to the Stripe API, Checkout, and Dashboard

Stripe · Source

Started
Dec 17, 2015
Duration
53m
Users affected
Not disclosed
Revenue impact
Not disclosed
Blast radius
All Stripe API users globally; Stripe API, Checkout, and Dashboard were fully unavailable during the complete outage window
Services
payments-api, stripe-dashboard, checkout, event-queue
background degradation · cascading failure · full outage · queue or stream · restart

Summary

On December 17, 2015, failures in Stripe's internal event queueing system caused a cascade that degraded the Stripe API for 9 minutes and then took it fully offline for an additional 44 minutes. During the outage window, merchants could not process payments via the API, Checkout, or the Dashboard. Stripe published an initial incident summary shortly after recovery while continuing to investigate the deeper root cause.

Impact

On December 17, 2015, the Stripe API, Checkout, and Dashboard were partially degraded for 9 minutes and then completely unavailable for 44 minutes, for a total incident duration of 53 minutes. Merchants worldwide were unable to accept payments or access their accounts during the full-outage window.

Root cause

Failures in an internal event queueing service propagated to dependent API services, first degrading them and then making them fully unavailable. The specific mechanism by which queue failures cascaded to total API unavailability was still under investigation at the time of Stripe's public communications, and Stripe did not publish a detailed root cause narrative for this event.

Resolution

Stripe restored service by stabilizing or restarting the affected event queueing components. Normal API operation resumed by approximately 00:53 UTC.

Lessons

  • Internal queueing systems that are deeply coupled to live API availability create a single point of failure; failures in event infrastructure should degrade gracefully rather than cascading to customer-facing services.
  • Publishing an initial incident summary while investigation continues is the right practice for trust and transparency, even if the root cause is not yet fully known.

Action items

  • Introduce circuit-breaking between the event queueing system and the API serving layer so that queue failures do not propagate directly to customer-facing request handling.
  • Improve fault isolation so that internal infrastructure failures lead, at worst, to degraded-but-functional modes rather than complete outages.
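
The circuit-breaking action item can be illustrated with a minimal Python sketch. It is not Stripe's implementation, and Stripe has not published one for this incident. Every name here (`CircuitBreaker`, `handle_payment_request`, `publish_event`, the thresholds, and the `deferred` buffer) is hypothetical. The sketch assumes event publishing is not required for the payment itself to succeed, which may not hold for every system.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the protected call is skipped."""


class CircuitBreaker:
    """Hypothetical breaker guarding calls to an internal event queue."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once reset_timeout has elapsed, report closed so one trial call
        # can go through (the "half-open" state).
        return self._clock() - self._opened_at < self.reset_timeout

    def call(self, func, *args, **kwargs):
        if self.is_open:
            raise CircuitOpenError("event queue circuit is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                # Open (or re-open after a failed trial call).
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def handle_payment_request(payment, publish_event, breaker, deferred):
    """Hypothetical API handler: a queue failure defers the event
    instead of failing the customer-facing request."""
    event = {"type": "payment.created", "id": payment["id"]}
    try:
        breaker.call(publish_event, event)
        event_status = "published"
    except Exception:
        # Assumption: events can be buffered and replayed later.
        deferred.append(payment["id"])
        event_status = "deferred"
    return {"id": payment["id"], "status": "succeeded", "events": event_status}
```

With this shape, repeated queue failures open the circuit, later requests skip the queue call entirely, and the handler still returns a successful response with the event marked as deferred. That is the degraded-but-functional mode the second action item describes. A production version would also need thread safety and a durable replay buffer, both omitted here.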