Open Playback · Free & MCP-native
Production failure memory
for your AI coding agent
Real incidents from GitHub, Cloudflare, Linear and others, structured and exposed over MCP. Plug it into Claude or Cursor and ask how production actually breaks
2 encores
Sorted by date
- SEV-0
Stripe
Jul 10, 2019
API severely degraded twice in one day: a minor database version upgrade introduced a latent failover bug that triggered under a rare multi-node stall condition, and the rollback intended to fix it interacted with a recent config change to cause a second distinct outage
On July 10, 2019, the Stripe API experienced two separate periods of severe degradation. The first, lasting 27 minutes, was caused by a latent bug introduced three months earlier in a minor database version upgrade: under a rare condition where multiple nodes stall simultaneously, the new version's failover protocol could not elect a primary, leaving an entire shard unable to accept writes. Because this shard underpinned a wide range of API operations, compute resources were rapidly exhausted across the API. After restarting the cluster to force a new election, Stripe recovered and then rolled back the database version as a precaution. That rollback triggered the second outage: the rolled-back version interacted unexpectedly with a recent configuration change to the production shards, causing CPU starvation on all affected shards. The second outage lasted 93 minutes and required engineers to identify the config interaction, apply a corrected configuration, and restart the cluster again.
6h 12mNot disclosed affectedAll Stripe API users globally; a substantial majority of API requests failed during both degradation windows (16:35–17:02 UTC and 21:14–22:47 UTC)background degradationcascading failureconfig driftconfiguration fix - SEV-1
Stripe
Dec 17, 2015
API fully unavailable for 44 minutes after failures in an internal event queueing system cascaded to the Stripe API, Checkout, and Dashboard
On December 17, 2015, failures in Stripe's internal event queueing system caused a cascade that degraded the Stripe API for 9 minutes and then took it fully offline for an additional 44 minutes. During the outage window, merchants could not process payments via the API, Checkout, or the Dashboard. Stripe published an initial incident summary shortly after recovery while continuing to investigate the deeper root cause.
53mNot disclosed affectedAll Stripe API users globally; Stripe API, Checkout, and Dashboard were fully unavailable during the complete outage windowbackground degradationcascading failurefull outagequeue or stream
Your own encores
Turn your incidents into structured post-mortems.
Aftermath runs the whole show: Live Stage captures the incident, Encore writes the post-mortem. Private and structured, like these, but yours.
Get early access