Open Playback · Free & MCP-native

Production failure memory
for your AI coding agent

Real incidents from GitHub, Cloudflare, Linear and others, structured and exposed over MCP. Plug it into Claude or Cursor and ask how production actually breaks.

all · abuse event · api gateway · auth · background degradation · background jobs · bgp misconfiguration · cache · cache stampede · capacity shortfall

Connect via MCP

Drop this into Claude Desktop or Cursor. Then ask your agent about cache stampedes, BGP failures, or DNS outages.

"open-playback": {
  "type": "http",
  "url": "https://open-mcp.aftermath.sh/"
}
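
The entry above is a fragment, so it has to sit inside whatever object your client uses to list MCP servers. The sketch below is one way to merge it into an existing config, assuming the client keeps its servers under a top-level "mcpServers" key. That key name and the `merge_mcp_server` helper are assumptions for illustration, not something this page specifies; check your client's documentation for the actual file layout.

```python
import json

# The entry exactly as shown on the page.
OPEN_PLAYBACK_ENTRY = {
    "open-playback": {
        "type": "http",
        "url": "https://open-mcp.aftermath.sh/",
    }
}


def merge_mcp_server(config: dict, entry: dict) -> dict:
    """Return a copy of `config` with `entry` added under "mcpServers".

    Assumption: the client lists servers in a top-level "mcpServers"
    object. Existing servers are kept; the input dict is not mutated.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers.update(entry)
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    # Hypothetical existing config with one unrelated server.
    existing = {"mcpServers": {"some-other-server": {"command": "example"}}}
    print(json.dumps(merge_mcp_server(existing, OPEN_PLAYBACK_ENTRY), indent=2))
```

Merging into a copy rather than editing in place makes it easy to diff the result against the current file before writing anything.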

32 encores

Sorted by date

Honeycomb
Oct 12, 2018
SEV-1
Two partial API outages from RDS CPU saturation and runaway cache-refresh queries
On October 4, 2018, Honeycomb suffered a partial API outage as RDS MySQL stalled at roughly 90% CPU. A second, less severe incident hit on October 11. In both cases the proximate trigger was a small event coinciding with a baseline CPU level that had drifted up to 30 to 40% over months, leaving no headroom. The deeper cause was cache-refresh queries (rate limit, sampling, blacklist) that fanned out into thousands of concurrent queries instead of running one at a time. The team shipped a cache fix to coalesce refreshes and upgraded the RDS instance from m4.xlarge to m5.2xlarge during a brief maintenance window on October 12, leaving the service stable.
7d 7h 38m · Not disclosed affected · API customers globally during two ~1-hour windows plus a 2-minute maintenance interruption
cache · capacity shortfall · database · thundering herd
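
To make "coalescing refreshes" concrete, here is a minimal sketch of the general technique (often called single-flight): when many callers ask for the same cache key at once, one of them runs the query and the rest wait for its result. This is not Honeycomb's actual fix; the `CoalescingRefresher` class and the "rate-limits" key are hypothetical names used only for illustration.

```python
import threading
import time
from concurrent.futures import Future
from typing import Callable, Dict


class CoalescingRefresher:
    """Run at most one refresh per key at a time; concurrent callers share it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight: Dict[str, Future] = {}

    def get(self, key: str, load: Callable[[], object]) -> object:
        with self._lock:
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                # First caller for this key becomes the leader.
                future = Future()
                self._in_flight[key] = future
        if not leader:
            # Followers block until the leader publishes a result or error.
            return future.result()
        try:
            future.set_result(load())
        except BaseException as exc:
            future.set_exception(exc)
        finally:
            with self._lock:
                self._in_flight.pop(key, None)
        return future.result()


if __name__ == "__main__":
    query_count = []

    def slow_query() -> str:
        query_count.append(1)  # stands in for one database round trip
        time.sleep(0.2)
        return "fresh limits"

    refresher = CoalescingRefresher()
    threads = [
        threading.Thread(target=refresher.get, args=("rate-limits", slow_query))
        for _ in range(50)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("database queries issued:", len(query_count))
```

The in-flight entry is removed as soon as the load finishes, so this only deduplicates concurrent refreshes; it does not cache the value itself.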
Honeycomb
May 4, 2018
SEV-0
Sudden RDS MySQL performance collapse causes near-total Honeycomb outage
On May 3, 2018, the production RDS MySQL instance backing Honeycomb's API experienced a sudden and dramatic performance collapse, with P95 query time jumping from 11 ms to over 1000 ms in roughly 20 seconds while write throughput dropped from 780 ops per second to 5. The application stack reacted by saturating the connection pool with retries, and the service was almost entirely unavailable for approximately 24 hours. Only about 15% of incoming events were successfully stored during the outage, though customer-side buffering allowed many to be replayed afterward. The team initially worried that a Go application bug might be hammering the database, but later confirmed the root issue was at the database layer.
1d · Not disclosed affected · global; nearly complete service outage
cascading failure · connection pool exhaustion · data loss · database
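
Retries that pile onto a struggling database are a common way for a slowdown to become an outage. The sketch below shows one widely used mitigation, a bounded number of attempts with capped exponential backoff and random jitter. It is a generic illustration rather than what Honeycomb deployed; `call_with_backoff` and its parameters are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.05,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying failures a bounded number of times.

    The wait before retry n is drawn uniformly from
    [0, min(max_delay, base_delay * 2**n)], so clients that failed
    together do not all retry at the same instant.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the error, stop retrying
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_write() -> str:
        # Hypothetical write that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("database busy")
        return "stored"

    print(call_with_backoff(flaky_write), "after", state["calls"], "calls")
```

Passing `sleep` as a parameter keeps the function easy to test without real delays. A hard cap on attempts matters as much as the jitter, because unbounded retries keep connections occupied while the database is already slow.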
Honeycomb
Oct 17, 2017
SEV-1
Kafka 0.10 controller bug after ZooKeeper network partition causes write loss across four partitions
A ZooKeeper network partition the night before silently left the Kafka cluster in a fragile state. Around 6 a.m. PDT the next morning, a Kafka controller action exposed the latent damage, and end-to-end checks began failing on four partitions. Engineers spent hours debugging a split-brain condition between restarted and un-restarted brokers, with offsets on data nodes drifting ahead of acknowledged offsets because Kafka kept accepting writes that ZooKeeper had not acknowledged. Recovery required restarting all brokers and manually resetting offsets on data nodes. The underlying issue was attributed to Kafka bugs fixed in 0.10.2.1.
6h · Not disclosed affected · 33% of customers actively sending data; 4 of N Kafka partitions
cascading failure · data loss · network · queue or stream
Stripe
Dec 17, 2015
SEV-1
API fully unavailable for 44 minutes after failures in an internal event queueing system cascaded to the Stripe API, Checkout, and Dashboard
On December 17, 2015, failures in Stripe's internal event queueing system caused a cascade that degraded the Stripe API for 9 minutes and then took it fully offline for an additional 44 minutes. During the outage window, merchants could not process payments via the API, Checkout, or the Dashboard. Stripe published an initial incident summary shortly after recovery while continuing to investigate the deeper root cause.
53m · Not disclosed affected · All Stripe API users globally; Stripe API, Checkout, and Dashboard were fully unavailable during the complete outage window
background degradation · cascading failure · full outage · queue or stream
Stripe
Oct 8, 2015
SEV-1
API degraded 90 minutes after automated tooling misread an index modification as two separate operations, causing premature index deletion
On October 8, 2015, an application developer submitted a request to modify an existing database index in order to improve API performance. Stripe's internal schema-management library misinterpreted the modification as two separate operations — adding a new index and deleting the old one — rather than a single in-place update. A database operator processing the change queue executed the deletion first, removing a critical index from all replicas simultaneously. The missing index caused a set of API endpoints to slow down and time out, and the resulting worker-pool starvation cascaded into a broad API outage lasting roughly 90 minutes. Recovery required rebuilding the index and deploying a temporary code patch to bypass the missing index.
1h 39m · Not disclosed affected · Hundreds of thousands of Stripe merchants globally; approximately two-thirds of all API operations failed
configuration error · database · hotfix · human error
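
The failure above hinged on ordering: once a change is split into an add and a drop, running the drop first leaves a window with no usable index. The sketch below shows one possible guard, assuming a change queue of simple add/drop records. The `IndexOp` type, the helper functions, and the table and index names are all hypothetical; this is not Stripe's tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IndexOp:
    action: str  # "add" or "drop"
    table: str
    index: str


def drop_precedes_add(ops: List[IndexOp]) -> bool:
    """True if some table has a drop queued before an add on the same table."""
    dropped_tables = set()
    for op in ops:
        if op.action == "drop":
            dropped_tables.add(op.table)
        elif op.action == "add" and op.table in dropped_tables:
            return True
    return False


def order_adds_before_drops(ops: List[IndexOp]) -> List[IndexOp]:
    """Reorder so every add runs before any drop.

    sorted() is stable, so adds keep their relative order, as do drops.
    """
    return sorted(ops, key=lambda op: 0 if op.action == "add" else 1)


if __name__ == "__main__":
    # Hypothetical queue produced by splitting one index modification.
    queue = [
        IndexOp("drop", "charges", "idx_old"),
        IndexOp("add", "charges", "idx_new"),
    ]
    if drop_precedes_add(queue):
        queue = order_adds_before_drops(queue)
    for op in queue:
        print(op.action, op.table, op.index)
```

A check like this only addresses ordering within one queue. It does not confirm that the new index has finished building on every replica before the old one is dropped, which a real rollout would also need to verify.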
