Open Playback · Free & MCP-native
Production failure memory
for your AI coding agent
Real incidents from GitHub, Cloudflare, Linear and others, structured and exposed over MCP. Plug it into Claude or Cursor and ask how production actually breaks.
32 encores
Sorted by date
- SEV-1
Honeycomb
Oct 12, 2018
Two partial API outages from RDS CPU saturation and runaway cache-refresh queries
On October 4, 2018, Honeycomb suffered a partial API outage as RDS MySQL stalled at roughly 90% CPU. A second, less severe incident hit on October 11. In both cases the proximate trigger was a small event coinciding with a baseline CPU level that had drifted up to 30 to 40% over months, leaving no headroom. The deeper cause was cache-refresh queries (rate limit, sampling, blacklist) that fanned out into thousands of concurrent queries instead of running one at a time. The team shipped a cache fix to coalesce refreshes and upgraded the RDS instance from m4.xlarge to m5.2xlarge during a brief maintenance window on October 12, after which the service remained stable.
7d 7h 38m · Not disclosed affected · API customers globally during two ~1-hour windows plus a 2-minute maintenance interruption · cache, capacity shortfall, database, thundering herd
- SEV-0
Honeycomb
May 4, 2018
Sudden RDS MySQL performance collapse causes near-total Honeycomb outage
On May 3, 2018, the production RDS MySQL instance backing Honeycomb's API experienced a sudden and dramatic performance collapse, with P95 query time jumping from 11 ms to over 1000 ms in roughly 20 seconds while write throughput dropped from 780 ops per second to 5. The application stack reacted by saturating the connection pool with retries, and the service was almost entirely unavailable for approximately 24 hours. Only about 15% of incoming events were successfully stored during the outage, though customer-side buffering allowed many to be replayed afterward. The team initially worried that a Go application bug might be hammering the database, but later confirmed the root issue was at the database layer.
1d · Not disclosed affected · global; nearly complete service outage · cascading failure, connection pool exhaustion, data loss, database
- SEV-1
Honeycomb
Oct 17, 2017
Kafka 0.10 controller bug after ZooKeeper network partition causes write loss across four partitions
A ZooKeeper network partition the night before silently left the Kafka cluster in a fragile state. Around 6 a.m. PDT the next morning, the Kafka controller did something that exposed the latent damage and end-to-end checks began failing on four partitions. Engineers spent hours debugging a split-brain condition between restarted and un-restarted brokers, with offsets on data nodes drifting ahead of acknowledged offsets because Kafka kept accepting writes ZooKeeper had not acknowledged. Recovery required restarting all brokers and manually resetting offsets on data nodes. The underlying issue was attributed to Kafka bugs fixed in 0.10.2.1.
6h · Not disclosed affected · 33% of customers actively sending data; 4 of N Kafka partitions · cascading failure, data loss, network, queue or stream
- SEV-1
Stripe
Dec 17, 2015
API fully unavailable for 44 minutes after failures in an internal event queueing system cascaded to the Stripe API, Checkout, and Dashboard
On December 17, 2015, failures in Stripe's internal event queueing system caused a cascade that degraded the Stripe API for 9 minutes and then took it fully offline for an additional 44 minutes. During the outage window, merchants could not process payments via the API, Checkout, or the Dashboard. Stripe published an initial incident summary shortly after recovery while continuing to investigate the deeper root cause.
53m · Not disclosed affected · All Stripe API users globally; Stripe API, Checkout, and Dashboard were fully unavailable during the complete outage window · background degradation, cascading failure, full outage, queue or stream
- SEV-1
Stripe
Oct 8, 2015
API degraded 90 minutes after automated tooling misread an index modification as two separate operations, causing premature index deletion
On October 8, 2015, an application developer submitted a request to modify an existing database index in order to improve API performance. Stripe's internal schema-management library misinterpreted the modification as two separate operations (adding a new index and deleting the old one) rather than a single in-place update. A database operator processing the change queue executed the deletion first, removing a critical index from all replicas simultaneously. The missing index caused a set of API endpoints to slow down and time out, and the resulting worker-pool starvation cascaded into a broad API outage lasting roughly 90 minutes. Recovery required rebuilding the index and deploying a temporary code patch to bypass the missing index.
1h 39m · Not disclosed affected · Hundreds of thousands of Stripe merchants globally; approximately two-thirds of all API operations failed · configuration error, database, hotfix, human error
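The October 8, 2015 Stripe entry above turns on ordering: an index change was split into an add and a delete, and the delete ran first. A minimal sketch of a guard against that ordering might look like the following. This is not Stripe's tooling; `IndexOp`, `check_order`, `safe_order`, and the `replaces` field are hypothetical names invented for illustration, and the sketch assumes each "add" that supersedes an index is explicitly linked to the index it replaces.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IndexOp:
    """One queued schema operation (hypothetical shape, for illustration only)."""
    kind: str                        # "add" or "drop"
    index: str                       # name of the index being added or dropped
    replaces: Optional[str] = None   # on an "add": the old index it supersedes


def check_order(ops: List[IndexOp]) -> None:
    """Raise ValueError if an index is dropped before its replacement is added."""
    # Map each superseded index to the name of the index that replaces it.
    replacement_for = {
        op.replaces: op.index
        for op in ops
        if op.kind == "add" and op.replaces
    }
    added = set()
    for op in ops:
        if op.kind == "add":
            added.add(op.index)
        elif op.kind == "drop" and op.index in replacement_for:
            if replacement_for[op.index] not in added:
                raise ValueError(
                    f"drop of {op.index} is queued before "
                    f"add of {replacement_for[op.index]}"
                )


def safe_order(ops: List[IndexOp]) -> List[IndexOp]:
    """Reorder a queue so that all adds run before any other operation."""
    adds = [op for op in ops if op.kind == "add"]
    others = [op for op in ops if op.kind != "add"]
    return adds + others
```

For a queue holding a drop of `idx_old` followed by an add of `idx_new` with `replaces="idx_old"`, `check_order` rejects the queue as written and accepts the result of `safe_order`. A real system would also need to confirm that the new index has finished building on every replica before the drop runs, which this sketch does not model.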
Your own encores
Turn your incidents into structured post-mortems.
Aftermath runs the whole show: Live Stage captures the incident, Encore writes the post-mortem. Private and structured, like these, but yours.
Get early access