Open Playback · Free & MCP-native
Production failure memory
for your AI coding agent
Real incidents from GitHub, Cloudflare, Linear and others, structured and exposed over MCP. Plug it into Claude or Cursor and ask how production actually breaks.
4 encores
Sorted by date
- SEV-1
GitHub
Apr 28, 2026
Actions Ubuntu hosted runners delayed by performance regression in VM reimage process
A performance regression in the VM reimage process for Actions hosted runners slowed the rate at which Standard Ubuntu 22 and Ubuntu 24 runners returned to the available pool, lowering effective runner capacity. About 8 percent of jobs on those runners were delayed past 5 minutes or failed during the window. Engineers mitigated by rolling back to a known-good image version, after which capacity recovered.
4h 28m · Not disclosed affected · Actions Standard Ubuntu 22 and Ubuntu 24 hosted runner jobs · capacity shortfall · ci/cd · container orchestration · customer-facing
- SEV-1
GitHub
Apr 27, 2026
Elasticsearch overload from suspected botnet traffic degraded search across GitHub
GitHub's Elasticsearch cluster became overloaded due to load that engineers later attributed to suspected botnet activity. Search-backed UI surfaces, including Issues, Pull Requests, Projects, Actions workflow runs, and Packages, returned timed-out or empty results. Engineers identified the source of the additional load and disabled it, allowing the cluster to recover. After the cluster stabilized, GitHub had to reindex Pull Request data, with reindexing continuing into the following days.
6h 15m · Not disclosed affected · search-backed UI surfaces across GitHub: Issues, Pull Requests, Projects, Actions, Packages · abuse event · capacity shortfall · customer-facing · ddos or abuse traffic
- SEV-1
Cloudflare
Aug 21, 2025
Single-customer traffic surge saturates Cloudflare-AWS us-east-1 peering
At 16:27 UTC, a single Cloudflare customer began pulling cached objects from AWS us-east-1 at a rate that doubled total Cloudflare-to-AWS traffic and saturated all direct peering links into us-east-1. AWS attempted to alleviate the congestion by withdrawing BGP advertisements over the saturated links, which rerouted traffic to an offsite peering switch that promptly saturated as well. Two pre-existing infrastructure conditions made the impact worse: one direct peering link was at half capacity due to a known failure, and the Data Center Interconnect to the offsite switch was due for a capacity upgrade. After three hours of manual traffic engineering between Cloudflare and AWS plus rate-limiting the customer, congestion fully resolved at 20:18 UTC.
3h 51m · Not disclosed affected · Customers with origins in AWS us-east-1, primarily traffic transiting Cloudflare's Ashburn (IAD) edge · bgp misconfiguration · capacity shortfall · customer-facing · network
- SEV-1
Honeycomb
Oct 12, 2018
Two partial API outages from RDS CPU saturation and runaway cache-refresh queries
On October 4, 2018, Honeycomb suffered a partial API outage as RDS MySQL stalled at roughly 90% CPU. A second, less severe incident hit on October 11. In both cases the proximate trigger was a small event coinciding with a baseline CPU level that had drifted up to 30 to 40% over months, leaving no headroom. The deeper cause was cache-refresh queries (rate limit, sampling, blacklist) that fanned out into the thousands of concurrent queries instead of running one at a time. The team shipped a cache fix to coalesce refreshes and upgraded the RDS instance from m4.xlarge to m5.2xlarge during a brief maintenance window on October 12, leaving service stable.
7d 7h 38m · Not disclosed affected · API customers globally during two ~1-hour windows plus a 2-minute maintenance interruption · cache · capacity shortfall · database · thundering herd
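The Honeycomb entry above turns on one idea: when a cached value expires, many concurrent callers should share a single refresh query rather than each issuing their own. Below is a minimal sketch of that coalescing pattern in Python. It is not Honeycomb's code; `CoalescingCache`, `slow_query`, `demo`, and the key `"team-1"` are hypothetical names chosen for illustration, and the TTL and sleep values are arbitrary.

```python
import threading
import time
from typing import Callable, Dict, Generic, Tuple, TypeVar

T = TypeVar("T")


class CoalescingCache(Generic[T]):
    """Cache in which concurrent misses for one key share a single load."""

    def __init__(self, loader: Callable[[str], T], ttl_seconds: float = 30.0):
        self._loader = loader            # hypothetical stand-in for the DB query
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._values: Dict[str, Tuple[float, T]] = {}      # key -> (expiry, value)
        self._inflight: Dict[str, threading.Event] = {}    # key -> refresh in progress

    def get(self, key: str) -> T:
        while True:
            with self._lock:
                entry = self._values.get(key)
                if entry is not None and entry[0] > time.monotonic():
                    return entry[1]                  # fresh hit, no query issued
                event = self._inflight.get(key)
                leader = event is None
                if leader:
                    # First caller to see the miss becomes the only refresher.
                    event = threading.Event()
                    self._inflight[key] = event
            if not leader:
                event.wait()                         # block until the leader is done
                continue                             # then re-check the cache
            try:
                value = self._loader(key)            # exactly one query per refresh
                with self._lock:
                    self._values[key] = (time.monotonic() + self._ttl, value)
                return value
            finally:
                # Always release waiters, even if the loader raised.
                with self._lock:
                    self._inflight.pop(key, None)
                event.set()


def demo(num_callers: int = 50) -> int:
    """Fire num_callers concurrent lookups; return how many queries ran."""
    calls = []

    def slow_query(key: str) -> str:
        calls.append(key)
        time.sleep(0.05)                             # simulate a slow database read
        return f"limits-for-{key}"

    cache = CoalescingCache(slow_query)
    threads = [
        threading.Thread(target=cache.get, args=("team-1",))
        for _ in range(num_callers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls)
```

With this sketch, `demo(50)` issues one backing query for fifty concurrent callers. Without the in-flight map, every caller that observed the expired entry would run its own query, which is the fan-out shape the incident describes. If the loader raises, the leader sees the exception and one of the waiters retries as the new leader.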
Your own encores
Turn your incidents into structured post-mortems.
Aftermath runs the whole show: Live Stage captures the incident, Encore writes the post-mortem. Private and structured, like these, but yours.
Get early access