Open Playback · Free & MCP-native

Production failure memory
for your AI coding agent

Real incidents from GitHub, Cloudflare, Linear and others, structured and exposed over MCP. Plug it into Claude or Cursor and ask how production actually breaks.

all · abuse event · api gateway · auth · background degradation · background jobs · bgp misconfiguration · cache · cache stampede · capacity shortfall

Connect via MCP

Drop this into Claude Desktop or Cursor. Then ask your agent about cache stampedes, BGP failures, or DNS outages.

"open-playback": {
  "type": "http",
  "url": "https://open-mcp.aftermath.sh/"
}
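The snippet above is a fragment, so it has to live inside a larger client config. Here is a minimal sketch of merging it into an existing config object. The top-level key `mcpServers`, the helper name `add_open_playback`, and the `other-server` entry are assumptions for illustration, not something this page specifies. Check your client's documentation for the real key, the config file location, and whether it accepts `"type": "http"` entries.

```python
import json

# The entry exactly as shown on this page.
OPEN_PLAYBACK_ENTRY = {
    "type": "http",
    "url": "https://open-mcp.aftermath.sh/",
}


def add_open_playback(config: dict, servers_key: str = "mcpServers") -> dict:
    """Return a copy of `config` with the open-playback entry added.

    `servers_key` is an ASSUMPTION about the client's config schema;
    confirm it against your client's documentation.
    """
    merged = dict(config)
    servers = dict(merged.get(servers_key, {}))  # keep existing servers
    servers["open-playback"] = dict(OPEN_PLAYBACK_ENTRY)
    merged[servers_key] = servers
    return merged


if __name__ == "__main__":
    # Hypothetical existing config with one unrelated server.
    existing = {
        "mcpServers": {
            "other-server": {"type": "http", "url": "https://example.com/mcp"}
        }
    }
    print(json.dumps(add_open_playback(existing), indent=2))
```

The helper copies rather than mutates, so the original config stays untouched if you decide not to write the result back.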

4 encores

Sorted by date

GitHub
Apr 28, 2026
SEV-1
Actions Ubuntu hosted runners delayed by performance regression in VM reimage process
A performance regression in the VM reimage process for Actions hosted runners slowed the rate at which Standard Ubuntu 22 and Ubuntu 24 runners returned to the available pool, lowering effective runner capacity. About 8 percent of jobs on those runners were delayed past 5 minutes or failed during the window. Engineers mitigated by rolling back to a known-good image version, after which capacity recovered.
4h 28m · Not disclosed affected · Actions Standard Ubuntu 22 and Ubuntu 24 hosted runner jobs
capacity shortfall · ci/cd · container orchestration · customer-facing
GitHub
Apr 27, 2026
SEV-1
Elasticsearch overload from suspected botnet traffic degraded search across GitHub
GitHub's Elasticsearch cluster became overloaded due to load that engineers later attributed to suspected botnet activity. Search-backed UI surfaces, including Issues, Pull Requests, Projects, Actions workflow runs, and Packages, returned timed-out or empty results. Engineers identified the source of the additional load and disabled it, allowing the cluster to recover. After the cluster stabilized, GitHub had to reindex Pull Request data, with reindexing continuing into the following days.
6h 15m · Not disclosed affected · search-backed UI surfaces across GitHub: Issues, Pull Requests, Projects, Actions, Packages
abuse event · capacity shortfall · customer-facing · ddos or abuse traffic
Cloudflare
Aug 21, 2025
SEV-1
Single-customer traffic surge saturates Cloudflare-AWS us-east-1 peering
At 16:27 UTC, a single Cloudflare customer began pulling cached objects from AWS us-east-1 at a rate that doubled total Cloudflare-to-AWS traffic and saturated all direct peering links into us-east-1. AWS attempted to alleviate the congestion by withdrawing BGP advertisements over the saturated links, which rerouted traffic to an offsite peering switch that promptly saturated as well. Two pre-existing infrastructure conditions made the impact worse: one direct peering link was at half capacity due to a known failure, and the Data Center Interconnect to the offsite switch was due for a capacity upgrade. After three hours of manual traffic engineering between Cloudflare and AWS plus rate-limiting the customer, congestion fully resolved at 20:18 UTC.
3h 51m · Not disclosed affected · Customers with origins in AWS us-east-1, primarily traffic transiting Cloudflare's Ashburn (IAD) edge
bgp misconfiguration · capacity shortfall · customer-facing · network
Honeycomb
Oct 12, 2018
SEV-1
Two partial API outages from RDS CPU saturation and runaway cache-refresh queries
On October 4, 2018, Honeycomb suffered a partial API outage as RDS MySQL stalled at roughly 90% CPU. A second, less severe incident hit on October 11. In both cases the proximate trigger was a small event coinciding with a baseline CPU level that had drifted up to 30 to 40% over months, leaving no headroom. The deeper cause was cache-refresh queries (rate limit, sampling, blacklist) that fanned out into the thousands of concurrent queries instead of running one at a time. The team shipped a cache fix to coalesce refreshes and upgraded the RDS instance from m4.xlarge to m5.2xlarge during a brief maintenance window on October 12, leaving service stable.
7d 7h 38m · Not disclosed affected · API customers globally during two ~1-hour windows plus a 2-minute maintenance interruption
cache · capacity shortfall · database · thundering herd
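
The Honeycomb incident above traces the deeper cause to cache-refresh queries that fanned out into thousands of concurrent queries instead of running one at a time, and says the fix coalesced refreshes. One common way to coalesce is a "single-flight" guard: the first caller for a key runs the query and everyone else waits for its result. The sketch below is illustrative only and is not Honeycomb's actual fix. The class name `SingleFlight`, the key `"rate_limits"`, and the stand-in query are hypothetical.

```python
import threading
import time
from typing import Any, Callable, Dict, Optional


class _Call:
    """Bookkeeping for one in-flight refresh."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.value: Any = None
        self.error: Optional[BaseException] = None
        self.waiters = 0  # followers that joined this flight


class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
            else:
                call.waiters += 1

        if leader:
            try:
                call.value = fn()  # only the leader hits the database
            except Exception as exc:
                call.error = exc  # followers see the same failure
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()  # followers reuse the leader's result

        if call.error is not None:
            raise call.error
        return call.value


if __name__ == "__main__":
    flights = SingleFlight()
    queries = []

    def load_rate_limits() -> dict:
        queries.append(1)  # stand-in for an expensive database query
        time.sleep(0.2)
        return {"limit": 100}

    threads = [
        threading.Thread(target=flights.do, args=("rate_limits", load_rate_limits))
        for _ in range(50)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("database queries issued:", len(queries))
```

Without the guard, each of the 50 callers would issue its own query at once. With it, callers that arrive while a refresh is in flight share that refresh, which keeps a cache expiry from turning into a database load spike.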
