Open Playback · Free & MCP-native
Production failure memory
for your AI coding agent
Real incidents from GitHub, Cloudflare, Linear and others, structured and exposed over MCP. Plug it into Claude or Cursor and ask how production actually breaks.
32 encores
Sorted by date
- SEV-1
GitHub
Apr 28, 2026
Actions Ubuntu hosted runners delayed by performance regression in VM reimage process
A performance regression in the VM reimage process for Actions hosted runners slowed the rate at which Standard Ubuntu 22 and Ubuntu 24 runners returned to the available pool, lowering effective runner capacity. About 8 percent of jobs on those runners were delayed past 5 minutes or failed during the window. Engineers mitigated by rolling back to a known-good image version, after which capacity recovered.
4h 28m · Not disclosed affected · Actions Standard Ubuntu 22 and Ubuntu 24 hosted runner jobs · Tags: capacity shortfall, ci/cd, container orchestration, customer-facing
- SEV-1
GitHub
Apr 27, 2026
Elasticsearch overload from suspected botnet traffic degraded search across GitHub
GitHub's Elasticsearch cluster became overloaded due to load that engineers later attributed to suspected botnet activity. Search-backed UI surfaces, including Issues, Pull Requests, Projects, Actions workflow runs, and Packages, returned timed-out or empty results. Engineers identified the source of the additional load and disabled it, allowing the cluster to recover. After the cluster stabilized, GitHub had to reindex Pull Request data, with reindexing continuing into the following days.
6h 15m · Not disclosed affected · search-backed UI surfaces across GitHub: Issues, Pull Requests, Projects, Actions, Packages · Tags: abuse event, capacity shortfall, customer-facing, ddos or abuse traffic
- SEV-1
GitHub
Apr 23, 2026
DNS resolution failures in VA3 datacenter degraded multiple GitHub services
DNS resolution failures originating in GitHub's VA3 datacenter caused elevated error rates and degraded performance across several GitHub services, with the impact concentrated on Actions, Copilot, and Webhooks. Roughly 5 to 7 percent of overall traffic was affected during the window. Engineers identified the source of the resolution failures and applied a mitigation, after which dependent services recovered.
1h 24m · Not disclosed affected · approximately 5-7 percent of overall traffic; Actions, Copilot, Webhooks affected · Tags: configuration fix, customer-facing, degraded performance, dns
- SEV-1
GitHub
Apr 23, 2026
Billing service config change overwhelmed cache, degrading github.com, Codespaces, Packages, and Actions
A configuration change to an internal billing service caused a shared cache to be overwhelmed, leading to request timeouts and degraded experiences across github.com, Codespaces, Packages, Copilot, and Actions. Web requests returned 5xx errors, Codespaces create and resume requests failed at high rates, and a large fraction of Actions jobs were delayed or failed. The mitigation rolled back or corrected the billing configuration; Actions then drained its queued backlog.
48m · Not disclosed affected · github.com web, Codespaces, Packages, Copilot, Actions · Tags: cache, cache stampede, cascading failure, ci/cd
- SEV-1
GitHub
Apr 22, 2026
Copilot Chat and Cloud Agent unavailable after infrastructure config change broke database connectivity
An infrastructure configuration change broke database connectivity for Copilot Chat and Cloud Agent on github.com, leaving users unable to interact with either service. Copilot Memory in preview was also unavailable to agent sessions during the window. Engineers identified the change as the cause and restored connectivity, with github.com recovering first and remaining regional deployments restored incrementally.
4h 2m · Not disclosed affected · Copilot Chat and Cloud Agent users globally; staged regional recovery · Tags: auth, configuration error, configuration fix, customer-facing
- SEV-1
GitHub
Apr 13, 2026
Pages returned 500 errors after octodns automation deleted a backend DNS record
An automated DNS management tool, octodns, deleted a DNS record for a Pages backend storage host after its upstream data source intermittently failed to return the record. The automation treated the missing record as stale and removed it, causing Pages requests routed to that host to return HTTP 500 errors. Engineers re-created the deleted record to mitigate. The incident exposed the fact that the Pages frontend did not fail over to healthy backend hosts when one became unresolvable.
1h 37m · Not disclosed affected · GitHub Pages traffic globally · Tags: configuration error, customer-facing, data repair, dns
- SEV-1
GitHub
Mar 24, 2026
Teams Integration unable to deliver GitHub notifications during upstream provider outage
An outage at an upstream dependency caused HTTP 500 errors and connection resets on the path used to deliver GitHub event notifications to Microsoft Teams. The integration could not relay notifications during the impact window, with about 19 percent of integration installs affected. GitHub coordinated with the relevant service teams and the issue resolved when the upstream incident was mitigated.
3h 54m · Not disclosed affected · Microsoft Teams Integration installs (~19% failed deliveries) · Tags: customer-facing, delayed processing, dependency outage, notification
- SEV-1
Linear
Mar 24, 2026
Permission Filter Bypass from Variable Shadowing Bug
A performance optimization deployed to production contained a variable shadowing bug that caused team-level permission filters to be silently skipped. For approximately one hour, workspace members — including guests — could access data belonging to private teams within their own workspace via notification emails, client data sync, mobile sessions, API calls, and background tasks. No data was exposed outside any workspace, and no credentials were compromised. The change was reverted within the hour, all affected client sessions were cleared, and a post-incident audit found no evidence of malicious exploitation.
1h 3m · Not disclosed affected · Tags: api gateway, auth, configuration error, credential rotation
- SEV-1
Cloudflare
Feb 20, 2026
BYOIP prefixes withdrawn after Addressing API cleanup task misinterprets empty filter parameter
An automated cleanup sub-task in Cloudflare's Addressing API incorrectly queried the API with an empty pending_delete parameter, which the server interpreted as a request for all BYOIP prefixes. The task then began systematically deleting all matching prefixes and their service bindings, withdrawing about 1,100 BGP prefixes from the Internet. Engineers stopped the runaway sub-task within 50 minutes, but full restoration took over six hours because some customers had service bindings stripped from edge servers and required a global configuration rollout to repair.
5h 7m · Not disclosed affected · BYOIP customers globally (about 25% of Cloudflare's BYOIP prefixes) · Tags: bgp misconfiguration, configuration error, deploy, human error
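The Cloudflare entry above turns on a common failure shape: an empty filter value is read as "no filter", so a cleanup job receives every record instead of a small subset. The following is a minimal sketch of that shape and two possible guards, not Cloudflare's implementation. The in-memory `PREFIXES` store, the function names, and the `max_fraction` limit are all hypothetical; only the `pending_delete` parameter name comes from the incident summary.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for a prefix store (documentation address ranges).
PREFIXES: List[Dict] = [
    {"cidr": "192.0.2.0/24", "pending_delete": True},
    {"cidr": "198.51.100.0/24", "pending_delete": False},
    {"cidr": "203.0.113.0/24", "pending_delete": False},
]


def list_prefixes_lenient(pending_delete: Optional[str]) -> List[Dict]:
    """Failure shape: an empty or missing filter silently means 'no filter'."""
    if not pending_delete:  # "" and None both take this branch
        return list(PREFIXES)  # every prefix is returned
    want = pending_delete.lower() == "true"
    return [p for p in PREFIXES if p["pending_delete"] is want]


def list_prefixes_strict(pending_delete: Optional[str]) -> List[Dict]:
    """Guard 1: reject anything that is not an explicit 'true' or 'false'."""
    if pending_delete is None or pending_delete.lower() not in ("true", "false"):
        raise ValueError(
            f"pending_delete must be 'true' or 'false', got {pending_delete!r}"
        )
    want = pending_delete.lower() == "true"
    return [p for p in PREFIXES if p["pending_delete"] is want]


def cleanup(lister, pending_delete: Optional[str], max_fraction: float = 0.5) -> List[str]:
    """Guard 2: refuse to act when the candidate set is implausibly large."""
    candidates = lister(pending_delete)
    if len(candidates) > max_fraction * len(PREFIXES):
        raise RuntimeError("refusing to delete: candidate set exceeds the limit")
    # A real job would delete here; this sketch only reports what it would remove.
    return [p["cidr"] for p in candidates]
```

With the lenient lister, `cleanup(list_prefixes_lenient, "")` selects all three prefixes and is stopped only by the size limit; with the strict lister, the same call fails earlier at validation. Either guard alone would narrow the blast radius in this toy model; whether either applies to the real system is not something the incident summary states.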
Your own encores
Turn your incidents into structured post-mortems.
Aftermath runs the whole show: Live Stage captures the incident, Encore writes the post-mortem. Private and structured, like these, but yours.
Get early access