Open Playback · Free & MCP-native
Production failure memory
for your AI coding agent
Real incidents from GitHub, Cloudflare, Linear and others, structured and exposed over MCP. Plug it into Claude or Cursor and ask how production actually breaks.
6 encores
Sorted by date
- SEV-1
GitHub
Apr 23, 2026
Billing service config change overwhelmed cache, degrading github.com, Codespaces, Packages, Copilot, and Actions
A configuration change to an internal billing service caused a shared cache to be overwhelmed, leading to request timeouts and degraded experiences across github.com, Codespaces, Packages, Copilot, and Actions. Web requests returned 5xx errors, Codespaces create and resume requests failed at high rates, and a large fraction of Actions jobs were delayed or failed. The mitigation rolled back or corrected the billing configuration; Actions then drained its queued backlog.
48m · Not disclosed affected · github.com web, Codespaces, Packages, Copilot, Actions
cache · cache stampede · cascading failure · ci/cd
- SEV-1
Cloudflare
Jan 8, 2026
1.1.1.1 cache change reorders CNAME records and breaks legacy DNS clients
A memory-optimization change to the 1.1.1.1 resolver's cache merge logic switched the order of records returned in partially expired CNAME chains, causing CNAME records to appear after their target A records instead of before. While modern resolvers tolerate either order, glibc's getaddrinfo and certain Cisco Catalyst switch DNS processes use sequential parsing that requires CNAMEs to be listed first, leading to empty answers and, for the Cisco devices, spontaneous reboot loops. The change had been deployed on January 7 and reached 90 percent of servers before being reverted on January 8.
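The ordering dependency described above can be illustrated with a minimal sketch. This is a simplified model of a single-pass answer parser, not glibc's or Cisco's actual code; the function names, the record tuple layout, and the example hostnames and address are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (owner name, record type, value); a simplified stand-in for a DNS answer record
Record = Tuple[str, str, str]


def sequential_resolve(qname: str, answers: List[Record]) -> List[str]:
    """Single pass over the answer section, following CNAMEs in arrival order."""
    expected = qname
    addresses: List[str] = []
    for name, rtype, value in answers:
        if name != expected:
            continue  # record is not for the name currently being followed
        if rtype == "CNAME":
            expected = value  # from here on, look for the alias target
        elif rtype == "A":
            addresses.append(value)
    return addresses


def order_tolerant_resolve(qname: str, answers: List[Record]) -> List[str]:
    """Builds the CNAME chain first, so record order does not matter."""
    cnames: Dict[str, str] = {n: v for n, t, v in answers if t == "CNAME"}
    target = qname
    seen: Set[str] = set()
    while target in cnames and target not in seen:  # guard against CNAME loops
        seen.add(target)
        target = cnames[target]
    return [v for n, t, v in answers if t == "A" and n == target]


cname_first: List[Record] = [
    ("www.example.com", "CNAME", "edge.example.net"),
    ("edge.example.net", "A", "192.0.2.10"),
]
a_first: List[Record] = list(reversed(cname_first))
```

In this model, `sequential_resolve` finds the address only for `cname_first`; with `a_first` it skips the A record before it has learned the alias target and returns an empty list, which mirrors the empty answers in the incident.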
2h 15m · Not disclosed affected · Subset of 1.1.1.1 users running glibc-based stub resolvers and certain Cisco switch DNS clients
cache · configuration change · dns
- SEV-1
Honeycomb
Apr 16, 2025
Four API hosts reject 10% of event traffic for 1.5 hours, undetected for a week
On April 16, 2025, four of Honeycomb's API servers rejected all traffic sent to them with 500 or 401 HTTP responses for roughly 1.5 hours, dropping about 10% of event traffic. The incident went unnoticed until April 23, when a customer reported missing data. Because Honeycomb retains events for 60 days and includes deployment version as a column on every event, the team was able to look back over the intervening week and reconstruct the failure: an unexpected deployment-version mismatch on those hosts, combined with a caching bug that swallowed a database error and a cache that returned a null value as success. The team forensically diagnosed the issue using the same observability data customers rely on.
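The failure pattern named above, a cache that swallows a backend error and then serves a null value as if it were a successful lookup, can be sketched as follows. This is an illustration of the general pattern only, not Honeycomb's code; every class, function, and key name here is hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple


class BackendError(Exception):
    """Hypothetical stand-in for a database failure."""


class SwallowingCache:
    """Buggy pattern: a failed load is stored and served as a normal value."""

    def __init__(self, load: Callable[[str], str]) -> None:
        self._load = load
        self._entries: Dict[str, Optional[str]] = {}

    def get(self, key: str) -> Optional[str]:
        if key not in self._entries:
            try:
                self._entries[key] = self._load(key)
            except BackendError:
                # Bug: the error disappears and None is cached as a "result".
                self._entries[key] = None
        return self._entries[key]


class StrictCache:
    """Safer pattern: errors propagate and nothing is cached on failure."""

    def __init__(self, load: Callable[[str], str]) -> None:
        self._load = load
        self._entries: Dict[str, str] = {}

    def get(self, key: str) -> str:
        if key not in self._entries:
            self._entries[key] = self._load(key)  # may raise BackendError
        return self._entries[key]


def make_flaky_loader() -> Tuple[Dict[str, bool], Callable[[str], str]]:
    """Returns a health switch and a loader that fails while unhealthy."""
    state = {"healthy": False}

    def load(key: str) -> str:
        if not state["healthy"]:
            raise BackendError("database unavailable")
        return f"team-for-{key}"

    return state, load
```

With `SwallowingCache`, a lookup made while the backend is unhealthy keeps returning `None` even after the backend recovers, so callers cannot tell a missing record from a failed query. `StrictCache` surfaces the error and succeeds on the next attempt.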
1h 30m · Not disclosed affected · events to four API hosts globally; about 10% of event traffic for ~1.5 hours
cache · data loss · missing monitoring · regression from deploy
- SEV-0
Honeycomb
Jul 25, 2023
Total outage: feature-flag bug starves schema cache, MySQL deadlocks, all of Honeycomb goes down for 68 minutes
Late on July 24, 2023, engineers performed a routine cluster switch on Retriever to avoid a subtle bug. Hours later, the Shepherd SLO began burning slowly, but the issue was deemed minor and deferred to morning. The cluster switch had silently stopped the writes that fed the schema cache, and a feature flag bug meant flipping the flag back never re-enabled writes on hosts already told to stop. While engineers prepared a Retriever restart command, MySQL seized up under unexpected read pressure, hit a rare internal deadlock, ran out of connections, and brought down all of Honeycomb. Recovery required circuit-breaking ingest, failing over the database, and manually warming the schema cache.
1h 8m · Not disclosed affected · global; all user-facing components down; ingest gap permanently visible
cache · cascading failure · data loss · database
- SEV-1
Honeycomb
Sep 8, 2022
Metastable Shepherd cache lock contention takes down ingest for over eight hours
On September 8, 2022, Shepherd, Honeycomb's ingest service, entered a metastable failure loop characterized by repeating shark-fin latency patterns. Each Shepherd worker maintained an in-memory cache of dataset schemas guarded by a table-wide lock, and a missing entry being backfilled could cause unrelated requests to pile up. OOM crashes propagated to Refinery, which in turn was incorrectly suspected of triggering Shepherd's failure. The team's usual workarounds (vertically scaling Shepherds, scaling the database) did not stabilize the system, and after roughly eight and a half hours of intermittent disruption involving about ten engineers, a cache pre-fill change shipped under pressure restored service.
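To show why a table-wide lock lets one slow backfill stall unrelated requests, here is a minimal sketch of the alternative: per-key locking. Note that this is a swapped-in technique for illustration; the fix the description reports was a cache pre-fill change, and nothing here reflects Shepherd's actual code. The class name and the `load` callback are hypothetical.

```python
import threading
from typing import Callable, Dict


class PerKeySchemaCache:
    """In-memory cache where a slow backfill blocks only callers of the same key."""

    def __init__(self, load: Callable[[str], str]) -> None:
        self._load = load
        self._entries: Dict[str, str] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        # Held only for short dictionary operations, never during a load.
        self._table_lock = threading.Lock()

    def get(self, key: str) -> str:
        with self._table_lock:
            if key in self._entries:
                return self._entries[key]
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        with key_lock:
            # Re-check: another caller may have finished the backfill already.
            with self._table_lock:
                if key in self._entries:
                    return self._entries[key]
            value = self._load(key)  # slow path runs outside the table lock
            with self._table_lock:
                self._entries[key] = value
            return value
```

If `get` instead held a single lock across `self._load(key)`, every request for any dataset would queue behind one missing entry, which is the pile-up the description refers to.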
9h · Not disclosed affected · most customers sending data during the incident window experienced at least partial impact
cache · cascading failure · lock contention · out of memory
- SEV-1
Honeycomb
Oct 12, 2018
Two partial API outages from RDS CPU saturation and runaway cache-refresh queries
On October 4, 2018, Honeycomb suffered a partial API outage as RDS MySQL stalled at roughly 90% CPU. A second, less severe incident hit on October 11. In both cases the proximate trigger was a small event coinciding with a baseline CPU level that had drifted upward to 30 to 40% over months, leaving no headroom. The deeper cause was cache-refresh queries (rate limit, sampling, blacklist) that fanned out into thousands of concurrent queries instead of running one at a time. The team shipped a cache fix to coalesce refreshes and upgraded the RDS instance from m4.xlarge to m5.2xlarge during a brief maintenance window on October 12, leaving service stable.
7d 7h 38m · Not disclosed affected · API customers globally during two ~1-hour windows plus a 2-minute maintenance interruption
cache · capacity shortfall · database · thundering herd
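Coalescing cache refreshes, as in the 2018 Honeycomb incident above, is often done with a "single-flight" pattern: concurrent callers asking for the same key share one backend query. The sketch below is one possible shape under that assumption, not the fix Honeycomb shipped; `CoalescingRefresher` and the `fetch` callback are hypothetical names.

```python
import threading
from concurrent.futures import Future
from typing import Callable, Dict


class CoalescingRefresher:
    """Runs at most one fetch per key at a time; concurrent callers share its result."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._lock = threading.Lock()
        self._in_flight: Dict[str, Future] = {}

    def refresh(self, key: str) -> str:
        with self._lock:
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._in_flight[key] = future
        if leader:
            try:
                future.set_result(self._fetch(key))  # the only backend query
            except Exception as exc:
                future.set_exception(exc)  # followers see the same error
            finally:
                with self._lock:
                    del self._in_flight[key]
        # Followers block here until the leader finishes; errors are re-raised.
        return future.result()
```

Callers that arrive while a refresh for `"rate_limit"` is already running wait for that result instead of issuing their own query, so a burst of expirations produces one query per key rather than one per caller.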
Your own encores
Turn your incidents into structured post-mortems.
Aftermath runs the whole show: Live Stage captures the incident, Encore writes the post-mortem. Private and structured, like these, but yours.
Get early access