Open Playback · Free & MCP-native
Production failure memory
for your AI coding agent
Real incidents from GitHub, Cloudflare, Linear and others, structured and exposed over MCP. Plug it into Claude or Cursor and ask how production actually breaks.
32 encores
Sorted by date
- SEV-1
Cloudflare
Jan 22, 2026
Overly permissive routing policy causes IPv6 route leak from Miami router
A change pushed via Cloudflare's policy automation platform was meant to stop a Miami router from advertising prefixes for a Bogota data center after recent infrastructure upgrades made that path unnecessary. Removing the prefix-list reference left the export policy matching by route-type internal alone, which JunOS evaluates broadly enough to include all internal BGP routes. As a result, IPv6 prefixes Cloudflare redistributes internally were exported to external peers and providers in Miami, creating a Type 3/Type 4 route leak in the sense of RFC 7908.
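The effect of dropping the prefix-list condition can be sketched as a toy policy evaluator. This is a simplified model, not JunOS behavior or Cloudflare's configuration: `Route`, `term_matches`, and every prefix below (taken from the 2001:db8::/32 documentation range) are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Optional


@dataclass(frozen=True)
class Route:
    prefix: str
    route_type: str  # "internal" or "external"


def term_matches(route: Route, route_type: str, prefix_list: Optional[list]) -> bool:
    """Return True when the route satisfies every condition present in the term."""
    if route.route_type != route_type:
        return False
    if prefix_list is None:
        # No prefix-list condition left: the route type is the only filter.
        return True
    net = ip_network(route.prefix)
    return any(net.subnet_of(ip_network(p)) for p in prefix_list)


# Hypothetical prefixes, for illustration only.
BOGOTA_PREFIXES = ["2001:db8:b0::/48"]
routes = [
    Route("2001:db8:b0::/48", "internal"),
    Route("2001:db8:1::/48", "internal"),
    Route("2001:db8:2::/48", "internal"),
    Route("2001:db8:ff::/48", "external"),
]

# Before the change: route type AND prefix list must both match.
before = [r.prefix for r in routes if term_matches(r, "internal", BOGOTA_PREFIXES)]
# After the change: only the route type is checked.
after = [r.prefix for r in routes if term_matches(r, "internal", None)]

print(before)
print(after)
```

In this model, removing one condition widens the match from a single site's prefixes to every internal route, which is the shape of the leak described above.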
25m · Not disclosed affected · IPv6 traffic transiting Cloudflare's Miami edge and external networks whose prefixes were leaked
bgp misconfiguration · configuration change · configuration error · network
- SEV-1
Cloudflare
Jan 8, 2026
1.1.1.1 cache change reorders CNAME records and breaks legacy DNS clients
A memory-optimization change to the 1.1.1.1 resolver's cache merge logic switched the order of records returned in partially expired CNAME chains, causing CNAME records to appear after their target A records instead of before. While modern resolvers tolerate either order, glibc's getaddrinfo and certain Cisco Catalyst switch DNS processes use sequential parsing that requires CNAMEs to be listed first, leading to empty answers and, for the Cisco devices, spontaneous reboot loops. The change had been deployed on January 7 and reached 90 percent of servers before being reverted on January 8.
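Why record order matters to a sequential parser can be shown with a small model. This is a sketch of the general single-pass technique, not glibc's or Cisco's actual code; `sequential_resolve`, the hostnames, and the address (from the 192.0.2.0/24 documentation range) are assumptions made for illustration.

```python
from typing import List, Tuple

Record = Tuple[str, str, str]  # (owner name, record type, value)


def sequential_resolve(qname: str, answer: List[Record]) -> List[str]:
    """Single forward pass: follow CNAMEs as they appear, collect matching A records."""
    expected = qname
    addresses = []
    for owner, rtype, value in answer:
        if owner != expected:
            continue  # record is not for the name currently being followed
        if rtype == "CNAME":
            expected = value  # from here on, look for the alias target
        elif rtype == "A":
            addresses.append(value)
    return addresses


cname_first = [
    ("www.example.com", "CNAME", "cdn.example.net"),
    ("cdn.example.net", "A", "192.0.2.10"),
]
a_first = list(reversed(cname_first))

print(sequential_resolve("www.example.com", cname_first))
print(sequential_resolve("www.example.com", a_first))
```

With the CNAME listed first, the parser learns the target name before it reaches the A record. With the A record first, the parser skips it as unrelated and never looks back, so the answer comes out empty.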
2h 15m · Not disclosed affected · Subset of 1.1.1.1 users running glibc-based stub resolvers and certain Cisco switch DNS clients
cache · configuration change · dns
- SEV-0
Honeycomb
Dec 17, 2025
DR exercise cleanup destroys Kafka brokers, leaves partitions leaderless, triggers two-week EU evacuation
On December 5, 2025, Honeycomb ran an annual disaster recovery exercise in the EU production region, simulating an availability-zone failure. The cleanup phase that follows AZ-failure tests calls for deliberately destroying Kafka brokers. In the production cluster's larger topology, this destroyed brokers across multiple availability zones in a combination that left several partitions leaderless and damaged several internal metadata topics. Recovery ran until December 17 and ended with a full evacuation to a brand-new Kafka cluster, which required deep code changes in Retriever to support offset resets and a coordinated migration involving more than half a dozen teams. Activity Log data from December 5 to December 9 was lost.
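How losing brokers across zones can leave a partition leaderless can be sketched with a replica-placement check. This is a simplified model, not Honeycomb's topology or tooling: the broker names, zone layout, and `leaderless_partitions` helper are hypothetical, and the model assumes every replica is an eligible leader.

```python
def leaderless_partitions(assignment: dict, destroyed: set) -> list:
    """Partitions whose replicas all sat on destroyed brokers have no leader candidate."""
    return sorted(
        partition
        for partition, replicas in assignment.items()
        if set(replicas) <= destroyed
    )


# Hypothetical layout: b1, b2 in zone A; b3, b4 in zone B; b5, b6 in zone C.
# Each partition keeps one replica per zone.
assignment = {
    "p0": ["b1", "b3", "b5"],
    "p1": ["b2", "b4", "b6"],
    "p2": ["b1", "b4", "b6"],
    "p3": ["b2", "b3", "b5"],
}

single_zone_loss = leaderless_partitions(assignment, {"b1", "b2"})
cross_zone_loss = leaderless_partitions(assignment, {"b1", "b3", "b5"})

print(single_zone_loss)
print(cross_zone_loss)
```

Under these assumptions, zone-aware placement survives the loss of a whole zone, but destroying one broker in each zone can remove every replica of the same partition.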
12d · Not disclosed affected · EU region: full event ingestion downtime for several hours, then degraded mode (Activity Log only) for two weeks
cascading failure · data loss · failover · queue or stream
- SEV-1
Cloudflare
Dec 5, 2025
Killswitch on a never-tested rule type triggers nil-value Lua exception in legacy proxy during React vulnerability mitigation
While rolling out a buffer-size increase to mitigate the React Server Components remote code execution vulnerability (CVE-2025-55182), engineers discovered that an internal WAF testing tool did not support the larger 1MB buffer. Disabling the testing tool through Cloudflare's global configuration system propagated network-wide within seconds and triggered a latent bug in the Lua-based FL1 proxy: a killswitch had never been applied to an 'execute' rule action before, and the post-evaluation code assumed the resulting object would always exist. The nil-value lookup raised an exception and caused FL1 to return HTTP 500 errors for all affected customers for 25 minutes.
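The latent bug pattern, post-processing that assumes an object exists even when the rule that creates it was skipped, can be modeled in a few lines. This is an illustrative Python analogue, not the FL1 Lua code: the function names, the `killswitched` flag, and the result layout are all assumptions.

```python
def evaluate_rule(rule: dict) -> dict:
    """Evaluate one rule; a killswitched rule is skipped, so no sub-result is produced."""
    result = {"action": rule["action"]}
    if rule["action"] == "execute" and not rule.get("killswitched", False):
        result["execute"] = {"results": ["sub-ruleset evaluated"]}
    return result


def post_process_unsafe(result: dict) -> list:
    # Assumes every "execute" action carries an "execute" object.
    if result["action"] == "execute":
        # .get() returns None for a skipped rule; indexing None raises TypeError,
        # the Python counterpart of indexing a nil value in Lua.
        return result.get("execute")["results"]
    return []


def post_process_safe(result: dict) -> list:
    if result["action"] == "execute":
        execute = result.get("execute")
        return execute["results"] if execute is not None else []
    return []


normal = evaluate_rule({"action": "execute"})
skipped = evaluate_rule({"action": "execute", "killswitched": True})

print(post_process_unsafe(normal))
print(post_process_safe(skipped))
```

The unsafe path works for every input it has ever seen and fails only on the first killswitched "execute" rule, which is why the bug could stay hidden until the combination occurred in production.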
25m · Not disclosed affected · About 28% of HTTP traffic served by Cloudflare globally; specifically customers on FL1 with the Cloudflare Managed Ruleset enabled
configuration change · configuration error · partial outage · regression from deploy
- SEV-0
Cloudflare
Nov 18, 2025
ClickHouse permissions change doubles Bot Management feature file size and panics the core proxy
A gradual ClickHouse permissions improvement made user accounts able to see metadata for the underlying r0 schema in addition to the default schema. A long-standing query in the Bot Management feature-file generator did not filter by database name, so it began returning duplicate rows and producing a feature file roughly twice its expected size. The new FL2 proxy (Rust) preallocates memory for a hard cap of 200 features, so when the oversized file arrived, the bot module panicked and the proxy returned HTTP 5xx errors for any traffic depending on bot scoring. The legacy FL proxy did not panic but emitted bot scores of zero, causing false positives for any customer using bot-score-based blocking rules.
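The interaction between an unfiltered metadata query and a hard feature cap can be sketched as follows. This is a toy model, not Cloudflare's query or proxy code: `list_columns`, `load_features`, and the count of 120 features are hypothetical. Only the cap of 200 and the `default` and `r0` schema names come from the summary above.

```python
MAX_FEATURES = 200  # hard cap stated in the incident summary


def list_columns(metadata, database=None):
    """Return column names, optionally restricted to one database."""
    return [name for db, name in metadata if database is None or db == database]


def load_features(names, limit=MAX_FEATURES):
    """Reject a feature list that exceeds the preallocated capacity."""
    if len(names) > limit:
        raise ValueError(f"{len(names)} features exceeds the limit of {limit}")
    return list(names)


# Hypothetical feature count; the same columns become visible under both schemas
# once the account can see r0 metadata as well.
columns = [f"feature_{i}" for i in range(120)]
metadata = [("default", c) for c in columns] + [("r0", c) for c in columns]

filtered = list_columns(metadata, database="default")
unfiltered = list_columns(metadata)

print(len(filtered), len(unfiltered))
print(len(load_features(filtered)))
```

Calling `load_features(unfiltered)` raises `ValueError` in this sketch. Here the error is recoverable; the summary above describes a module that panicked instead, which turned an oversized input into failed requests.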
5h 38m · Not disclosed affected · Global: majority of core HTTP traffic through Cloudflare's network, plus dependent services (Workers KV, Access, Turnstile, Dashboard, Email Security)
configuration change · database · full outage · thundering herd
- SEV-1
Cloudflare
Sep 12, 2025
useEffect dependency bug overwhelms Tenant Service API and breaks dashboard logins
A new dashboard release at 16:32 UTC included a React useEffect with a non-stable object reference in its dependency array, causing the hook to fire on every render and hammer the /organizations endpoint with retries. A coincident Tenant Service deployment at 17:50 UTC began at exactly the wrong moment, and the combined load overwhelmed the service, which sits in the API authorization path. Authorization failures returned 5xx codes from many APIs and left the Cloudflare Dashboard unavailable. A subsequent attempt to fix the Tenant Service made things worse and was reverted, after which dashboard availability fully recovered.
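Why a non-stable object in a dependency array re-fires an effect on every render can be modeled with an identity comparison. React compares dependencies with `Object.is`, which for objects means reference identity; the Python sketch below mimics that with `is` and is only valid for object dependencies. `EffectRunner` and the `org` payload are hypothetical, and no real React code is shown.

```python
class EffectRunner:
    """Runs an effect only when a dependency changed identity since the last render."""

    def __init__(self):
        self._prev = None
        self.runs = 0

    def render(self, deps):
        changed = (
            self._prev is None
            or len(deps) != len(self._prev)
            or any(new is not old for new, old in zip(deps, self._prev))
        )
        if changed:
            self.runs += 1  # stands in for the API call made by the effect
        self._prev = deps


# A new object is built on every render, so identity always differs.
unstable = EffectRunner()
for _ in range(5):
    unstable.render([{"org": "acme"}])

# The same object is reused across renders, so the effect runs once.
stable = EffectRunner()
query = {"org": "acme"}
for _ in range(5):
    stable.render([query])

print(unstable.runs, stable.runs)
```

The two dictionaries hold equal contents, but only the reused one keeps its identity between renders, so only that version limits the effect to a single run.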
1h 15m · Not disclosed affected · Cloudflare Dashboard and APIs that depend on Tenant Service authorization (control plane only, data plane unaffected)
deploy · frontend · partial outage · rate limit misconfigured
- SEV-1
Cloudflare
Aug 21, 2025
Single-customer traffic surge saturates Cloudflare-AWS us-east-1 peering
At 16:27 UTC, a single Cloudflare customer began pulling cached objects from AWS us-east-1 at a rate that doubled total Cloudflare-to-AWS traffic and saturated all direct peering links into us-east-1. AWS attempted to alleviate the congestion by withdrawing BGP advertisements over the saturated links, which rerouted traffic to an offsite peering switch that promptly saturated as well. Two pre-existing infrastructure conditions made the impact worse: one direct peering link was at half capacity due to a known failure, and the Data Center Interconnect to the offsite switch was due for a capacity upgrade. After three hours of manual traffic engineering between Cloudflare and AWS plus rate-limiting the customer, congestion fully resolved at 20:18 UTC.
3h 51m · Not disclosed affected · Customers with origins in AWS us-east-1, primarily traffic transiting Cloudflare's Ashburn (IAD) edge
bgp misconfiguration · capacity shortfall · customer-facing · network
- SEV-1
Cloudflare
Aug 23, 2025
Salesforce support-case data exfiltrated via compromised Salesloft Drift OAuth token
An advanced threat actor Cloudflare tracks as GRUB1 (overlapping with Google's UNC6395) exploited the Salesloft Drift integration with Salesforce by using stolen OAuth credentials to access Cloudflare's Salesforce tenant. The actor performed reconnaissance starting August 9, accessed the tenant on August 12, and used Salesforce's Bulk API 2.0 on August 17 to exfiltrate the text of customer support cases in roughly three minutes. The attacker then deleted the Bulk API job to hide evidence. The breach was part of a broader supply-chain campaign affecting hundreds of Salesloft customers; Cloudflare disabled Drift, rotated 104 customer-issued API tokens, and notified affected customers.
14d 9m · Not disclosed affected · Cloudflare's Salesforce tenant; case-object data including customer contact info and support correspondence (no Cloudflare infrastructure or services)
auth · credential rotation · data exposure · supply chain
- SEV-1
Cloudflare
Jun 12, 2025
Third-party storage outage takes Workers KV offline and cascades through Access, WARP, and Dashboard
Workers KV's central storage backend, partially backed by a third-party cloud provider, suffered an outage that took Workers KV offline. Because Workers KV is a foundational dependency for many Cloudflare products, the failure cascaded across the platform: Access failed 100% of identity-based logins, WARP could not register new clients, the Dashboard could not authenticate via Turnstile or OIDC, and a long list of products including Workers AI, Stream, Pages, D1, and Durable Objects experienced significant errors. Engineers worked in parallel on bypasses and on accelerating an already-planned migration of KV onto more redundant infrastructure. Service recovered when the third-party storage came back online at 20:23 UTC.
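The cascade can be pictured as a walk over a reverse dependency graph. The sketch below is illustrative only: the edges are a simplified reading of the summary above rather than Cloudflare's real service graph, and `blast_radius`, `DEPENDS_ON`, and the "Unrelated service" placeholder are hypothetical.

```python
from collections import deque


def blast_radius(depends_on: dict, failed: str) -> set:
    """Return every service that directly or transitively depends on the failed one."""
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


# Simplified, assumed edges: service -> services it depends on.
DEPENDS_ON = {
    "Access": ["Workers KV"],
    "WARP": ["Workers KV"],
    "Turnstile": ["Workers KV"],
    "Dashboard": ["Turnstile"],
    "Unrelated service": [],
}

impacted = blast_radius(DEPENDS_ON, "Workers KV")
print(sorted(impacted))
```

In this model the Dashboard never talks to Workers KV directly, yet it still lands in the impacted set through Turnstile, which is how a single storage failure reaches products several hops away.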
2h 36m · Not disclosed affected · Global: Workers KV and every Cloudflare service that depended on it (Access, WARP, Gateway, Dashboard, Images, Stream, Workers AI, Turnstile, Pages, D1, Durable Objects, Queues, AI Gateway, and more)
cascading failure · dependency outage · object storage · supply chain
Your own encores
Turn your incidents into structured post-mortems.
Aftermath runs the whole show: Live Stage captures the incident, Encore writes the post-mortem. Private and structured, like these, but yours.