Open Playback · Free & MCP-native
Production failure memory for your AI coding agent
Real incidents from GitHub, Cloudflare, Linear and others, structured and exposed over MCP. Plug it into Claude or Cursor and ask how production actually breaks.
32 encores
Sorted by date
- SEV-1
Honeycomb
Apr 16, 2025
Four API hosts reject 10% of event traffic for 1.5 hours, undetected for a week
On April 23, 2025, Honeycomb discovered an incident that had occurred approximately one week earlier. On April 16, four of Honeycomb's API servers rejected all traffic with 500 or 401 HTTP responses for roughly 1.5 hours, dropping about 10% of event traffic. The incident went unnoticed for around a week until a customer reported missing data. Because Honeycomb retains events for 60 days and includes deployment version as a column on every event, the team was able to look back over a week and reconstruct the failure: an unexpected deployment-version mismatch on those hosts, combined with a caching bug that swallowed a database error and a cache that returned a null value as success. The team forensically diagnosed the issue using the same observability data customers rely on.
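The cache failure mode described above (a swallowed database error, and a null stored as a successful result) can be sketched as a read-through cache that refuses to do either. This is a minimal illustration, not Honeycomb's code: `ReadThroughCache`, `DatabaseError`, and the `load` callable are hypothetical names.

```python
class DatabaseError(Exception):
    """Raised by the hypothetical backing store when a lookup fails."""


class ReadThroughCache:
    """Read-through cache that never stores a failure or a null as a hit."""

    def __init__(self, load):
        self._load = load      # hypothetical callable: key -> value, may raise DatabaseError
        self._entries = {}

    def get(self, key):
        if key in self._entries:
            return self._entries[key]
        # Do not catch DatabaseError here: a failed load must surface to the caller.
        value = self._load(key)
        if value is None:
            # A null means "not found"; it is not cached as a success.
            raise KeyError(key)
        self._entries[key] = value
        return value
```

Because nothing is stored on failure, the next request retries the load instead of serving a poisoned entry.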
1h 30m · affected: not disclosed · events to four API hosts globally; about 10% of event traffic for ~1.5 hours · cache, data loss, missing monitoring, regression from deploy
- SEV-1
Cloudflare
Mar 21, 2025
Wrangler env flag omitted during R2 credential rotation deploys new keys to dev Worker, breaks production
During a routine R2 credential rotation, an engineer ran `wrangler secret put` and `wrangler deploy` without the `--env production` flag. Without that flag, both commands target the default environment, so the new credentials landed on a non-production R2 Gateway Worker while the production Worker kept using the old credentials. When the old credentials were deleted from storage as the final step of the rotation, the production Gateway could no longer authenticate. The investigation took longer than necessary because no observability tied the credential ID to the live Gateway Worker, so engineers spent over an hour suspecting credential propagation issues before discovering the wrong-environment deploy.
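One way to guard against the omitted flag is a wrapper that refuses to pass arguments to `wrangler` unless an environment is named explicitly. The sketch below is an assumption about how such a guard could look, not Cloudflare's remediation: `require_env` and the `allowed` list are hypothetical, and only the long `--env` form named in the incident is recognized.

```python
def require_env(args, allowed=("production",)):
    """Return the explicit --env value from a wrangler argument list.

    Raises ValueError when --env is missing or names an environment that is
    not in `allowed` (a hypothetical allow-list for rotation runbooks).
    """
    for i, arg in enumerate(args):
        if arg == "--env" and i + 1 < len(args):
            env = args[i + 1]
        elif arg.startswith("--env="):
            env = arg.split("=", 1)[1]
        else:
            continue
        if env not in allowed:
            raise ValueError(f"unexpected environment: {env!r}")
        return env
    # No flag found: wrangler would silently use the default environment.
    raise ValueError("refusing to run wrangler without an explicit --env")
```

A runbook script could call `require_env(["deploy", "--env", "production"])` before invoking the real command, so a missing flag fails loudly instead of deploying to the default environment.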
1h 7m · affected: not disclosed · Global: R2 customers and dependent services (Cache Reserve, Images, Stream, Logpush, Vectorize, Email Security metrics, Billing invoices) · ci/cd, credential rotation, deploy, human error
- SEV-0
Linear
Jan 24, 2024
Data Loss from Database Migration TRUNCATE CASCADE
A faulty database migration using TRUNCATE TABLE ... CASCADE accidentally deleted production data across multiple core tables — including issue and document descriptions, comments, notifications, favorites, and reactions. The deletion went unnoticed for 30 minutes due to multi-layer caching. Linear was taken offline for one hour, the database was restored from a backup taken several hours before the incident, and a two-day data restoration effort recovered over 99% of lost data.
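A migration review step can flag this statement shape before it reaches production. The following is a rough text-based check, assuming migrations are plain SQL strings; `find_truncate_cascade` is a hypothetical helper, and it does not parse SQL, so comments and string literals can produce false positives.

```python
import re

# TRUNCATE followed by CASCADE within the same statement (no ';' in between).
_TRUNCATE_CASCADE = re.compile(r"\bTRUNCATE\b[^;]*\bCASCADE\b", re.IGNORECASE)


def find_truncate_cascade(sql):
    """Return each statement fragment in `sql` that combines TRUNCATE with CASCADE."""
    return [match.group(0).strip() for match in _TRUNCATE_CASCADE.finditer(sql)]
```

For example, `find_truncate_cascade("TRUNCATE TABLE sessions CASCADE;")` returns the offending fragment, while a plain `TRUNCATE TABLE sessions;` returns an empty list (the table name is illustrative).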
3h 47m · affected: not disclosed · scope not stated · configuration error, customer-facing, data corruption, data loss
- SEV-0
Honeycomb
Jul 25, 2023
Total outage: feature-flag bug starves schema cache, MySQL deadlocks, all of Honeycomb goes down for 68 minutes
Late on July 24, 2023, engineers performed a routine cluster switch on Retriever to avoid a subtle bug. Hours later, the Shepherd SLO began burning slowly, but the issue was deemed minor and deferred to morning. The cluster switch had silently stopped the writes that fed the schema cache, and a feature flag bug meant flipping the flag back never re-enabled writes on hosts already told to stop. While engineers prepared a Retriever restart command, MySQL seized up under unexpected read pressure, hit a rare internal deadlock, ran out of connections, and brought down all of Honeycomb. Recovery required circuit-breaking ingest, failing over the database, and manually warming the schema cache.
1h 8m · affected: not disclosed · global; all user-facing components down; ingest gap permanently visible · cache, cascading failure, data loss, database
- SEV-1
Honeycomb
Sep 8, 2022
Metastable Shepherd cache lock contention takes down ingest for over eight hours
On September 8, 2022, Shepherd, Honeycomb's ingest service, entered a metastable failure loop characterized by repeating shark-fin latency patterns. Each Shepherd worker maintained an in-memory cache of dataset schemas guarded by a table-wide lock, and a missing entry being backfilled could cause unrelated requests to pile up. OOM crashes propagated to Refinery, which in turn was incorrectly suspected of triggering Shepherd's failure. The team's usual workarounds (vertically scaling Shepherds, scaling the database) did not stabilize the system, and after roughly eight and a half hours of intermittent disruption involving about ten engineers, a cache pre-fill change shipped under pressure restored service.
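The contention pattern (one slow backfill blocking unrelated lookups behind a table-wide lock) can be contrasted with per-key locking. Note that this is not the fix the summary describes, which was a cache pre-fill change; it is a separate sketch of the mechanism, with `PerKeyCache` and `load` as hypothetical names.

```python
import threading


class PerKeyCache:
    """Cache where a slow backfill blocks only requests for the same key."""

    def __init__(self, load):
        self._load = load                 # hypothetical callable: key -> value
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()    # held only for brief dictionary access

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key):
        with self._guard:
            if key in self._values:
                return self._values[key]
        with self._lock_for(key):
            # Re-check: another thread may have filled the entry while we waited.
            with self._guard:
                if key in self._values:
                    return self._values[key]
            value = self._load(key)       # slow path runs outside the shared guard
            with self._guard:
                self._values[key] = value
            return value
```

With a single table-wide lock held across `load`, every request would queue behind the slowest backfill; here only callers asking for the same missing key wait.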
9h · affected: not disclosed · most customers sending data during the incident window experienced at least partial impact · cache, cascading failure, lock contention, out of memory
- SEV-1
Honeycomb
Aug 5, 2022
Misconfigured customer SLO triggers continuous Lambda backfill, exhausting shared Lambda capacity
In early August 2022, Honeycomb's SLO measuring trigger run latency began alarming. Engineers initially attributed the issue to previously reported customer telemetry with future timestamps, which pulled trigger queries into cold storage and onto AWS Lambda. The Incident Commander had been monitoring the future-timestamps issue, and that framing dominated the response. An engineer with fresh context later traced the bulk of Lambda usage to a single misconfigured customer SLO whose SLI never returned valid results, causing Basset to assume a backfill was needed every minute and to relaunch a 60-day cold-storage scan repeatedly. Fixing the SLO on the customer's behalf stopped the bleeding.
9h · affected: not disclosed · customers depending on triggers and SLO alerting; query performance degraded · background jobs, cascading failure, manual action, sla breach
- SEV-1
Honeycomb
Nov 22, 2021
BI telemetry change silently breaks 94% of trigger notification emails for four days
On November 18, 2021, between 00:50 and 00:56 UTC, an update intended to improve business-intelligence telemetry from production was deployed. It contained a defect in how the third-party email SDK was used: the response object had to be inspected for hidden errors that did not appear in Go's idiomatic error return value. About 94.1% of trigger notification emails silently failed to send for the next four days. The instrumentation gap meant the email SLO did not detect the failures, automated tests passed because the third-party API was mocked, and the issue went unnoticed until a customer reported it on November 22 at 14:56 UTC.
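The defect class here is a call that returns normally while the response object carries the failure. The incident involved a Go SDK; the sketch below shows the same check in Python with an invented response shape. `SendResponse`, `EmailSendError`, and `send_checked` are hypothetical and do not model any real email SDK.

```python
from dataclasses import dataclass, field


@dataclass
class SendResponse:
    """Stand-in for a third-party SDK response (hypothetical shape)."""
    status_code: int
    errors: list = field(default_factory=list)


class EmailSendError(Exception):
    """Raised when a send call returns normally but the response reports failure."""


def send_checked(send, message):
    """Call `send(message)` and raise if the response hides an error."""
    response = send(message)
    # A normal return is not enough: inspect the response body as well.
    if response.status_code >= 400 or response.errors:
        raise EmailSendError(
            f"send failed: status={response.status_code} errors={response.errors}"
        )
    return response
```

Counting `EmailSendError` in telemetry would also give an email SLO something to measure, which a mocked third-party API in tests cannot provide.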
4d 14h 6m · affected: not disclosed · all customers depending on email trigger notifications globally · email delivery, missing monitoring, regression from deploy, sla breach
- SEV-1
Honeycomb
Nov 6, 2019
Slow memory leak across all ingest backends causes four 20-minute brownouts
On November 6, 2019, a slow memory leak introduced in a recent release progressed at the same rate across every ingest backend. All backends therefore ran out of memory and crashed within minutes of one another, causing requests in flight to fail and new requests to find no healthy backend. This produced four roughly 20-minute brownouts that rejected 1-3% of incoming telemetry. The SLO burn alert detected the issue within minutes, but resolution took several hours because the team fell into confirmation bias, blamed AWS Application Load Balancers, and waited on a support ticket while the leak quietly recurred. A fresh-eyed engineer eventually noticed the actual symptoms: process restarts and steadily climbing memory.
12h · affected: not disclosed · 1-3% of customer telemetry rejected during four ~20-minute windows · load balancer, memory leak, missing monitoring, out of memory
- SEV-0
Stripe
Jul 10, 2019
API severely degraded twice in one day: a minor database version upgrade introduced a latent failover bug that triggered under a rare multi-node stall condition, and the rollback intended to fix it interacted with a recent config change to cause a second distinct outage
On July 10, 2019, the Stripe API experienced two separate periods of severe degradation. The first, lasting 27 minutes, was caused by a latent bug introduced three months earlier in a minor database version upgrade: under a rare condition where multiple nodes stall simultaneously, the new version's failover protocol could not elect a primary, leaving an entire shard unable to accept writes. Because this shard underpinned a wide range of API operations, compute resources were rapidly exhausted across the API. After restarting the cluster to force a new election, Stripe recovered and then rolled back the database version as a precaution. That rollback triggered the second outage: the rolled-back version interacted unexpectedly with a recent configuration change to the production shards, causing CPU starvation on all affected shards. The second outage lasted 93 minutes and required engineers to identify the config interaction, apply a corrected configuration, and restart the cluster again.
6h 12m · affected: not disclosed · All Stripe API users globally; a substantial majority of API requests failed during both degradation windows (16:35–17:02 UTC and 21:14–22:47 UTC) · background degradation, cascading failure, config drift, configuration fix
Your own encores
Turn your incidents into structured post-mortems.
Aftermath runs the whole show: Live Stage captures the incident, Encore writes the post-mortem. Private and structured, like these, but yours.
Get early access