Total outage: feature-flag bug starves schema cache, MySQL deadlocks, all of Honeycomb goes down for 68 minutes
Honeycomb · Source
- Started: Jul 25, 2023
- Duration: 1h 8m
- Users affected: Not disclosed
- Revenue impact: Not disclosed
- Blast radius: global; all user-facing components down; ingest gap permanently visible
- Services: retriever, shepherd, mysql-rds, ingest, query, alerting
Summary
Late on July 24, 2023, engineers performed a routine cluster switch on Retriever to avoid a subtle bug. Hours later, the Shepherd SLO began burning slowly, but the issue was deemed minor and deferred until morning. The cluster switch had silently stopped the writes that fed the schema cache, and a feature-flag bug meant that flipping the flag back never re-enabled writes on hosts already told to stop. While engineers prepared a Retriever restart command, MySQL seized up under unexpected read pressure, hit a rare internal deadlock, ran out of connections, and brought down all of Honeycomb. Recovery required circuit-breaking ingest, failing over the database, and manually warming the schema cache.
Impact
All user-facing Honeycomb components were unavailable from 13:40 UTC to 14:48 UTC on July 25, 2023. No data could be processed or accessed during the 68-minute window. Telemetry not buffered client-side was lost; the ingest gap remains visible in the data for as long as it is retained.
Root cause
After a routine cluster switch in Retriever, the new cluster failed to update the timestamps used by the schema cache, undermining ingest's cached view of dataset schemas.
A latent feature-flag implementation bug: when the flag was switched, hosts that had been told to stop writing never tried again, even after the flag was switched back. A full reboot was required for writes to migrate.
Frequent deploys had been silently masking the feature-flag bug for years by restarting hosts as a side effect; pausing deploys during the investigation removed the masking.
Once the schema cache stopped being refreshed, reads against MySQL spiked and a few normal writes coincided with a rare race condition in MySQL internals, locking thread after thread until connections were exhausted.
Past near-misses had taught the team that read-replica failovers caused performance issues, leading them to defer a full failover during the actual incident; that hesitation slowed recovery.
Many widely-trusted practices (feature flags, frequent deploys, suspending deploys during incidents, learning from prior near-misses) each contributed in non-obvious ways to the outage.
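The feature-flag bug described above can be illustrated with a minimal sketch. This is not Honeycomb's code: the class names `LatchedWriter` and `ReevaluatingWriter`, the `simulate` helper, and the idea of modelling the flag as a callable are all assumptions made for illustration. It only shows the general bug class, in which a "stop" decision is remembered forever and only a process restart (a new object here) clears it, which is also why frequent deploys could mask it.

```python
class LatchedWriter:
    """Hypothetical writer that latches the 'stop' decision (the bug class)."""

    def __init__(self, flag_enabled):
        self._flag_enabled = flag_enabled  # callable returning the current flag state
        self._stopped = False

    def maybe_write(self):
        if self._stopped:
            # Once stopped, the flag is never consulted again.
            return False
        if not self._flag_enabled():
            self._stopped = True
            return False
        return True


class ReevaluatingWriter:
    """Hypothetical writer that re-reads the flag on every attempt."""

    def __init__(self, flag_enabled):
        self._flag_enabled = flag_enabled

    def maybe_write(self):
        return bool(self._flag_enabled())


def simulate(writer_cls, flag_states):
    """Drive one writer through a sequence of flag states; return write outcomes."""
    state = {"on": True}
    writer = writer_cls(lambda: state["on"])
    outcomes = []
    for on in flag_states:
        state["on"] = on
        outcomes.append(writer.maybe_write())
    return outcomes


if __name__ == "__main__":
    # Flag on, off, then back on: the latched writer stays stopped.
    print(simulate(LatchedWriter, [True, False, True]))
    print(simulate(ReevaluatingWriter, [True, False, True]))
```

In this sketch, constructing a fresh `LatchedWriter` stands in for restarting a host: the latch is cleared and writes resume if the flag is on.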
Resolution
Engineers set up circuit breakers to reject all Shepherd traffic with 5xx errors to protect the database, failed MySQL over to a replica after the primary was hard-locked, then manually marked all recently-active schemas with a 'last written: now' timestamp so the schema cache would reload all data. Once the cache was warm, the circuit breaker was removed, ingest was restored, and remaining Retriever hosts that had failed internal checks were restarted.
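The two manual recovery steps above, rejecting ingest traffic and then forcing the cache to reload, can be sketched roughly as follows. Everything here is a hypothetical model rather than Honeycomb's implementation: `IngestBreaker`, `mark_recently_active`, the in-memory `schemas` dictionary, and the choice of status 503 (the source only says "5xx") are assumptions, and the real work was done against MySQL and the Shepherd service.

```python
class IngestBreaker:
    """Hypothetical manual circuit breaker placed in front of ingest."""

    def __init__(self):
        self.open = False  # open = reject everything to protect the database

    def handle(self, request_handler, payload):
        if self.open:
            # Assumed status code; the source only says traffic got 5xx errors.
            return 503, "ingest temporarily disabled"
        return 200, request_handler(payload)


def mark_recently_active(schemas, active_since, now):
    """Stamp every recently active schema as 'last written: now'.

    `schemas` maps a dataset name to {"last_written": timestamp}. Returns the
    names that were touched, so a cache keyed on this timestamp would see them
    as changed and reload them.
    """
    touched = []
    for name, meta in schemas.items():
        if meta["last_written"] >= active_since:
            meta["last_written"] = now
            touched.append(name)
    return touched


if __name__ == "__main__":
    breaker = IngestBreaker()
    breaker.open = True  # step 1: shed all ingest traffic
    print(breaker.handle(lambda p: p, "event"))

    schemas = {"orders": {"last_written": 100}, "archive": {"last_written": 10}}
    print(mark_recently_active(schemas, active_since=50, now=200))  # step 2: warm

    breaker.open = False  # step 3: restore ingest once the cache is warm
    print(breaker.handle(lambda p: p, "event"))
```

The ordering matters in this sketch: the breaker stays open while the cache warms, so the reload does not compete with live ingest for database connections.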
Lessons
- Many trusted reliability practices can, in combination, contribute to an incident; the same techniques that make systems safer also obscure latent bugs they had been masking.
- A 'minor SLO burn with no good explanation' deferred to morning is exactly the failure mode that grows into a total outage; the unexplained part is the warning.
- Avoiding a subtle known bug by switching to older infrastructure can set the stage for a larger unforeseen outage; the move toward a 'safe' state is itself a change with unknown consequences.
- Prior near-misses can train a team to avoid the right action under real pressure; a database failover that 'felt slow' historically can become the only recovery path during a deadlock.
- Indirect dependencies created by shared caches and shared databases form an invisible web that makes 'we just touched X' resonate across far more services than the change-author intended.
Action items
- The migration has been completed, and all code that could disable writes behind the cache has been removed.
- Future architectures are being explored to further strengthen the cache, reduce contention during schema 'update storms,' and stabilize the performance cost of expensive operations.
- The team's assumption has been updated: database failover is now treated as the fastest recovery path for unlikely lockups of this kind.
- Investing in instrumentation and experimentation to better detect and handle these edge cases.