SEV-0 · public access

Total outage: feature-flag bug starves schema cache, MySQL deadlocks, all of Honeycomb goes down for 68 minutes

Honeycomb · Source

Started: Jul 25, 2023
Duration: 1h 8m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: global; all user-facing components down; ingest gap permanently visible
Services: retriever, shepherd, mysql-rds, ingest, query, alerting

cache · cascading failure · data loss · database · deadlock · feature flag misconfiguration

Summary

Late on July 24, 2023, engineers performed a routine cluster switch on Retriever to avoid a subtle bug. Hours later, the Shepherd SLO began burning slowly, but the issue was deemed minor and deferred to morning. The cluster switch had silently stopped the writes that fed the schema cache, and a feature flag bug meant flipping the flag back never re-enabled writes on hosts already told to stop. While engineers prepared a Retriever restart command, MySQL seized up under unexpected read pressure, hit a rare internal deadlock, ran out of connections, and brought down all of Honeycomb. Recovery required circuit-breaking ingest, failing over the database, and manually warming the schema cache.

Impact

All user-facing Honeycomb components were unavailable from 13:40 UTC to 14:48 UTC on July 25, 2023. No data could be processed or accessed during the 68-minute window. Telemetry not buffered client-side was lost; the ingest gap remains visible in the data for as long as it is retained.
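
The 68-minute figure follows directly from the two UTC timestamps above. A quick check, with illustrative variable names:

```python
from datetime import datetime, timezone

# Outage window as stated in the Impact section (UTC).
start = datetime(2023, 7, 25, 13, 40, tzinfo=timezone.utc)
end = datetime(2023, 7, 25, 14, 48, tzinfo=timezone.utc)

outage = end - start
outage_minutes = int(outage.total_seconds() // 60)

# Split into hours and minutes to match the "Duration" field.
hours, minutes = divmod(outage_minutes, 60)
print(f"{outage_minutes} minutes ({hours}h {minutes}m)")  # 68 minutes (1h 8m)
```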

Root cause

A routine cluster switch in Retriever caused the new cluster to fail to update timestamps used by the schema cache, undermining ingest's cached view of dataset schemas.

A latent bug in the feature-flag implementation: once the flag was switched, hosts that had been told to stop writing never tried again, even after the flag was switched back. Only a full reboot of those hosts would bring their writes back.

Frequent deploys had been silently masking the feature-flag bug for years by restarting hosts as a side effect; pausing deploys during the investigation removed the masking.

Once the schema cache stopped being refreshed, reads against MySQL spiked and a few normal writes coincided with a rare race condition in MySQL internals, locking thread after thread until connections were exhausted.

Past near-misses had taught the team that read-replica failovers caused performance issues, leading them to defer a full failover during the actual incident; that hesitation slowed recovery.

Many widely-trusted practices (feature flags, frequent deploys, suspending deploys during incidents, learning from prior near-misses) each contributed in non-obvious ways to the outage.
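
The feature-flag bug described above can be illustrated with a minimal sketch. This is not Honeycomb's code: the class names, the `tick` loop, and the `writes_enabled` flag are all hypothetical, chosen only to show how a "stop" decision that latches differs from one that is re-evaluated, and why a restart (for example, as a side effect of a deploy) hides the problem.

```python
class LatchedWriter:
    """Buggy pattern (hypothetical): once told to stop, never re-checks the flag."""

    def __init__(self, flag_enabled):
        self._flag_enabled = flag_enabled  # callable returning the current flag value
        self._stopped = False
        self.writes = 0

    def tick(self):
        if self._stopped:
            return  # latched: only constructing a new instance (a restart) clears this
        if not self._flag_enabled():
            self._stopped = True
            return
        self.writes += 1


class ReevaluatingWriter:
    """Safer pattern (hypothetical): the flag is consulted on every tick."""

    def __init__(self, flag_enabled):
        self._flag_enabled = flag_enabled
        self.writes = 0

    def tick(self):
        if self._flag_enabled():
            self.writes += 1


flag = {"writes_enabled": True}


def read_flag():
    return flag["writes_enabled"]


buggy = LatchedWriter(read_flag)
fixed = ReevaluatingWriter(read_flag)


def run(ticks):
    for _ in range(ticks):
        buggy.tick()
        fixed.tick()


run(3)                           # flag on: both writers write
flag["writes_enabled"] = False
run(3)                           # flag off: both writers stop
flag["writes_enabled"] = True
run(3)                           # flag back on: only the re-evaluating writer resumes

# A restart, as a deploy would cause as a side effect, clears the latch.
restarted = LatchedWriter(read_flag)
restarted.tick()

print(buggy.writes, fixed.writes, restarted.writes)  # 3 6 1
```

With frequent deploys, every host behaves like `restarted` shortly after any flag change, so the latch is rarely observed. Pausing deploys leaves hosts in the `buggy` state indefinitely.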

Resolution

Engineers set up circuit breakers to reject all Shepherd traffic with 5xx errors to protect the database, failed MySQL over to a replica after the primary was hard-locked, then manually marked all recently-active schemas with a 'last written: now' timestamp so the schema cache would reload all data. Once the cache was warm, the circuit breaker was removed, ingest was restored, and remaining Retriever hosts that had failed internal checks were restarted.
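
As a rough sketch of the first step, an operator-controlled circuit breaker can sit in front of the ingest path and reject requests before they reach the database. Everything here is an assumption for illustration: the `ManualCircuitBreaker` class, the `ingest_backend` stand-in, and the choice of status 503 (the source only says "5xx").

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    body: str


class ManualCircuitBreaker:
    """Hypothetical breaker: while open, requests never reach the backend."""

    def __init__(self):
        self.open = False
        self.rejected = 0

    def handle(self, request, backend):
        if self.open:
            self.rejected += 1
            # 503 is an assumption; the source only specifies a 5xx error.
            return Response(503, "ingest temporarily disabled")
        return backend(request)


calls = {"db": 0}


def ingest_backend(request):
    """Hypothetical stand-in for the ingest path that would query the database."""
    calls["db"] += 1
    return Response(200, "accepted")


breaker = ManualCircuitBreaker()

breaker.open = True   # protect the database during recovery
during = [breaker.handle({"event": i}, ingest_backend).status for i in range(3)]

breaker.open = False  # cache is warm again: let traffic through
after = breaker.handle({"event": 3}, ingest_backend).status

print(during, after, calls["db"])  # [503, 503, 503] 200 1
```

The point of the design is that rejected requests cost the database nothing, which gives a failover and cache warm-up room to complete before load returns.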

Timeline

22:00 · MITIG
Engineers perform a routine cluster switch in Retriever to avoid a known subtle bug. The switch silently stops the writes that update timestamps feeding the schema cache.
retriever
03:00 · DETECT
Hours later, the Shepherd SLO begins burning slowly. The performance issue is unexplained but marginal, so the team decides to investigate in the morning.
shepherd
12:00 · INVEST
In the morning, engineers find that the new Retriever cluster's database calls for cache timestamps suddenly stopped, undermining the ingest schema cache.
retriever
12:30 · MITIG
The team flips a feature flag to send only writes back to the previous cluster, expecting that to restore cache updates. It does not work.
retriever
13:30 · INVEST
Engineers identify the implementation bug: hosts told to stop never resume when the flag is switched back. A full reboot is required.
retriever
13:40 · DETECT
Just before the restart command is ready, MySQL seizes up under read pressure from cache misses. A rare deadlock cascades across threads until connections are exhausted. All of Honeycomb goes down.
mysql-rds
13:50 · MITIG
Circuit breakers are set up to reject all Shepherd traffic with 5xx errors, protecting the database from further pressure.
shepherd
14:10 · MITIG
An attempt to recover the database host fails; it is hard-locked. The team fails over to a replica.
mysql-rds
14:25 · MITIG
The database recovers. Engineers manually update 'last written' timestamps for all schemas seen in the past day to 'now', forcing the schema cache to reload.
mysql-rds
14:40 · MITIG
The circuit breaker is removed and ingest resumes. Remaining Retriever hosts that failed internal checks are restarted.
shepherd
14:48 · RESOLV
All services healthy; full querying capacity restored. Honeycomb is back.
retriever
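
The 14:25 step, marking every schema seen in the past day as written 'now', amounts to a single bulk update. The sketch below uses an in-memory SQLite database as a stand-in for MySQL; the `schemas` table, its `dataset` and `last_written` columns, and the sample rows are hypothetical, since the source does not describe the real schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Time of the timeline entry for this step (UTC).
now = datetime(2023, 7, 25, 14, 25, tzinfo=timezone.utc)


def iso(ts):
    # Timestamps share one format and offset, so string comparison orders them correctly.
    return ts.isoformat()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schemas (dataset TEXT PRIMARY KEY, last_written TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO schemas VALUES (?, ?)",
    [
        ("active-a", iso(now - timedelta(hours=3))),   # hypothetical sample data
        ("active-b", iso(now - timedelta(hours=20))),
        ("dormant", iso(now - timedelta(days=30))),
    ],
)

# Touch only schemas written within the past day, leaving dormant ones alone.
cutoff = iso(now - timedelta(days=1))
cur = conn.execute(
    "UPDATE schemas SET last_written = ? WHERE last_written >= ?",
    (iso(now), cutoff),
)
conn.commit()
touched = cur.rowcount

refreshed = [
    row[0]
    for row in conn.execute(
        "SELECT dataset FROM schemas WHERE last_written = ? ORDER BY dataset",
        (iso(now),),
    )
]
print(touched, refreshed)  # 2 ['active-a', 'active-b']
```

Restricting the update to recently active schemas keeps the reload proportional to live traffic rather than to every dataset ever created.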

Attribution

Honeycomb

By Engineering

Published Jul 25, 2023

Lessons

Many trusted reliability practices can, in combination, contribute to an incident; the same techniques that make systems safer also obscure latent bugs they had been masking.
A 'minor SLO burn with no good explanation' deferred to morning is the kind of signal that can grow into a total outage; the missing explanation is itself the warning.
Avoiding a subtle known bug by switching to older infrastructure can set the stage for a larger unforeseen outage; the move toward a 'safe' state is itself a change with unknown consequences.
Prior near-misses can train a team to avoid the right action under real pressure; a database failover that 'felt slow' historically can become the only recovery path during a deadlock.
Indirect dependencies created by shared caches and shared databases form an invisible web that makes 'we just touched X' resonate across far more services than the change-author intended.
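
To make the 'slow burn' lesson concrete, the standard burn-rate calculation shows how even a marginal error rate consumes an error budget. All numbers below are hypothetical; the source does not publish the Shepherd SLO target, window, or error counts.

```python
from fractions import Fraction

# Hypothetical SLO: 99.9% of requests succeed over a 30-day window.
slo_target = Fraction(999, 1000)
window_days = 30

# Hypothetical observation: 2 failures per 1,000 requests.
total_requests = 1000
failed_requests = 2

error_budget = 1 - slo_target                      # allowed failure ratio
error_ratio = Fraction(failed_requests, total_requests)

# Burn rate 1 spends the budget exactly over the window; above 1 spends it early.
burn_rate = error_ratio / error_budget
days_to_exhaustion = Fraction(window_days) / burn_rate

print(burn_rate, days_to_exhaustion)  # 2 15
```

A burn rate of 2 looks marginal in the moment, yet it halves the time until the budget is gone, and it says nothing about whether the underlying cause is stable or about to escalate.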

Action items

The migration has been completed, and all code that could disable writes behind the cache has been removed.
Future architectures are being explored to strengthen the cache further, reduce contention during schema 'update storms,' and stabilize the performance cost of expensive operations.
The team updated its assumptions: database failover is now treated as the fastest recovery path for unlikely lockups of this kind.
Investing in instrumentation and experimentation to better detect and handle these edge cases.