SEV-1 · public access

Metastable Shepherd cache lock contention takes down ingest for over eight hours

Honeycomb · Source

Started
Sep 8, 2022
Duration
9h
Users affected
Not disclosed
Revenue impact
Not disclosed
Blast radius
most customers sending data during the incident window experienced at least partial impact
Services
shepherd, refinery, kafka, ingest, mysql
Tags
cache, cascading failure, lock contention, out of memory, scale up

Summary

On September 8, 2022, Shepherd, Honeycomb's ingest service, entered a metastable failure loop characterized by repeating shark-fin latency patterns. Each Shepherd worker maintained an in-memory cache of dataset schemas guarded by a table-wide lock, so backfilling a single missing entry could cause unrelated requests to pile up behind it. Out-of-memory (OOM) crashes propagated to Refinery, which was at first incorrectly suspected of triggering Shepherd's failure. The team's usual workarounds (vertically scaling Shepherds, scaling the database) did not stabilize the system. After roughly eight and a half hours of intermittent disruption involving about ten engineers, a cache pre-fill change shipped under pressure restored service.

Impact

Data ingest was significantly disrupted for over eight and a half hours in total, though not consecutively. The team estimates that most customers sending data at the time were at least partially impacted.

Root cause

Each Shepherd worker maintained an in-memory cache of dataset schemas, and updates to that cache acquired a table-wide lock; when a missing entry was backfilled, unrelated requests piled up behind the lock.
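
The contention pattern can be illustrated with a minimal sketch. This is not Shepherd's code, which the source does not show; the class `TableLockedSchemaCache`, the `load_schema` callback, the `timeout` parameter, and the dataset names `warm` and `cold` are all hypothetical, chosen only to make the blocking visible.

```python
import threading


class TableLockedSchemaCache:
    """Toy cache guarded by one lock for the whole table (hypothetical)."""

    def __init__(self, load_schema):
        self._lock = threading.Lock()
        self._schemas = {}
        self._load_schema = load_schema  # stand-in for a slow database read

    def get(self, dataset, timeout=-1):
        # Every caller, hit or miss, must take the same table-wide lock.
        if not self._lock.acquire(timeout=timeout):
            raise TimeoutError(f"timed out waiting for cache lock: {dataset}")
        try:
            if dataset not in self._schemas:
                # The backfill runs while the lock is held, so requests for
                # unrelated, already-cached datasets queue behind it.
                self._schemas[dataset] = self._load_schema(dataset)
            return self._schemas[dataset]
        finally:
            self._lock.release()


def demo_blocked_hit():
    """Return True if a cached lookup stalls behind an unrelated backfill."""
    backfill_started = threading.Event()
    finish_backfill = threading.Event()

    def load_schema(dataset):
        if dataset == "cold":
            backfill_started.set()
            finish_backfill.wait()  # simulate a slow database query
        return {"dataset": dataset}

    cache = TableLockedSchemaCache(load_schema)
    cache.get("warm")  # "warm" is now cached

    worker = threading.Thread(target=cache.get, args=("cold",))
    worker.start()
    backfill_started.wait()  # the worker now holds the lock

    try:
        cache.get("warm", timeout=0.05)  # a pure cache hit
        blocked = False
    except TimeoutError:
        blocked = True
    finally:
        finish_backfill.set()
        worker.join()
    return blocked
```

In this sketch the lookup for `warm` times out even though its schema is already cached, because the backfill for `cold` holds the only lock for the duration of the slow load.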

The system entered a metastable failure loop: prior shark-fin episodes had self-resolved within 15 minutes, but this time the bad state persisted.

Tension between competing remediations (fewer Shepherds reduce database contention; more Shepherds reduce per-host cache contention) meant scaling actions could make things worse, and the system was potentially suffering both contention regimes simultaneously.

Refinery, which sampled Shepherd's traces, also began OOMing, and the team initially read the relationship as Refinery causing Shepherd to fail rather than the reverse.

Pinned builds did not stop a Shepherd chart change from rolling out and reintroducing the shark-fin pattern, because service definitions were exempt from the artifact pin.
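
One way to picture the gap is a freeze keyed only on build artifacts, which lets other kinds of change through. The sketch below is purely illustrative: the change-kind names and the `allowed_during_freeze` helper are assumptions and do not describe Honeycomb's deploy tooling.

```python
# Hypothetical change kinds; real deploy systems will name these differently.
ARTIFACT = "build_artifact"
SERVICE_DEFINITION = "service_definition"
CHART = "chart"
CONFIG = "config"

# A pin that covers only artifacts, as in the failure described above.
ARTIFACT_ONLY_FREEZE = frozenset({ARTIFACT})

# A freeze that treats every change vector as part of the running system.
FULL_FREEZE = frozenset({ARTIFACT, SERVICE_DEFINITION, CHART, CONFIG})


def allowed_during_freeze(change_kind, frozen_kinds):
    """Return True if a change of this kind would still roll out."""
    return change_kind not in frozen_kinds
```

Under the artifact-only freeze, `allowed_during_freeze(CHART, ARTIFACT_ONLY_FREEZE)` returns `True`, which mirrors how the chart change rolled out despite the pin.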

Aggressive sampling adopted to stabilize Refinery reduced the team's observability of their own system at exactly the time they needed it most.

Resolution

A cache pre-fill change was implemented and deployed to ensure Shepherd hosts populated their cache before accepting traffic, eliminating the lock contention spike on cold starts. Service stabilized immediately after the pre-fill landed. Refinery was then stabilized by adding hosts to its pool to handle the heavier load.
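
The idea behind the pre-fill can be sketched as follows. This is a simplified illustration, not the change Honeycomb shipped; the `ShepherdWorkerSketch` class, its `ready` flag, and the `list_datasets` and `load_schema` callbacks are hypothetical names.

```python
class ShepherdWorkerSketch:
    """Hypothetical worker that refuses traffic until its cache is warm."""

    def __init__(self, list_datasets, load_schema):
        self._list_datasets = list_datasets  # assumed: returns known dataset names
        self._load_schema = load_schema      # assumed: reads one schema from storage
        self._schemas = {}
        self.ready = False

    def start(self):
        # Populate every known schema first. No request competes for the
        # cache yet, so a cold start cannot trigger a burst of backfills.
        for dataset in self._list_datasets():
            self._schemas[dataset] = self._load_schema(dataset)
        # Only now report ready (for example, to a health check).
        self.ready = True

    def handle(self, dataset):
        if not self.ready:
            raise RuntimeError("worker is not ready to accept traffic")
        return self._schemas[dataset]
```

The design choice is to pay the load cost once, serially, at boot, instead of on the request path where each miss would contend for the cache lock.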

Lessons

  • Shark-fin latency graphs can look like recovery as the line falls, but the worst-offender spans complete last and the chart re-deteriorates a few minutes later; wait for those lagging spans to arrive before trusting any apparent effect.
  • When everything appears to feed into everything, and engineers feel they can't help, the human cost of the incident outlasts the technical impact; rest and rotation are part of recovery.
  • Pinning build artifacts is not the same as freezing the running system; service definitions, charts, and configs are separate change vectors that must be controlled together when stabilizing.
  • It is sometimes acceptable to ship a cache or hot-path fix under incident pressure if no stabilization path exists, but the preferred order is stabilize, understand, then fix while rested.
  • Mental models of how a service fails can be inaccurate in subtle ways, and incidents are an opportunity to surface and correct these models even when the trigger remains unknown.
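
The first lesson, about not trusting the falling edge of a shark fin, can be expressed as a simple settling check. This is a sketch under stated assumptions: the function name `looks_recovered`, the threshold, and the number of samples to wait are illustrative, and the latency values in the usage note are made up, not taken from the incident.

```python
def looks_recovered(latencies_ms, threshold_ms, settle_samples):
    """Return True only if the last `settle_samples` readings are all healthy.

    A single reading below the threshold is not trusted, because the falling
    edge of a shark fin also dips before latency climbs again.
    """
    if settle_samples <= 0:
        raise ValueError("settle_samples must be positive")
    if len(latencies_ms) < settle_samples:
        return False  # not enough history to judge yet
    recent = latencies_ms[-settle_samples:]
    return all(sample < threshold_ms for sample in recent)
```

For example, with a threshold of 100 ms and three settling samples, the series `[900, 600, 300, 80]` is not yet considered recovered, while `[900, 300, 80, 70, 60]` is.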

Action items

  • Pre-fill the schema cache on Shepherd boot before the host accepts traffic.
  • Reduce contention on the cache hot path so backfilling one missing entry does not block unrelated requests.
  • Bring the artifact pinning mechanism in line with the rest of the deploy surface so service definitions and charts are also controlled during incidents.
  • Add Refinery host capacity to handle the heavier load from Shepherd traffic patterns.
  • Encourage engineers involved in long incidents to take time off to rest; the people-impact of these events outlasts the technical recovery.
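
For the action item on reducing hot-path contention, one common approach is to hold a shared lock only for brief map access and to run the slow load under a per-dataset lock. The sketch below assumes that approach; the source does not say which technique Honeycomb chose, and `PerKeySchemaCache`, `load_schema`, and the dataset names are hypothetical.

```python
import threading


class PerKeySchemaCache:
    """Toy cache where a backfill blocks only callers of the same dataset."""

    def __init__(self, load_schema):
        self._map_lock = threading.Lock()  # held only for brief dict access
        self._schemas = {}
        self._key_locks = {}
        self._load_schema = load_schema  # stand-in for a slow database read

    def get(self, dataset):
        with self._map_lock:
            if dataset in self._schemas:
                return self._schemas[dataset]
            key_lock = self._key_locks.setdefault(dataset, threading.Lock())
        # The slow load happens under the per-dataset lock only.
        with key_lock:
            with self._map_lock:
                if dataset in self._schemas:  # another caller filled it
                    return self._schemas[dataset]
            schema = self._load_schema(dataset)
            with self._map_lock:
                self._schemas[dataset] = schema
            return schema


def demo_unblocked_hit():
    """Serve a cached dataset while an unrelated backfill is in flight."""
    backfill_started = threading.Event()
    finish_backfill = threading.Event()

    def load_schema(dataset):
        if dataset == "cold":
            backfill_started.set()
            finish_backfill.wait()  # simulate a slow database query
        return {"dataset": dataset}

    cache = PerKeySchemaCache(load_schema)
    cache.get("warm")  # "warm" is now cached

    worker = threading.Thread(target=cache.get, args=("cold",))
    worker.start()
    backfill_started.wait()  # the "cold" load is now in progress

    try:
        return cache.get("warm")  # returns without waiting on "cold"
    finally:
        finish_backfill.set()
        worker.join()
```

This sketch never removes per-dataset locks, so a real implementation would need to bound that map; it is meant only to show that a miss on one dataset need not stall hits on others.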