SEV-1 · public access

Metastable Shepherd cache lock contention takes down ingest for over eight hours

Honeycomb · Source

Started: Sep 8, 2022
Duration: 9h
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: most customers sending data during the incident window experienced at least partial impact
Services: shepherd, refinery, kafka, ingest, mysql

cache · cascading failure · lock contention · out of memory · scale up

Summary

On September 8, 2022, Shepherd, Honeycomb's ingest service, entered a metastable failure loop characterized by repeating shark-fin latency patterns. Each Shepherd worker maintained an in-memory cache of dataset schemas guarded by a table-wide lock, so backfilling a single missing entry could cause unrelated requests to pile up behind it. OOM crashes propagated to Refinery, which the team at first wrongly suspected of triggering Shepherd's failure. The team's usual workarounds (vertically scaling Shepherds, scaling the database) did not stabilize the system. After roughly eight and a half hours of intermittent disruption involving about ten engineers, a cache pre-fill change shipped under pressure restored service.

Impact

Data ingest was significantly disrupted for more than eight and a half hours in total, though not all of them consecutive. The team estimates that most customers sending data at the time were at least partially affected.

Root cause

Each Shepherd worker maintained an in-memory cache of dataset schemas, and updates to that cache acquired a table-wide lock; when a missing entry was backfilled, unrelated requests piled up behind the lock.
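
The contention described above can be illustrated with a minimal Python sketch. It is not Shepherd's code: the class name `SchemaCache`, the `loader` callback, and the per-key locking scheme are all assumptions chosen for illustration. It shows one possible way to keep a slow backfill of one dataset from blocking reads of datasets that are already cached: the table-wide lock is held only for dictionary access, and the slow load runs under a lock scoped to the missing key.

```python
import threading
from typing import Callable, Dict


class SchemaCache:
    """Toy schema cache: short table lock, per-key lock for slow backfills."""

    def __init__(self, loader: Callable[[str], dict]) -> None:
        self._loader = loader  # hypothetical slow lookup, e.g. a database query
        self._table_lock = threading.Lock()
        self._entries: Dict[str, dict] = {}
        self._key_locks: Dict[str, threading.Lock] = {}

    def get(self, dataset: str) -> dict:
        # Hold the table-wide lock only long enough to inspect the dictionaries.
        with self._table_lock:
            cached = self._entries.get(dataset)
            if cached is not None:
                return cached
            key_lock = self._key_locks.setdefault(dataset, threading.Lock())

        # Slow path: only callers asking for the same missing dataset wait here.
        with key_lock:
            with self._table_lock:
                cached = self._entries.get(dataset)
            if cached is not None:
                return cached  # another thread finished the backfill first
            schema = self._loader(dataset)
            with self._table_lock:
                self._entries[dataset] = schema
            return schema
```

If the loader were instead called while holding `_table_lock`, every request on the worker would queue behind a single missing entry, which is the failure shape this section describes.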

The system entered a metastable failure loop: prior shark-fin episodes had self-resolved within 15 minutes, but this time the bad state persisted.

Tension between competing remediations (fewer Shepherds reduce database contention; more Shepherds reduce per-host cache contention) meant scaling actions could make things worse, and the system was potentially suffering both contention regimes simultaneously.

Refinery, which sampled Shepherd's traces, also began OOMing, and the team initially read the relationship as Refinery causing Shepherd to fail rather than the reverse.

Pinned builds did not stop a Shepherd chart change from rolling out and reintroducing the shark-fin pattern, because service definitions were exempt from the artifact pin.

Aggressive sampling adopted to stabilize Refinery reduced the team's observability of their own system at exactly the time they needed it most.

Resolution

A cache pre-fill change was implemented and deployed to ensure Shepherd hosts populated their cache before accepting traffic, eliminating the lock contention spike on cold starts. Service stabilized immediately after the pre-fill landed. Refinery was then stabilized by adding hosts to its pool to handle the heavier load.
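
As a rough illustration of the pre-fill idea, the sketch below gates request handling on a warm cache. It is a simplified assumption, not the change Honeycomb shipped: `IngestWorker`, `prefill`, `is_ready`, and the `load_schema` callback are hypothetical names, and a real service would expose readiness through its orchestrator or load balancer health check.

```python
from typing import Callable, Dict, Iterable


class IngestWorker:
    """Toy worker that reports ready only after its schema cache is warm."""

    def __init__(self, load_schema: Callable[[str], dict]) -> None:
        self._load_schema = load_schema  # hypothetical per-dataset lookup
        self._cache: Dict[str, dict] = {}
        self._ready = False

    def prefill(self, datasets: Iterable[str]) -> None:
        # Populate every known dataset before any request is served.
        for name in datasets:
            self._cache[name] = self._load_schema(name)
        self._ready = True

    def is_ready(self) -> bool:
        # A health check would poll this before routing traffic to the host.
        return self._ready

    def handle(self, dataset: str) -> dict:
        if not self._ready:
            raise RuntimeError("worker is not accepting traffic yet")
        # Warm entries are served without touching the backing store.
        if dataset not in self._cache:
            self._cache[dataset] = self._load_schema(dataset)
        return self._cache[dataset]
```

The design choice is to pay the cache-fill cost once, serially, at boot rather than concurrently on the request path, where a cold start turns every first request into a backfill.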

Timeline

13:00 DETECT
Shepherd begins exhibiting the 'shark fin' latency pattern. Past experience suggests it will self-resolve within 15 minutes.
shepherd
13:30 DETECT
Pattern persists past the usual recovery window. Shepherd hosts begin OOMing and restarting; Refinery also enters cascading crashes.
shepherd
14:00 INVEST
Working hypothesis: Refinery is failing first and Shepherd is queueing data it cannot flush. Team adopts more aggressive sampling to stabilize Refinery.
refinery
15:00 INVEST
Aggressive sampling reduces observability. OOMs recur in both Shepherd and Refinery. Team falls back to cache-adjacent hypotheses and tries scaling Shepherds vertically; nothing helps.
shepherd
16:00 MITIG
System unexpectedly stabilizes with no clear explanation. Team pins builds to prevent further deploys and most engineers rest while a few investigate.
shepherd
17:00 INVEST
Investigators discover the table-wide lock in Shepherd's schema cache: a missing entry being backfilled can cause unrelated requests to pile up.
shepherd
18:00 DETECT
Despite the build pin, a Shepherd chart change rolls out and the shark-fin pattern resumes; service definitions were not covered by the pin.
shepherd
19:00 MITIG
Forced scale-down of Shepherds is attempted but the autoscaler reacts to the crashloop CPU spike by adding more hosts. Approach abandoned.
shepherd
20:00 MITIG
Engineers commit to fixing the cache itself under pressure: reduce contention on the hot path and pre-fill the cache before accepting traffic.
shepherd
21:30 MITIG
Cache pre-fill fix deployed. Ingest, connection pools, Shepherd, and Kafka all improve immediately. Only Refinery remains broken.
shepherd
22:00 RESOLV
Refinery is stabilized by adding hosts. The team confirms Shepherd can run fine without Refinery, contradicting the early hypothesis.
refinery

Attribution

Honeycomb

By Engineering

Published Sep 8, 2022

Lessons

Shark-fin latency graphs can look like recovery as the line falls, but the worst-offender spans complete last and the chart re-deteriorates a few minutes later; observations must be lagged before any effect is trusted.
When everything appears to feed into everything and engineers feel they cannot help, the human cost of the incident outlasts the technical impact; rest and rotation are part of recovery.
Pinning build artifacts is not the same as freezing the running system; service definitions, charts, and configs are separate change vectors that must be controlled together when stabilizing.
It is sometimes acceptable to ship a cache or hot-path fix under incident pressure if no stabilization path exists, but the preferred order is stabilize, understand, then fix while rested.
Mental models of how a service fails can be inaccurate in subtle ways, and incidents are an opportunity to surface and correct these models even when the trigger remains unknown.
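
The first lesson, lagging observations before trusting them, can be sketched as follows. This is an illustrative assumption rather than Honeycomb's tooling: the function names, the window format, and the lag and threshold values are hypothetical. The idea is to ignore latency windows that are still young enough for slow spans to be in flight, and to call the system recovered only after several settled windows stay under a threshold.

```python
from typing import List, Tuple

# (window_end_epoch_seconds, p99_latency_ms) pairs, oldest first
Window = Tuple[float, float]


def settled(windows: List[Window], now: float, lag_s: float) -> List[Window]:
    """Drop windows still young enough for slow spans to be in flight."""
    return [w for w in windows if now - w[0] >= lag_s]


def looks_recovered(windows: List[Window], now: float, lag_s: float,
                    threshold_ms: float, required: int) -> bool:
    """True only if the last `required` settled windows are under the threshold."""
    recent = settled(windows, now, lag_s)[-required:]
    return len(recent) == required and all(p99 < threshold_ms for _, p99 in recent)


# Hypothetical readings: the newest windows look healthy but are not settled yet.
windows = [(0, 900), (60, 950), (120, 200), (180, 150), (240, 120)]
naive = looks_recovered(windows, now=250, lag_s=0, threshold_ms=500, required=2)
lagged = looks_recovered(windows, now=250, lag_s=120, threshold_ms=500, required=2)
```

With these made-up numbers, the unlagged check reports recovery while the lagged check does not, because the lagged check still includes the 950 ms window among the last two settled readings.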

Action items

Pre-fill the schema cache on Shepherd boot before the host accepts traffic.
Reduce contention on the cache hot path so backfilling one missing entry does not block unrelated requests.
Bring the artifact pinning mechanism in line with the rest of the deploy surface so service definitions and charts are also controlled during incidents.
Add Refinery host capacity to handle the heavier load from Shepherd traffic patterns.
Encourage engineers involved in long incidents to take time off to rest; the people-impact of these events outlasts the technical recovery.
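
The pinning action item can be pictured as a single freeze check that every change vector must pass. The sketch below is hypothetical: `ChangeKind`, `Freeze`, and the list of change kinds are assumptions for illustration and do not describe Honeycomb's deploy system. It models a freeze that blocks every kind of change unless one is explicitly exempted, so an exemption like the one that let the chart change through has to be a visible, deliberate choice.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class ChangeKind(Enum):
    BUILD_ARTIFACT = "build_artifact"
    SERVICE_DEFINITION = "service_definition"
    CHART = "chart"
    CONFIG = "config"


@dataclass
class Freeze:
    """Incident freeze; while active, every change kind is blocked unless exempt."""
    active: bool = False
    exempt: Set[ChangeKind] = field(default_factory=set)

    def allows(self, kind: ChangeKind) -> bool:
        return not self.active or kind in self.exempt


# A full freeze blocks charts; a pin that exempts service definitions and charts does not.
full_freeze = Freeze(active=True)
artifact_only_pin = Freeze(
    active=True,
    exempt={ChangeKind.SERVICE_DEFINITION, ChangeKind.CHART, ChangeKind.CONFIG},
)
```

Here `artifact_only_pin` approximates the state during the incident, where only build artifacts were held back.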