SEV-1 · public access

Slow memory leak across all ingest backends causes four 20-minute brownouts

Honeycomb · Source

Started: Nov 6, 2019
Duration: 12h
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: 1-3% of customer telemetry rejected during four ~20-minute windows
Services: ingest, alb, slo
load balancer, memory leak, missing monitoring, out of memory, sla breach

Summary

On November 6, 2019, a recent release introduced a slow memory leak that progressed at the same rate on every ingest backend. The backends therefore ran out of memory and crashed within minutes of one another, so in-flight requests failed and new requests found no healthy backend. The result was four roughly 20-minute brownouts during which 1-3% of incoming telemetry was rejected. The SLO burn alert detected the problem within minutes, but resolution took several hours: the team anchored on the hypothesis that AWS Application Load Balancers were at fault, confirmation bias kept them there, and they waited on a support ticket while the leak quietly recurred. An engineer with fresh eyes eventually noticed the actual symptoms: process restarts and steadily climbing memory.

Impact

1-3% of customer telemetry was rejected at ingest during four roughly 20-minute brownouts spread across the day. Some end-to-end probers timed out but recovered immediately, which masked the issue from traditional black-box monitoring.

Root cause

A recent release introduced a slow memory leak that affected all ingest backends at the same per-host rate, so backends OOMed within minutes of each other and the cluster as a whole had no healthy replicas during recovery windows.
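The synchronization effect can be illustrated with a small model. This is a minimal sketch, not a reconstruction of the actual fleet: it assumes a perfectly linear leak, and the backend names, start offsets, baseline, leak rate, and memory limit are all hypothetical numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    start_minute: float     # minute at which the process last (re)started
    baseline_mb: float      # memory in use right after start
    leak_mb_per_min: float  # assumed constant leak rate


def oom_minute(b: Backend, limit_mb: float) -> float:
    """Minute at which a linearly leaking backend reaches its memory limit."""
    return b.start_minute + (limit_mb - b.baseline_mb) / b.leak_mb_per_min


def crash_window(backends, limit_mb):
    """Return (first_crash, last_crash) in minutes across the fleet."""
    times = [oom_minute(b, limit_mb) for b in backends]
    return min(times), max(times)


# Hypothetical fleet: a rolling deploy starts each backend 2 minutes apart,
# and every backend leaks at the same rate.
fleet = [
    Backend(f"ingest-{i}", start_minute=2.0 * i, baseline_mb=500.0,
            leak_mb_per_min=10.0)
    for i in range(5)
]
first, last = crash_window(fleet, limit_mb=2500.0)
print(f"first OOM at minute {first:.0f}, last at minute {last:.0f}")
```

Under these assumptions every backend survives for the same 200 minutes, so the crashes are spread only as widely as the original start times were (8 minutes here). Restarting all backends together after the crash resets the clock for the whole fleet at once, which is consistent with the repeated brownouts described above.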

The team anchored on an early hypothesis that AWS Application Load Balancers were failing because both 'backend unreachable' and 'backend timed out' responses appeared in ALB logs, and a single AWS customer report seemed to corroborate it.

The SLO burn alert was not configured to page because the SLO feature was still in beta; the alert was instead picked up by a non-on-call engineer in Europe, which led to ad-hoc handoffs and unclear ownership.

Internal telemetry from the ingest workers was lost when those workers crashed, so the team could not see the OOMs in their own dogfood data and was easily misled into thinking ingest workers themselves were healthy.

Resolution

While cross-checking their mental model against service data, an engineer with fresh eyes discovered that the ingest backends were restarting and that their memory usage was climbing steadily. The team reverted the bad commit, pushed a fixed release, and confirmed that memory usage stayed flat with no further crashes.
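A "steadily climbing memory" signal like the one noticed here can be checked mechanically. The following is a minimal sketch that fits a least-squares line to memory samples and flags a sustained upward slope; the function names, the sample format, and the 1 MB/min threshold are assumptions for illustration, not part of the team's tooling.

```python
def leak_slope(samples):
    """Least-squares slope of (minute, memory_mb) samples, in MB per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den


def looks_like_leak(samples, min_slope_mb_per_min=1.0):
    """True when memory grows at least min_slope_mb_per_min on average."""
    return leak_slope(samples) >= min_slope_mb_per_min


# Hypothetical samples: (minutes since start, resident memory in MB).
climbing = [(0, 500), (10, 600), (20, 700), (30, 800)]
flat = [(0, 500), (10, 500), (20, 500)]
print(looks_like_leak(climbing), looks_like_leak(flat))
```

A check like this only helps if the samples survive the crash, so it would need to run against data collected outside the leaking process.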

Lessons

  • A memory leak that proceeds at the same rate across all replicas turns what would otherwise be a single-host fault into a synchronized cluster-wide outage; per-host failure isolation does nothing in this regime.
  • Confirmation bias gets stronger when the dominant theory has any external validation; a single supportive AWS customer comment was enough to redirect hours of effort toward an innocent component.
  • Crashing workers lose the in-flight telemetry that would explain why they are crashing; backpressure or graceful shutdown that flushes telemetry first preserves the evidence the team needs.
  • Beta features used to detect production incidents should still be wired into the paging path, not the lower-priority alerting path, because real incidents do not respect feature maturity.
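The last lesson can be made concrete with a burn-rate calculation that routes to the paging path based on severity alone. This is a sketch under stated assumptions: the 99.9% target, the thresholds of 10 and 1, and the function names are hypothetical, and this is a generic burn-rate formulation rather than a description of how the SLO feature in this incident computes its alerts.

```python
def burn_rate(failed, total, slo_target):
    """How many times faster than sustainable the error budget is being spent.

    A value of 1.0 means the budget would last exactly the SLO window.
    """
    if total == 0:
        return 0.0
    error_rate = failed / total
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    return error_rate / budget


def route_alert(rate, page_threshold=10.0, ticket_threshold=1.0):
    """Pick a destination from the burn rate, regardless of feature maturity."""
    if rate >= page_threshold:
        return "page"
    if rate >= ticket_threshold:
        return "ticket"
    return "none"


# Hypothetical window: 2% of requests rejected against a 99.9% target.
rate = burn_rate(failed=200, total=10_000, slo_target=0.999)
print(f"burn rate {rate:.1f} -> {route_alert(rate)}")
```

With these illustrative numbers, a 2% rejection rate spends the budget about twenty times faster than sustainable, which clears the assumed paging threshold.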

Action items

  • Reverted the bad commit and pushed a fix; verified memory usage stays constant going forward.
  • Promote user-facing SLO alerts to actually page on-call the same way end-to-end black-box probers do.
  • Treat high process crash, panic, and OOM rates as diagnostic signals visible to debuggers, even when not used for paging.
  • Consider adding backpressure (returning unhealthy status codes) when ingest workers are resource-constrained, so internal telemetry continues flowing instead of being lost in the crash.
  • Lower the bar for declaring incidents and challenging assumptions: prefer declaring an unnecessary incident over missing a real one.
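The backpressure action item could take roughly the following shape. This is a minimal sketch, assuming the worker can read its own memory usage and limit; `TelemetryBuffer`, `check_health`, and the 85% shed threshold are hypothetical names and values. The key constraint is that an out-of-memory kill cannot be intercepted by the process, so telemetry has to be flushed before the limit is reached.

```python
from http import HTTPStatus


class TelemetryBuffer:
    """Holds events in memory until they are shipped by `sink`."""

    def __init__(self, sink):
        self._events = []
        self._sink = sink  # callable that ships a list of events

    def record(self, event):
        self._events.append(event)

    def flush(self):
        """Ship buffered events now so a later crash cannot lose them."""
        if self._events:
            self._sink(self._events)
            self._events = []


def check_health(rss_bytes, limit_bytes, telemetry, shed_fraction=0.85):
    """Report unhealthy, and flush telemetry, once memory nears the limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    if rss_bytes >= shed_fraction * limit_bytes:
        telemetry.record({
            "event": "memory_pressure",
            "rss_bytes": rss_bytes,
            "limit_bytes": limit_bytes,
        })
        telemetry.flush()
        return HTTPStatus.SERVICE_UNAVAILABLE  # load balancer stops routing here
    return HTTPStatus.OK


# Usage with an in-memory sink standing in for a real telemetry pipeline.
shipped = []
buffer = TelemetryBuffer(shipped.extend)
print(check_health(900, 1000, buffer), check_health(100, 1000, buffer))
```

Returning an unhealthy status sheds load from the constrained worker, and flushing at the same moment preserves the evidence of why it became unhealthy. If every backend leaks at the same rate, all of them would report unhealthy at about the same time, so this preserves diagnostics more than it preserves capacity.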