SEV-1 · public access

Slow memory leak across all ingest backends causes four 20-minute brownouts

Honeycomb · Source

Started: Nov 6, 2019
Duration: 12h
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: 1-3% of customer telemetry rejected during four ~20-minute windows
Services: ingest, alb, slo

Tags: load balancer, memory leak, missing monitoring, out of memory, sla breach

Summary

On November 6, 2019, a slow memory leak introduced in a recent release grew at the same rate on every ingest backend. All backends therefore ran out of memory and crashed within minutes of one another, so in-flight requests failed and new requests found no healthy backend. This produced four roughly 20-minute brownouts that rejected 1-3% of incoming telemetry. The SLO burn alert detected the issue within minutes, but resolution took several hours because the team anchored on an early hypothesis, reinforced by confirmation bias, that AWS Application Load Balancers were at fault, and waited on a support ticket while the leak quietly recurred. A fresh-eyed engineer eventually noticed the actual symptom: process restarts and steadily climbing memory.

Impact

1-3% of customer telemetry was rejected at ingest during four roughly 20-minute brownouts spread across the day. Some end-to-end probers timed out and recovered immediately, masking the issue from traditional black-box monitoring.

Root cause

A recent release introduced a slow memory leak that affected all ingest backends at the same per-host rate, so backends OOMed within minutes of each other and the cluster as a whole had no healthy replicas during recovery windows.
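
The synchronization effect can be illustrated with simple arithmetic: when every host leaks at the same rate, small differences in starting headroom translate into only a few minutes of spread between crashes. The sketch below is a minimal illustration, not Honeycomb's analysis. The headroom values and the leak rate are assumed figures, chosen only so that the time to OOM matches the roughly 12 hours between the 20:00 deploy and the 08:00 brownout in the timeline.

```python
def hours_to_oom(headroom_mib: float, leak_mib_per_hour: float) -> float:
    """Hours until a process exhausts its memory headroom at a constant leak rate."""
    if leak_mib_per_hour <= 0:
        raise ValueError("leak rate must be positive")
    return headroom_mib / leak_mib_per_hour


def crash_window_minutes(headrooms_mib, leak_mib_per_hour):
    """Spread, in minutes, between the first and the last host to run out of memory."""
    times = [hours_to_oom(h, leak_mib_per_hour) for h in headrooms_mib]
    return (max(times) - min(times)) * 60


# Hypothetical fleet: identical leak rate, headroom differing by a few MiB.
fleet_headroom_mib = [6000, 6010, 5990, 6025]
leak_rate = 500.0  # MiB per hour, an assumed figure

print(round(hours_to_oom(6000, leak_rate), 2))                         # 12.0
print(round(crash_window_minutes(fleet_headroom_mib, leak_rate), 1))   # 4.2
```

Under these assumptions the whole fleet fails inside a window of about four minutes, which is why replica count alone offers no protection against this kind of fault.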

The team anchored on an early hypothesis that AWS Application Load Balancers were failing because both 'backend unreachable' and 'backend timed out' responses appeared in ALB logs, and a single AWS customer report seemed to corroborate it.

The SLO burn alert was not configured to page because the SLO feature was still in beta; the alert was instead picked up by a non-on-call engineer in Europe, which led to ad-hoc handoffs and unclear ownership.
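
For readers unfamiliar with burn alerts, the following sketch shows one common way to compute a burn rate and decide whether to page. It is an illustration under stated assumptions, not Honeycomb's implementation: the 99.9% SLO target, the event counts, and the paging threshold of 10 are all hypothetical values.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget ratio (1 - SLO target).

    A value of 1.0 means the budget is being spent exactly as fast as the SLO allows.
    """
    if total_events <= 0:
        return 0.0
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (bad_events / total_events) / error_budget


def should_page(rate: float, page_threshold: float = 10.0) -> bool:
    """Page on-call when the budget burns faster than the (assumed) threshold."""
    return rate >= page_threshold


# Hypothetical brownout window: 2% of events rejected against an assumed 99.9% target.
rate = burn_rate(bad_events=2_000, total_events=100_000, slo_target=0.999)
print(round(rate, 1))      # 20.0
print(should_page(rate))   # True
```

With these assumed numbers a 2% rejection rate burns the error budget about twenty times faster than the target allows, which is the kind of signal that arguably belongs on the paging path.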

Internal telemetry from the ingest workers was lost when those workers crashed, so the team could not see the OOMs in their own dogfood data and was easily misled into thinking ingest workers themselves were healthy.

Resolution

An engineer with fresh eyes cross-checked their mental model against service data and discovered that the ingest backends were restarting and that their memory usage was climbing steadily. The team reverted the bad commit, pushed a fixed release, and confirmed that memory usage stayed flat with no further crashes.
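
The symptom described here, memory climbing on every backend at once, can be checked mechanically. The sketch below fits a least-squares slope to each host's memory samples and flags the fleet only when all hosts are trending upward. The host names, sample values, and the `min_slope` threshold are hypothetical, and this is one possible check rather than the method the team used.

```python
def slope_per_sample(values):
    """Least-squares slope of evenly spaced samples, in units per sample interval."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def fleet_is_leaking(memory_by_host, min_slope):
    """True when every host's memory grows by at least min_slope per sample."""
    return bool(memory_by_host) and all(
        slope_per_sample(samples) >= min_slope
        for samples in memory_by_host.values()
    )


# Hypothetical memory samples in MiB, taken at a fixed interval.
leaking = {"ingest-1": [100, 110, 121, 130], "ingest-2": [98, 109, 119, 131]}
mixed = {"ingest-1": [100, 110, 121, 130], "ingest-2": [100, 101, 100, 101]}

print(fleet_is_leaking(leaking, min_slope=5))  # True
print(fleet_is_leaking(mixed, min_slope=5))    # False
```

Requiring the trend on all hosts distinguishes a release-wide leak from a single misbehaving process. A real check would also need to handle the drop in memory that follows each restart.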

Timeline

20:00 MITIG
A release containing the unnoticed memory leak is deployed. Memory begins climbing slowly on all ingest backends.
ingest
08:00 DETECT
First brownout: all ingest backends OOM within minutes of each other; 1-3% of telemetry is rejected for roughly 20 minutes.
ingest
08:05 DETECT
SLO burn alert fires within minutes. Because it is non-paging, the alert is picked up by an engineer in Europe who is not on-call.
slo
08:30 INVEST
EU engineer investigates, escalates to the US on-call, and makes a public Slack post asking other AWS customers about errors. No formal incident is declared.
ingest
10:00 INVEST
Team narrows on the hypothesis that AWS ALBs are failing, supported by 'backend unreachable' and 'backend timed out' messages in ALB logs and a single corroborating AWS customer report.
alb
11:00 MITIG
An AWS support ticket is filed; the team prepares for a 12+ hour wait while the roughly 20-minute brownouts keep recurring on a multi-hour cycle.
alb
16:00 INVEST
US Pacific engineers come online and brainstorm fallback options like switching back to ELBs in case the ALB theory is correct.
alb
18:00 INVEST
A fresh-eyed engineer cross-checks their mental model against service data and discovers the actual symptom: process restarts and climbing memory across all ingest backends.
ingest
19:30 MITIG
The bad commit is reverted and a fixed release is pushed.
ingest
20:00 RESOLV
Memory usage stays flat on the new release; no further crashes occur. Service confirmed stable.
ingest

Attribution

Honeycomb

By Engineering

Published Nov 6, 2019

Lessons

A memory leak that proceeds at the same rate across all replicas turns from a single-host fault into a synchronized cluster-wide outage; per-host failure isolation does nothing in this regime.
Confirmation bias gets stronger when the dominant theory has any external validation; a single supportive AWS customer comment was enough to redirect hours of effort toward an innocent component.
Crashing workers lose the in-flight telemetry that would explain why they are crashing; backpressure or graceful shutdown that flushes telemetry first preserves the evidence the team needs.
Beta features used to detect production incidents should still be wired into the paging path, not the lower-priority alerting path, because real incidents do not respect feature maturity.

Action items

Reverted the bad commit and pushed a fix; verified memory usage stays constant going forward.
Promote user-facing SLO alerts to actually page on-call the same way end-to-end black-box probers do.
Treat high process crash, panic, and OOM rates as diagnostic signals visible to debuggers, even when not used for paging.
Consider adding backpressure (returning unhealthy status codes) when ingest workers are resource-constrained, so internal telemetry continues flowing instead of being lost in the crash.
Lower the bar for declaring incidents and challenging assumptions: prefer declaring an unnecessary incident over missing a real one.
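
The backpressure action item above could take roughly the following shape: a health check that reports the worker as unhealthy once memory use approaches its limit, so the load balancer stops routing to it while the process is still alive to flush telemetry. This is a minimal sketch under assumptions. The function name, the 90% threshold, and the way memory usage and the limit are obtained are all hypothetical and not taken from the original write-up.

```python
def health_status(rss_bytes: int, limit_bytes: int, shed_fraction: float = 0.9) -> int:
    """Return the HTTP status a health-check endpoint would serve.

    200 while memory use is comfortably below the limit, 503 once it reaches
    the (assumed) shed_fraction of the limit, signalling the load balancer to
    stop sending new requests to this worker.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return 503 if rss_bytes >= shed_fraction * limit_bytes else 200


# Hypothetical worker with a 1000-byte limit, used only to keep the numbers readable.
print(health_status(rss_bytes=500, limit_bytes=1000))  # 200
print(health_status(rss_bytes=950, limit_bytes=1000))  # 503
```

When a leak is synchronized across the fleet, all workers would likely turn unhealthy at about the same time, so a check like this mainly preserves diagnostic evidence rather than capacity.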