SEV-1 · public access

Misconfigured customer SLO triggers continuous Lambda backfill, exhausting shared Lambda capacity

Honeycomb · Source

Started: Aug 4, 2022
Duration: 9h
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: customers depending on triggers and SLO alerting; query performance degraded
Services: triggers, slo, lambda, basset, retriever

background jobs · cascading failure · manual action · sla breach

Summary

In early August 2022, Honeycomb's SLO measuring trigger run latency began alarming. Engineers initially attributed the issue to known prior reports of customer telemetry with future timestamps that pulled trigger queries into cold storage and onto AWS Lambda. The Incident Commander had been monitoring the future-timestamps issue, and that framing dominated the response. An engineer with fresh context later traced the bulk of Lambda usage to a single misconfigured customer SLO whose SLI never returned valid results, causing Basset to assume a backfill was needed every minute and to relaunch a 60-day cold-storage scan repeatedly. Fixing the SLO on the customer's behalf stopped the bleeding.

Impact

Trigger and SLO alerting were unreliable for roughly nine hours, with the worst impact lasting about four hours. Trigger runs became spaced apart, took longer, or failed. Query performance was also degraded by contention for shared Lambda capacity.

Root cause

A single enterprise customer's SLO had an SLI that never returned valid results (true, false, or null), so Basset had no cache line for it and treated every minute's check as a fresh need to backfill.

Each backfill scanned up to 60 days of cold-storage data via AWS Lambda; with the cycle re-firing every minute, this consumed the bulk of platform-wide Lambda capacity.

Triggers and SLOs shared the same Lambda capacity pool, so a single SLO's runaway backfill degraded triggers and queries across the platform.

A pre-existing concern that a different customer's future timestamps were inflating trigger Lambda use produced a strong, plausible framing. The engineer carrying that context became Incident Commander, which propagated the framing across responders.

Most alert signals appeared to validate the future-timestamps theory; the discrepancy between trigger-attributed Lambda usage and platform total Lambda usage went unnoticed for hours.
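The backfill loop described above can be illustrated with a small simulation. This is only a sketch: Basset's internals are not described in the source, so `count_backfills`, the single-entry cache, and the `remember_invalid` flag are hypothetical names. The flag stands for one possible correction (recording a cache line even when the SLI result is invalid), not necessarily the one Honeycomb shipped.

```python
BACKFILL_DAYS = 60        # scan window stated in the write-up
MINUTES_PER_DAY = 24 * 60


def count_backfills(minutes: int, sli_is_valid: bool, remember_invalid: bool) -> int:
    """Count full backfills launched over `minutes` one-minute evaluation cycles.

    Hypothetical model: a backfill is launched whenever no cache line exists,
    and a cache line is written only after a valid SLI result, unless
    `remember_invalid` is set.
    """
    has_cache_line = False
    backfills = 0
    for _ in range(minutes):
        if has_cache_line:
            continue  # cache line present: no full backfill this minute
        backfills += 1  # no cache line: relaunch the full cold-storage scan
        if sli_is_valid or remember_invalid:
            has_cache_line = True
    return backfills


healthy = count_backfills(MINUTES_PER_DAY, sli_is_valid=True, remember_invalid=False)
runaway = count_backfills(MINUTES_PER_DAY, sli_is_valid=False, remember_invalid=False)
guarded = count_backfills(MINUTES_PER_DAY, sli_is_valid=False, remember_invalid=True)

# Upper bound on days of cold data requested per day by the runaway SLO.
runaway_days_scanned = runaway * BACKFILL_DAYS
```

Under these assumptions a healthy SLO backfills once, while the misconfigured one relaunches the scan on every cycle, up to 1,440 times per day.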

Resolution

An engineer not previously involved reinvestigated from first principles and traced the dominant Lambda load to a single customer SLO. The team then patched the SLO on the customer's behalf, with the customer looped in. Defaults were later updated to clamp future timestamps more aggressively.
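The check that broke the framing, comparing Lambda use attributed to triggers against platform-wide Lambda use, can be sketched as a simple ratio test. The function names, the 50% threshold, and the sample figures below are illustrative assumptions, not values from the incident.

```python
def explained_share(suspect_usage: float, total_usage: float) -> float:
    """Fraction of total usage that the suspected source accounts for."""
    if total_usage <= 0:
        return 0.0
    return suspect_usage / total_usage


def theory_explains_load(suspect_usage: float, total_usage: float,
                         threshold: float = 0.5) -> bool:
    """True if the suspected source covers at least `threshold` of the load.

    The threshold is an arbitrary illustrative cut-off.
    """
    return explained_share(suspect_usage, total_usage) >= threshold


# Made-up numbers: triggers account for 200 of 1000 concurrent invocations,
# so the trigger-centric theory leaves most of the load unexplained.
trigger_theory_holds = theory_explains_load(suspect_usage=200, total_usage=1000)
```

A result of `False` does not identify the real source. It only signals that the dominant theory leaves most of the load unexplained and that the search should widen.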

Timeline

12:00 MITIG
Roughly a week before the incidents, an enterprise customer is observed sending telemetry with timestamps very far in the future. Short trigger queries begin consistently using Lambda-backed cold storage, coupling trigger performance to other query types.
ingest
15:35 DETECT
Around 11:35 a.m. ET, the SLO measuring trigger runs starts alarming. Trigger runs become spaced apart, take longer, or fail.
slo
15:50 INVEST
BubbleUp shows the issue distributed across runs. Initial hypothesis: triggers are exhausting Lambda capacity due to the future-timestamps customer.
lambda
17:00 MITIG
Team tweaks query timeouts and internal flags to bring the situation under control; effects are partial and several red herrings consume time.
triggers
19:00 INVEST
Symptoms recur. Most evidence appears to validate the future-timestamps framing inherited from the Incident Commander.
triggers
22:00 INVEST
An engineer who was not on call and not part of the existing context reinvestigates. They notice that triggers' Lambda use does not match overall platform Lambda use.
lambda
22:30 INVEST
Investigation traces the dominant Lambda load to Basset evaluating a single SLO from a large enterprise customer with extensive data.
basset
23:00 INVEST
Discovery: the customer's SLI never returns valid results, so Basset has no cache line and assumes a backfill is needed every minute, scanning up to 60 days of cold data each time.
basset
23:30 MITIG
Team fixes the SLO on the customer's behalf with the customer looped in. The runaway backfill stops.
slo
00:35 RESOLV
Triggers and queries return to normal. Investigation and direct response have spanned roughly nine hours.
triggers

Attribution

Honeycomb

By Engineering

Published Aug 5, 2022

Lessons

Whoever becomes Incident Commander tends to set the dominant theory for everyone else; a strong framing without strong contradicting evidence will hold even when alert signals are merely consistent with it rather than demanding it.
Breaking out of an inadequate dominant theory usually requires an outside perspective that arrives later: the same evidence reads differently when reviewed from the start than when it accumulated in real time.
Shared resource pools across feature areas (triggers, queries, SLOs) make a single misuse anywhere a platform-wide reliability risk; decoupling has cost but also reduces blast radius for edge cases.
Edge-case behavior of legitimate, well-loved features is far more likely to bite than abuse; treating uncommon-but-valid usage as if it were abuse leaves real surprises uncaught.
Adding tweakable controls (disable a trigger, clamp future timestamps) is consistently the cheapest mitigation lever in incidents of this shape.
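The shared-pool lesson above can be made concrete with a minimal sketch of per-feature caps on a shared capacity pool, so that one feature cannot consume everything. This is not how Honeycomb allocates Lambda capacity. `CapacityPool`, the cap values, and the feature names are hypothetical, and the class is not thread-safe.

```python
class CapacityPool:
    """Shared pool of execution slots with a per-feature ceiling (illustrative)."""

    def __init__(self, total: int, caps: dict):
        self.total = total
        self.caps = dict(caps)
        self.in_use = {feature: 0 for feature in caps}

    def try_acquire(self, feature: str) -> bool:
        """Grant one slot unless the feature cap or the pool total is reached."""
        if self.in_use[feature] >= self.caps[feature]:
            return False  # feature hit its own ceiling
        if sum(self.in_use.values()) >= self.total:
            return False  # pool is full overall
        self.in_use[feature] += 1
        return True

    def release(self, feature: str) -> None:
        """Return one slot held by the feature, if any."""
        if self.in_use[feature] > 0:
            self.in_use[feature] -= 1


pool = CapacityPool(total=10, caps={"slo": 4, "triggers": 4, "queries": 6})

# A runaway SLO backfill asks for far more than its share...
slo_granted = sum(pool.try_acquire("slo") for _ in range(10))
# ...but triggers can still obtain their slots afterwards.
triggers_granted = sum(pool.try_acquire("triggers") for _ in range(4))
```

The trade-off matches the lesson: caps leave capacity idle when one feature is quiet, in exchange for a smaller blast radius when another misbehaves.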

Action items

SLO failure-handling behavior corrected so that an SLI that never returns valid results no longer triggers continuous backfills.
Default policy updated to clamp future timestamps more aggressively.
Constraints added at ingest time to restrict how much triggers may depend on Lambda.
Investing in better support tooling for Incident Commanders to reduce cognitive overload.
Looking into ways for on-call engineers to search and categorize feature flags when operating components they are unfamiliar with.
Considering improvements to communication so customers learn directly when their configuration is causing platform-level effects.
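The timestamp-clamping action item could look roughly like the sketch below. The source does not state the actual policy, so the 10-minute allowance, `MAX_FUTURE_SKEW`, and `clamp_timestamp` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical allowance for clock drift; the real default is not given in the source.
MAX_FUTURE_SKEW = timedelta(minutes=10)


def clamp_timestamp(event_time: datetime, now: datetime,
                    max_skew: timedelta = MAX_FUTURE_SKEW) -> datetime:
    """Return event_time, pulled back to now + max_skew if it lies further ahead."""
    latest_allowed = now + max_skew
    return min(event_time, latest_allowed)


now = datetime(2022, 8, 4, 15, 35, tzinfo=timezone.utc)
far_future = now + timedelta(days=365)
slightly_late = now - timedelta(hours=1)

clamped = clamp_timestamp(far_future, now)        # pulled back to now + 10 minutes
unchanged = clamp_timestamp(slightly_late, now)   # past timestamps pass through
```

Only timestamps beyond the allowance are altered, so well-behaved senders with modest clock drift are unaffected.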