SEV-1 · public access

Misconfigured customer SLO triggers continuous Lambda backfill, exhausting shared Lambda capacity

Honeycomb · Source

Started: Aug 4, 2022
Duration: 9h
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: customers depending on triggers and SLO alerting; query performance degraded
Services: triggers, slo, lambda, basset, retriever
background jobs, cascading failure, manual action, sla breach

Summary

In early August 2022, Honeycomb's SLO measuring trigger run latency began alarming. Engineers initially attributed the issue to known prior reports of customer telemetry with future timestamps that pulled trigger queries into cold storage and onto AWS Lambda. The Incident Commander had been monitoring the future-timestamps issue, and that framing dominated the response. An engineer with fresh context later traced the bulk of Lambda usage to a single misconfigured customer SLO whose SLI never returned valid results, causing Basset to assume a backfill was needed every minute and to relaunch a 60-day cold-storage scan repeatedly. Fixing the SLO on the customer's behalf stopped the bleeding.

Impact

Trigger and SLO alerting were unreliable for roughly nine hours, with the worst impact lasting about four hours. Trigger runs were spaced further apart, took longer, or failed; query performance was also degraded by contention for shared Lambda capacity.

Root cause

A single enterprise customer's SLO had an SLI that never returned valid results (true, false, or null), so Basset had no cache line for it and treated every minute's check as a fresh need to backfill.

Each backfill scanned up to 60 days of cold-storage data via AWS Lambda; with the cycle re-firing every minute, this consumed the bulk of platform-wide Lambda capacity.
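
The loop described above can be illustrated with a minimal Python sketch. It is not Basset's actual code: the cache structure, the `evaluate_sli` stub, and the function names are hypothetical. The only facts taken from the write-up are the once-a-minute check, the missing cache line, and the 60-day scan window.

```python
from datetime import timedelta

BACKFILL_WINDOW = timedelta(days=60)  # from the write-up: up to 60 days per scan


def is_valid(result):
    """An SLI is expected to yield true, false, or null."""
    return result is True or result is False or result is None


def evaluate_sli(slo_id):
    """Hypothetical stub: the misconfigured SLI never yields a valid result."""
    return "invalid"


def run_checks(slo_id, minutes, cache):
    """Simulate once-a-minute checks; return how many backfills were launched."""
    backfills = 0
    for _ in range(minutes):
        if slo_id in cache:
            continue  # a cache line exists, so no backfill is needed
        backfills += 1  # no cache line: assume history is missing and rescan it
        result = evaluate_sli(slo_id)
        if is_valid(result):
            cache[slo_id] = result  # never reached for the broken SLI
    return backfills
```

Under these assumptions, `run_checks("slo-1", 60, {})` launches 60 backfills in one hour, each requesting a scan of up to 60 days of cold storage, whereas an SLO that already has a cache line launches none.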

Triggers and SLOs shared the same Lambda capacity pool, so a single SLO's runaway backfill degraded triggers and queries across the platform.
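
A toy model can make the blast-radius effect concrete. The sketch below is an assumption-laden illustration, not Honeycomb's scheduler: the `LambdaPool` class, the feature names used as keys, and the capacity numbers are all invented for the example.

```python
class LambdaPool:
    """Toy concurrency pool with optional per-feature caps (hypothetical)."""

    def __init__(self, total, caps=None):
        self.total = total
        self.caps = caps or {}  # feature -> max concurrent invocations
        self.in_use = {}        # feature -> invocations currently held

    def acquire(self, feature):
        """Grant one invocation slot if the pool and the feature cap allow it."""
        mine = self.in_use.get(feature, 0)
        if sum(self.in_use.values()) >= self.total:
            return False  # pool exhausted for everyone
        if feature in self.caps and mine >= self.caps[feature]:
            return False  # this feature has used up its own budget
        self.in_use[feature] = mine + 1
        return True


def saturate(pool, feature, attempts):
    """Request `attempts` slots for one feature; return how many were granted."""
    return sum(pool.acquire(feature) for _ in range(attempts))
```

With a fully shared pool of 100 slots, a runaway SLO workload that requests 1,000 slots takes all 100, and the next trigger request is refused. With a cap such as `LambdaPool(100, {"slo": 40})`, the same workload stops at 40 and triggers can still acquire capacity. The cost is that a capped feature cannot borrow idle capacity during legitimate peaks.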

A pre-existing concern about a different customer's future timestamps inflating trigger Lambda use produced a strong, plausible framing, and the engineer carrying that context became Incident Commander, which propagated the framing across responders.

Most alert signals appeared to validate the future-timestamps theory; the discrepancy between trigger-attributed Lambda usage and platform total Lambda usage went unnoticed for hours.
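
One way to surface this kind of discrepancy early is to check whether the suspected feature actually accounts for most of the total load. The sketch below is a hypothetical illustration: the function names, the 50% threshold, and the sample numbers are assumptions, not figures from the incident.

```python
def unattributed_share(total_invocations, by_feature):
    """Fraction of total Lambda invocations not explained by known features."""
    if total_invocations <= 0:
        return 0.0
    explained = sum(by_feature.values())
    return max(0.0, (total_invocations - explained) / total_invocations)


def framing_explains_load(total_invocations, by_feature, suspect, threshold=0.5):
    """True only if the suspected feature accounts for at least `threshold` of the total."""
    if total_invocations <= 0:
        return False
    return by_feature.get(suspect, 0) / total_invocations >= threshold
```

For example, with made-up numbers, if triggers account for 1,500 of 10,000 invocations, `framing_explains_load(10_000, {"triggers": 1_500}, "triggers")` returns `False`, which would be a prompt to look for another consumer.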

Resolution

An engineer not previously involved reinvestigated from first principles and traced the dominant Lambda load to a single customer SLO. The team then patched the SLO on the customer's behalf, with the customer looped in. Defaults were later updated to clamp future timestamps more aggressively.

Lessons

  • Whoever becomes Incident Commander tends to set the dominant theory for everyone else; a strong framing without strong contradicting evidence will hold even when alert signals are merely consistent with it rather than demanding it.
  • Breaking out of an inadequate dominant theory usually requires an outside perspective that arrives later, because the timeline of evidence reads differently when reviewed from the start than when viewed from the present.
  • Shared resource pools across feature areas (triggers, queries, SLOs) make a single misuse anywhere a platform-wide reliability risk; decoupling has cost but also reduces blast radius for edge cases.
  • Edge-case behavior of legitimate, well-loved features is far more likely to bite than abuse; treating uncommon-but-valid usage as if it were abuse leaves real surprises uncaught.
  • Adding tweakable controls (disable a trigger, clamp future timestamps) is consistently the cheapest mitigation lever in incidents of this shape.
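
As an illustration of the timestamp-clamping control mentioned in the last lesson, here is a minimal sketch. The write-up does not say how Honeycomb clamps or what limit it uses, so the `MAX_FUTURE_SKEW` value and the choice to clamp to `now + max_skew` are assumptions.

```python
from datetime import datetime, timedelta, timezone

MAX_FUTURE_SKEW = timedelta(minutes=10)  # hypothetical default, not Honeycomb's value


def clamp_event_time(event_time, now, max_skew=MAX_FUTURE_SKEW):
    """Pull timestamps that lie too far in the future back to now + max_skew."""
    limit = now + max_skew
    return min(event_time, limit)  # past and near-future timestamps pass through
```

An event stamped three days ahead of `now` would be stored at `now + 10 minutes` under these assumptions, while an event from an hour ago is left unchanged. Making `max_skew` a tunable parameter is what turns the clamp into a lever that can be tightened during an incident.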

Action items

  • SLO failure-handling behavior corrected so that an SLI that never returns valid results no longer triggers continuous backfills.
  • Default policy updated to clamp future timestamps more aggressively.
  • Constraints added at ingest time to restrict how much triggers may depend on Lambda.
  • Investing in better support tooling for Incident Commanders to reduce cognitive overload.
  • Looking into ways for on-call engineers to search and categorize feature flags when operating components they are unfamiliar with.
  • Considering improvements to communication so customers learn directly when their configuration is causing platform-level effects.
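
The first action item could take several forms; the write-up does not describe the actual change. One possible shape, sketched here with hypothetical names, is to cache the fact that an evaluation produced no valid result, so that the absence of a valid value is no longer read as missing history.

```python
INVALID = object()  # sentinel: "evaluated, but the SLI gave no valid result"


def check_slo(slo_id, cache, evaluate, backfill):
    """Run one periodic check; backfill only when nothing at all is cached."""
    if slo_id in cache:
        return cache[slo_id]
    backfill(slo_id)  # first sighting: scan history once
    result = evaluate(slo_id)
    valid = result is True or result is False or result is None
    cache[slo_id] = result if valid else INVALID  # cache the failure as well
    return cache[slo_id]
```

Running this check 60 times against an SLI that always returns an invalid value launches a single backfill instead of 60. A production version would likely also need to expire the marker or surface the misconfiguration to the customer, so that a corrected SLO is picked up again.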