SEV-1 · public access

BI telemetry change silently breaks 94% of trigger notification emails for four days

Honeycomb · Source

Started
Nov 18, 2021
Duration
4d 14h 6m
Users affected
Not disclosed
Revenue impact
Not disclosed
Blast radius
all customers depending on email trigger notifications globally
Services
email-notifications, triggers, third-party-email-sdk
email delivery, missing monitoring, regression from deploy, sla breach, third-party saas

Summary

On November 18, 2021, between 00:50 and 00:56 UTC, an update intended to improve business-intelligence telemetry from production was deployed. The update used the third-party email SDK in a way that made sends fail, and the SDK reported those failures in its response object rather than in Go's idiomatic error return value, so the application never saw them. About 94.1% of trigger notification emails silently failed to send for the next four days. The instrumentation gap meant the email SLO did not detect the failures, automated tests passed because the third-party API was mocked, and the issue went unnoticed until a customer reported it on November 22 at 14:56 UTC.

Impact

Approximately 94.1% of email notifications for triggers were not delivered between November 18 at 00:50 UTC and November 22 at 14:56 UTC. The SLO that was supposed to track email deliverability did not flag the failures because the failure mode was not surfaced by the existing instrumentation.

Root cause

A new line of code added BI telemetry metadata to a request made through the third-party email SDK, in a way that subtly differed from how the same SDK was used elsewhere in the product. The requests most likely failed because of how the metadata values were escaped.

The SDK signaled the resulting failure through the response object, not through the error value that Go functions conventionally return alongside their result, and the existing application logic did not inspect the response for hidden errors.
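
The failure mode described above can be sketched in a few lines of Go. This is a minimal illustration, not the actual fix: the `Response` type, the `sendFunc` signature, and the `sendChecked` helper are hypothetical stand-ins, and the sketch assumes the hidden error shows up as a non-2xx status code on the response, which the source does not specify.

```go
package main

import (
	"errors"
	"fmt"
)

// Response is a hypothetical stand-in for the SDK's response type.
type Response struct {
	StatusCode int
	Body       string
}

// sendFunc is a hypothetical stand-in for the SDK's send call.
type sendFunc func(to, subject string) (*Response, error)

// sendChecked treats a failed response as an error even when the SDK
// call itself returned a nil error.
func sendChecked(send sendFunc, to, subject string) error {
	resp, err := send(to, subject)
	if err != nil {
		return fmt.Errorf("email send: %w", err)
	}
	if resp == nil {
		return errors.New("email send: nil response")
	}
	// Assumption: a non-2xx status is how the hidden failure is exposed.
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("email send: rejected with status %d: %s", resp.StatusCode, resp.Body)
	}
	return nil
}

func main() {
	// A send that "succeeds" by Go's conventions but was rejected upstream.
	rejecting := func(to, subject string) (*Response, error) {
		return &Response{StatusCode: 400, Body: "invalid metadata"}, nil
	}
	fmt.Println(sendChecked(rejecting, "oncall@example.com", "trigger fired"))
}
```

Routing every call through one wrapper like this keeps the SDK-specific check in a single place instead of relying on each call site to remember it.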

The request and response were not instrumented and auto-instrumentation did not reach into this SDK, so the failures were invisible in Honeycomb's own dogfood and kibble environments.

Automated tests mocked the third-party API; manual testing covered account-management email scenarios but not trigger notifications.
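
One way to exercise this kind of code path without mocking the client is to run it against a local stub server that can reject requests. The sketch below uses Go's standard `net/http/httptest` package as an analogy, since `http.Post` also returns a nil error for a 400 response. The `postEmail` and `newStubAPI` helpers, the payload shape, and the `bad-metadata` marker are all hypothetical; the email partner's real testing interface is not described in the source.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// postEmail is a hypothetical sender. It treats any non-2xx status as a
// failure, even though http.Post reports such responses with a nil error.
func postEmail(url, payload string) error {
	resp, err := http.Post(url, "application/json", strings.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("email API returned status %d", resp.StatusCode)
	}
	return nil
}

// newStubAPI starts a local stand-in for the email API. It rejects any
// payload containing the hypothetical marker "bad-metadata".
func newStubAPI() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		if strings.Contains(string(body), "bad-metadata") {
			http.Error(w, "invalid metadata", http.StatusBadRequest)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}))
}

func main() {
	srv := newStubAPI()
	defer srv.Close()
	fmt.Println(postEmail(srv.URL, `{"to":"a@example.com"}`))
	fmt.Println(postEmail(srv.URL, `{"meta":"bad-metadata"}`))
}
```

Unlike a mock that always returns success, the stub makes the test fail when the payload is one the server would reject.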

The email SLO was based on signals that did not catch this failure mode, so it gave false confidence that emails were being delivered.
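
To make the false confidence concrete, the sketch below computes a success ratio over the same set of send attempts using two different definitions of "good". The source does not say which signals the real SLO used; this assumes, purely for illustration, that it counted only the returned Go error. The `sendEvent` type, both predicates, and the sample of 1000 attempts with 941 response-level failures are hypothetical.

```go
package main

import "fmt"

// sendEvent is a hypothetical record of one email send attempt.
type sendEvent struct {
	GoErr      bool // the SDK call returned a non-nil error
	ResponseOK bool // the response object reported success
}

// sli returns the fraction of events the predicate counts as good.
func sli(events []sendEvent, good func(sendEvent) bool) float64 {
	if len(events) == 0 {
		return 1.0
	}
	n := 0
	for _, e := range events {
		if good(e) {
			n++
		}
	}
	return float64(n) / float64(len(events))
}

// errOnly counts an event as good whenever no Go error was returned.
func errOnly(e sendEvent) bool { return !e.GoErr }

// errAndResponse also requires the response object to report success.
func errAndResponse(e sendEvent) bool { return !e.GoErr && e.ResponseOK }

// sampleEvents builds 1000 attempts, 941 of which fail only in the response.
func sampleEvents() []sendEvent {
	events := make([]sendEvent, 0, 1000)
	for i := 0; i < 1000; i++ {
		events = append(events, sendEvent{GoErr: false, ResponseOK: i >= 941})
	}
	return events
}

func main() {
	events := sampleEvents()
	fmt.Printf("error-only SLI:     %.3f\n", sli(events, errOnly))        // 1.000
	fmt.Printf("error+response SLI: %.3f\n", sli(events, errAndResponse)) // 0.059
}
```

Under the error-only definition the indicator reads 100% while 94.1% of sends are failing, which is the gap the paragraph above describes.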

Resolution

Once the customer report arrived, the team identified the single new line of code that could have changed trigger email behavior and inferred that the metadata values were probably not being escaped correctly. They confirmed the fix in the dogfood environment and rolled it out, then added instrumentation around the request and response and validated that the SLO now detects the previously hidden errors.
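
Instrumentation around the request and response might look roughly like the following. This is a sketch under stated assumptions: the field names are hypothetical, a plain map stands in for whatever tracing library attaches fields to an event, and `Response` and `sendFunc` are stand-ins for the SDK's types.

```go
package main

import (
	"fmt"
	"time"
)

// Response is a hypothetical stand-in for the SDK's response type.
type Response struct {
	StatusCode int
}

// sendFunc is a hypothetical stand-in for the SDK's send call.
type sendFunc func(to string) (*Response, error)

// instrumentedSend calls send and returns fields describing both the Go
// error and the response, so either kind of failure is visible in a trace.
func instrumentedSend(send sendFunc, to string) (map[string]interface{}, error) {
	start := time.Now()
	resp, err := send(to)

	fields := map[string]interface{}{
		"email.duration_ms": time.Since(start).Milliseconds(),
		"email.go_error":    err != nil,
		"email.accepted":    false,
	}
	if resp != nil {
		fields["email.response_status"] = resp.StatusCode
		// Assumption: a 2xx status means the email API accepted the request.
		fields["email.accepted"] = err == nil && resp.StatusCode >= 200 && resp.StatusCode < 300
	}
	return fields, err
}

func main() {
	rejecting := func(to string) (*Response, error) {
		return &Response{StatusCode: 400}, nil
	}
	fields, err := instrumentedSend(rejecting, "oncall@example.com")
	fmt.Println(fields["email.accepted"], fields["email.response_status"], err)
}
```

A field such as the hypothetical `email.accepted` gives an SLO something to measure that reflects the response, not only the returned error.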

Lessons

  • Tests that mock third-party APIs and SLOs that measure the wrong dimensions can both quietly endorse a broken system; redundancy in observation does not equal redundancy in coverage if both observation paths share the same blind spot.
  • Building observability into the development process, by inspecting traces from before and after the change, would have surfaced the failure as soon as the engineer looked at an event trace.
  • Boundaries with third-party SDKs are exactly where you cannot rely on language idioms for error handling; SDK-specific error patterns must be wrapped in instrumentation rather than trusted.
  • Customer reports are not a monitoring strategy; if your SLO would have missed an outage of this magnitude, the SLO is the bug.

Action items

  • Added instrumentation points across the relevant email-sending paths so failures show up in tracing.
  • Adopted an integration testing interface provided by the email partner to cover this code path in automated tests.
  • Verified the email SLO now sees previously hidden failure modes.
  • Cleaned up additional usages of this SDK to reduce the surface area of similar gaps.