Back to Open Playback
SEV-1public access

Four API hosts reject 10% of event traffic for 1.5 hours, undetected for a week

Honeycomb · Source

Started
Apr 16, 2025
Duration
1h 30m
Users affected
Not disclosed
Revenue impact
Not disclosed
Blast radius
events to four API hosts globally; about 10% of event traffic for ~1.5 hours
Services
api, ingest, cache
cachedata lossmissing monitoringregression from deploysla breach

Join the waitlist

Aftermath helps you ship structured post-mortems like this one for your own incidents. Encore keeps narrative, timeline, lessons, and action items in one place so the document stays useful after the incident is closed. Join the waitlist on the homepage when you want that workflow for your organization.

Join the waitlist

Summary

On April 23, 2025, Honeycomb discovered an incident that had occurred approximately one week earlier. On April 16, four of Honeycomb's API servers rejected all traffic with 500 or 401 HTTP responses for roughly 1.5 hours, dropping about 10% of event traffic. The incident went unnoticed for around a week until a customer reported missing data. Because Honeycomb retains events for 60 days and includes deployment version as a column on every event, the team was able to look back over a week and reconstruct the failure: an unexpected deployment-version mismatch on those hosts, combined with a caching bug that swallowed a database error and a cache that returned a null value as success. The team forensically diagnosed the issue using the same observability data customers rely on.

Impact

Approximately 10% of event traffic to Honeycomb's API was rejected with 500 or 401 errors for about 1.5 hours on April 16, 2025. Had 401 errors been part of the SLO at the time, the incident would have consumed roughly 81.6% of the quarterly error budget.

Root cause

Four API servers came up with an unexpected deployment version after a deploy mismatch in the deployment process, leaving them in a state that broke handling of the events they received.

A caching mechanism cached a null value and did not report that as an error, masking the underlying database failure rather than surfacing it.

Retry and circuit-breaker logic worked as intended in the sense that they prevented the database from being hammered, but they also hid the critical error that was occurring on those hosts.

The SLO did not alert because 401s were treated as occasional blips that do not indicate user pain; in this case, however, customers genuinely could not use the service.

Modern systems with automated remediations (load balancing, autoscaling, database failover) tend to mask issues that never reach a human operator; without long-retention forensic data, the incident might never have been investigated.

Resolution

The incident self-resolved on April 16 when the affected hosts were rotated out by normal deploy and autoscaling processes. After the customer report on April 23, the team used 60 days of retained high-cardinality event data to reconstruct what happened. The deployment process was fixed so new hosts cannot come up with an unexpected version, and the caching mechanism was changed to propagate database errors instead of caching null as success.

Lessons

  • Long data retention turns post-hoc forensics from impossible to routine; in modern self-healing systems, the incident may end before any human knows it started, and the only path to learning is replaying the data.
  • SLOs measure intent; a 401 that 'normally' represents a blip can represent a total outage for some users in an unusual deployment state, and an SLO that ignores the symptom misses the incident.
  • Retry and circuit-breaker logic that protects downstream dependencies can simultaneously hide the upstream symptom, especially when paired with caches that store negative results as success.
  • Modern systems' graceful degradation makes 'caught by automation' the most common failure path, which means observability data with high-cardinality fields like deployment version and host is the principal feedback loop.
  • Deploy version as a high-cardinality column on every event is a remarkably high-leverage instrumentation choice for diagnosing deployment-related faults after the fact.

Action items

  • Fixed the deployment process so new hosts cannot come up with an unexpected version.
  • Fixed the caching mechanism so database errors propagate properly instead of being cached as null successes.
  • Reviewed SLO definitions in light of how 401 responses can represent customer pain in some failure modes.