SEV-1public access

Four API hosts reject 10% of event traffic for 1.5 hours, undetected for a week

Honeycomb · Source

Started: Apr 16, 2025
Duration: 1h 30m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: events to four API hosts globally; about 10% of event traffic for ~1.5 hours
Services: api, ingest, cache

cachedata lossmissing monitoringregression from deploysla breach

Join the waitlist

Aftermath helps you ship structured post-mortems like this one for your own incidents. Encore keeps narrative, timeline, lessons, and action items in one place so the document stays useful after the incident is closed. Join the waitlist on the homepage when you want that workflow for your organization.

Join the waitlist

Summary

On April 23, 2025, Honeycomb discovered an incident that had occurred approximately one week earlier. On April 16, four of Honeycomb's API servers rejected all traffic with 500 or 401 HTTP responses for roughly 1.5 hours, dropping about 10% of event traffic. The incident went unnoticed for around a week until a customer reported missing data. Because Honeycomb retains events for 60 days and includes deployment version as a column on every event, the team was able to look back over a week and reconstruct the failure: an unexpected deployment-version mismatch on those hosts, combined with a caching bug that swallowed a database error and a cache that returned a null value as success. The team forensically diagnosed the issue using the same observability data customers rely on.

Impact

Approximately 10% of event traffic to Honeycomb's API was rejected with 500 or 401 errors for about 1.5 hours on April 16, 2025. Had 401 errors been part of the SLO at the time, the incident would have consumed roughly 81.6% of the quarterly error budget.

Root cause

Four API servers came up with an unexpected deployment version after a deploy mismatch in the deployment process, leaving them in a state that broke handling of the events they received.

A caching mechanism cached a null value and did not report that as an error, masking the underlying database failure rather than surfacing it.

Retry and circuit-breaker logic worked as intended in the sense that they prevented the database from being hammered, but they also hid the critical error that was occurring on those hosts.

The SLO did not alert because 401s were treated as occasional blips that do not indicate user pain; in this case, however, customers genuinely could not use the service.

Modern systems with automated remediations (load balancing, autoscaling, database failover) tend to mask issues that never reach a human operator; without long-retention forensic data, the incident might never have been investigated.

Resolution

The incident self-resolved on April 16 when the affected hosts were rotated out by normal deploy and autoscaling processes. After the customer report on April 23, the team used 60 days of retained high-cardinality event data to reconstruct what happened. The deployment process was fixed so new hosts cannot come up with an unexpected version, and the caching mechanism was changed to propagate database errors instead of caching null as success.

Timeline

00:00MITIG
Four API servers come up with an unexpected deployment version due to a deploy-process flaw, putting them in a state where they reject most traffic.
api
00:05DETECT
Affected hosts begin returning 500 and 401 responses for all traffic. About 10% of event traffic is being rejected. SLO does not fire because 401s are treated as benign.
api
00:30DETECT
Caching mechanism caches a null value and reports it as success. Retries and circuit-breaker prevent the database from being hammered but hide the symptom from operators.
cache
01:30RESOLV
Affected hosts are rotated out by normal deploy and autoscaling cycles. Service returns to normal. No human operator knows the incident occurred.
api
15:00DETECT
A customer reports missing data from April 16. Honeycomb opens a major incident to investigate retroactively.
api
15:30INVEST
Team uses 60-day retention to look back across all events. Comparing host, deployment version, and HTTP response columns immediately reveals the version mismatch on four hosts.
api
17:00INVEST
Investigation traces the symptom to a caching mechanism that stored a null value as success and a deploy process that allowed hosts to come up at the wrong version.
cache
19:00MITIG
Deploy process and caching error-propagation fixes prepared and queued for rollout.
api
12:00RESOLV
Fixes shipped. SLO definitions reviewed for how 401 responses should be treated going forward.
api

Attribution

Honeycomb

By Engineering

Published Apr 16, 2025

View original source

Lessons

Long data retention turns post-hoc forensics from impossible to routine; in modern self-healing systems, the incident may end before any human knows it started, and the only path to learning is replaying the data.
SLOs measure intent; a 401 that 'normally' represents a blip can represent a total outage for some users in an unusual deployment state, and an SLO that ignores the symptom misses the incident.
Retry and circuit-breaker logic that protects downstream dependencies can simultaneously hide the upstream symptom, especially when paired with caches that store negative results as success.
Modern systems' graceful degradation makes 'caught by automation' the most common failure path, which means observability data with high-cardinality fields like deployment version and host is the principal feedback loop.
Deploy version as a high-cardinality column on every event is a remarkably high-leverage instrumentation choice for diagnosing deployment-related faults after the fact.

Action items

Fixed the deployment process so new hosts cannot come up with an unexpected version.
Fixed the caching mechanism so database errors propagate properly instead of being cached as null successes.
Reviewed SLO definitions in light of how 401 responses can represent customer pain in some failure modes.