SEV-1 · public access

Two partial API outages from RDS CPU saturation and runaway cache-refresh queries

Honeycomb · Source

Started: Oct 4, 2018
Duration: 7d 7h 38m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: API customers globally during two ~1-hour windows plus a 2-minute maintenance interruption
Services: mysql-rds, api, ingest
cache · capacity shortfall · database · thundering herd

Summary

On October 4, 2018, Honeycomb suffered a partial API outage when its RDS MySQL instance stalled at roughly 90% CPU. A second, less severe incident hit on October 11. In both cases the proximate trigger was a small event that coincided with a baseline CPU level that had drifted upward to 30 to 40% over months, leaving no headroom. The deeper cause was cache-refresh queries (rate limit, sampling, blacklist) that fanned out into thousands of concurrent queries instead of running one at a time. The team shipped a cache fix that coalesces refreshes, then upgraded the RDS instance from m4.xlarge to m5.2xlarge during a brief maintenance window on October 12, after which service remained stable.

Impact

Two partial API outages: roughly 54 minutes on October 4 (21:02 to 21:56 UTC) and 62 minutes on October 11 (15:00 to 16:02 UTC), plus a planned ~2-minute interruption on October 12 (04:38 to 04:40 UTC) for the RDS upgrade. During the partial outages, concurrent in-flight requests spiked into the thousands instead of the normal 1 to 3.
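
The window lengths above can be checked with a few lines of date arithmetic. This is only a sketch: the timestamps are the ones quoted in this section, while the dictionary keys and helper names are made up for illustration. It also suggests that the listed overall duration of 7d 7h 38m corresponds to the span from the start of the first outage to the end of the maintenance window.

```python
from datetime import datetime, timezone


def utc(month, day, hour, minute):
    """Build a UTC timestamp in 2018 (all windows in this incident fall in 2018)."""
    return datetime(2018, month, day, hour, minute, tzinfo=timezone.utc)


# Start and end of each window as quoted in the Impact section.
# The key names are illustrative, not taken from the source.
WINDOWS = {
    "outage_oct_4": (utc(10, 4, 21, 2), utc(10, 4, 21, 56)),
    "outage_oct_11": (utc(10, 11, 15, 0), utc(10, 11, 16, 2)),
    "maintenance_oct_12": (utc(10, 12, 4, 38), utc(10, 12, 4, 40)),
}


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


window_minutes = {name: minutes_between(s, e) for name, (s, e) in WINDOWS.items()}

# Span from the first outage starting to the maintenance window ending.
overall = WINDOWS["maintenance_oct_12"][1] - WINDOWS["outage_oct_4"][0]
overall_hours, remainder = divmod(overall.seconds, 3600)
overall_label = f"{overall.days}d {overall_hours}h {remainder // 60}m"
```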

Root cause

Production RDS CPU baseline had drifted up to 30 to 40% as usage grew, leaving no headroom for normal variability.

Cache-refresh queries for rate limit, sampling, and blacklist data were not coalesced: each request that found the cache stale would issue its own DB query, allowing thousands of identical queries to pile up and amplify any pressure on RDS.

Several coincidentally timed minor events (a SQL surgery, a deploy with a small migration, a user creating about 5,000 teams) each contributed marginally to tipping the database over, but none of them individually explains the failure.

Resolution

A cache fix was deployed so that bounce information (rate limit, sampling, blacklist) was refreshed by a single query at a time rather than fanning out across all in-flight requests, dropping concurrent in-flight refresh queries from thousands to one. An emergency maintenance window then upgraded the production RDS instance from m4.xlarge to m5.2xlarge with about two minutes of total interruption.
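
The coalescing behavior described above can be sketched as a "single-flight" cache: the first request that finds the value stale runs the refresh, and every other concurrent request waits for that result instead of issuing its own query. This is a minimal illustration under stated assumptions, not Honeycomb's actual implementation; the class name, the `load_bounce_info` loader, and the TTL are all hypothetical.

```python
import threading
import time


class SingleFlightCache:
    """Cache whose refresh is run by at most one caller at a time (illustrative)."""

    def __init__(self, loader, ttl_seconds):
        self._loader = loader                  # hypothetical function that queries the database
        self._ttl = ttl_seconds
        self._lock = threading.Lock()          # guards the fields below
        self._value = None
        self._expires_at = float("-inf")       # start out stale
        self._refresh_done = None              # Event for the refresh currently in flight

    def get(self):
        while True:
            with self._lock:
                if time.monotonic() < self._expires_at:
                    return self._value         # fresh: no database query
                if self._refresh_done is None:
                    # No refresh in flight: this caller becomes the leader.
                    done = self._refresh_done = threading.Event()
                    leader = True
                else:
                    # A refresh is already running: wait for it instead of querying.
                    done = self._refresh_done
                    leader = False
            if not leader:
                done.wait()
                continue                       # re-check the cache after the leader finishes
            try:
                value = self._loader()         # the only database query for this refresh
                with self._lock:
                    self._value = value
                    self._expires_at = time.monotonic() + self._ttl
                return value
            finally:
                # Wake the waiters even if the loader raised; one of them will retry.
                with self._lock:
                    self._refresh_done = None
                done.set()


def count_queries(concurrent_requests):
    """Simulate many requests hitting a stale cache at once; return queries issued."""
    queries = 0

    def load_bounce_info():                    # hypothetical stand-in for the DB query
        nonlocal queries
        queries += 1
        time.sleep(0.05)                       # pretend the query is slow
        return {"rate_limit": 100}

    cache = SingleFlightCache(load_bounce_info, ttl_seconds=60)
    start = threading.Barrier(concurrent_requests)

    def request():
        start.wait()                           # release all requests together
        cache.get()

    threads = [threading.Thread(target=request) for _ in range(concurrent_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return queries
```

Calling `count_queries(50)` simulates 50 requests arriving against a stale cache; with coalescing, the loader should run once rather than 50 times. Without the in-flight check, each of those requests would call the loader itself, which is the fan-out pattern described under Root cause.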

Lessons

  • A high steady-state CPU baseline is a reliability bug waiting to happen; once the baseline crosses 30 to 40%, ordinary variance starts producing outages.
  • Cache misses on a hot path must be coalesced or single-flighted; otherwise every concurrent request becomes its own database query during exactly the moments the database can least afford it.
  • Once the engineer-hours spent on a particular database problem cost more than an instance upgrade, the right answer is to spend the money on hardware rather than continuing to optimize.
  • Multiple minor near-coincidences can each look like the trigger of an outage; the actual fragility is usually a structural condition like saturation or absent coalescing.
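
The first lesson could be turned into a simple headroom check. The sketch below is one possible shape for such a check, not something described in the source: the function names are hypothetical, the 30% default is an assumption taken from the lower end of the range quoted above, and the median is used so that a single spike does not count as a shift in the baseline.

```python
from statistics import median

# Assumed warning threshold: lower end of the 30 to 40% range quoted in the lessons.
BASELINE_LIMIT_PERCENT = 30.0


def baseline_cpu(samples):
    """Steady-state CPU estimate: the median of recent utilization samples (percent)."""
    return median(samples)


def headroom_warning(samples, limit=BASELINE_LIMIT_PERCENT):
    """Return True when the steady-state baseline has reached the limit."""
    return baseline_cpu(samples) >= limit


# A healthy baseline with one spike does not trigger the warning...
healthy = [12.0, 15.0, 11.0, 90.0, 14.0]
# ...but a baseline that has drifted into the mid-30s does, spike or no spike.
drifted = [35.0, 36.0, 90.0, 34.0, 37.0]
```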

Action items

  • Cache-refresh logic updated to coalesce concurrent refreshes into a single query at a time.
  • RDS upgraded from m4.xlarge to m5.2xlarge during an emergency maintenance window.
  • Documented miscellaneous stumbling points and on-call processes that caused confusion during the outage.
  • Continued investment in instrumenting the system during outages to learn from each event.