Two partial API outages from RDS CPU saturation and runaway cache-refresh queries
Honeycomb · Source
- Started: Oct 4, 2018
- Duration: 7d 7h 38m
- Users affected: Not disclosed
- Revenue impact: Not disclosed
- Blast radius: API customers globally during two ~1-hour windows plus a 2-minute maintenance interruption
- Services: mysql-rds, api, ingest
Summary
On October 4, 2018, Honeycomb suffered a partial API outage as RDS MySQL stalled at roughly 90% CPU. A second, less severe incident hit on October 11. In both cases the proximate trigger was a small event coinciding with a baseline CPU level that had drifted up to 30 to 40% over months, leaving no headroom. The deeper cause was cache-refresh queries (rate limit, sampling, blacklist) that fanned out into the thousands of concurrent queries instead of running one at a time. The team shipped a cache fix to coalesce refreshes and upgraded the RDS instance from m4.xlarge to m5.2xlarge during a brief maintenance window on October 12, leaving service stable.
Impact
Two partial API outages: roughly 54 minutes on October 4 (21:02 to 21:56 UTC) and 62 minutes on October 11 (15:00 to 16:02 UTC), plus a planned ~2-minute interruption on October 12 (04:38 to 04:40 UTC) for the RDS upgrade. During the partial outages, concurrent in-flight requests spiked into the thousands instead of the normal 1 to 3.
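As a quick sanity check on these windows, the arithmetic can be reproduced with a few lines of Python. The variable names are illustrative, and reading the header's 7d 7h 38m duration as "start of the first outage to end of the maintenance window" is an inference from the numbers, not something the source states.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# (start, end) of each window, taken from the times quoted above.
windows = {
    "oct4_outage": (datetime(2018, 10, 4, 21, 2, tzinfo=UTC),
                    datetime(2018, 10, 4, 21, 56, tzinfo=UTC)),
    "oct11_outage": (datetime(2018, 10, 11, 15, 0, tzinfo=UTC),
                     datetime(2018, 10, 11, 16, 2, tzinfo=UTC)),
    "oct12_maintenance": (datetime(2018, 10, 12, 4, 38, tzinfo=UTC),
                          datetime(2018, 10, 12, 4, 40, tzinfo=UTC)),
}

# Length of each window in whole minutes.
minutes = {
    name: int((end - start).total_seconds() // 60)
    for name, (start, end) in windows.items()
}

# Assumed reading of the overall duration: first outage start to maintenance end.
span = windows["oct12_maintenance"][1] - windows["oct4_outage"][0]

print(minutes)
print(span)
```

Under that reading, actual customer-facing disruption was about two hours spread across a span of a little over a week.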
Root cause
Production RDS CPU baseline had drifted up to 30 to 40% as usage grew, leaving no headroom for normal variability.
Cache-refresh queries for rate limit, sampling, and blacklist data were not coalesced: each request that found the cache stale would issue its own DB query, allowing thousands of identical queries to pile up and amplify any pressure on RDS.
Several conveniently timed minor events (a SQL surgery, a deploy with a small migration, a user creating about 5,000 teams) each contributed marginally to tipping the database over, but none individually explained the failure.
Resolution
A cache fix was deployed so that bounce information (rate limit, sampling, blacklist) was refreshed by a single query at a time rather than fanning out across all in-flight requests, dropping concurrent in-flight refresh queries from thousands to one. An emergency maintenance window then upgraded the production RDS instance from m4.xlarge to m5.2xlarge with about two minutes of total interruption.
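The coalescing idea can be sketched as follows. This is a minimal illustration of the general single-flight pattern, not Honeycomb's actual code: `CoalescingCache`, `load_rate_limits`, and `run_demo` are hypothetical names, and the sleep stands in for a slow query against a loaded database.

```python
import threading
import time


class CoalescingCache:
    """Cache whose stale-value refreshes are coalesced (single-flighted)."""

    def __init__(self, loader, ttl_seconds):
        self._loader = loader                # hypothetical: runs the DB query
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._value = None
        self._expires_at = float("-inf")     # stale until the first load
        self._inflight = None                # Event for the refresh in progress

    def get(self):
        while True:
            with self._lock:
                if time.monotonic() < self._expires_at:
                    return self._value       # fresh: no query needed
                leader = self._inflight is None
                if leader:
                    # First caller to see a stale value runs the refresh.
                    self._inflight = threading.Event()
                done = self._inflight
            if not leader:
                # Everyone else waits for that refresh, then re-checks.
                done.wait()
                continue
            try:
                value = self._loader()       # the only query in flight
                with self._lock:
                    self._value = value
                    self._expires_at = time.monotonic() + self._ttl
                return value
            finally:
                with self._lock:
                    self._inflight = None
                done.set()                   # wake waiters even if the load failed


def run_demo(n_threads=50):
    calls = []                               # one entry per simulated DB query

    def load_rate_limits():
        # Hypothetical stand-in for the real query against MySQL.
        calls.append(1)
        time.sleep(0.05)
        return {"team-a": 100}

    cache = CoalescingCache(load_rate_limits, ttl_seconds=60)
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(cache.get()))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results


if __name__ == "__main__":
    query_count, results = run_demo()
    print(query_count, len(results))
```

In this sketch, fifty concurrent callers that all find the cache stale produce a single simulated query. Without the `_inflight` check, each of them would call the loader itself, which is the fan-out described under Root cause. If the loader raises, the error goes only to the caller that ran it, and one of the waiting callers retries, so refreshes still run one at a time.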
Lessons
- A high steady-state CPU baseline is a reliability bug in waiting; once you cross 30 to 40% baseline, ordinary variance starts producing outages.
- Cache misses on a hot path must be coalesced or single-flighted; otherwise every concurrent request becomes its own database query during exactly the moments the database can least afford it.
- Once the engineer-hours spent optimizing around a particular database problem cost more than a larger instance, the right answer is to spend the money on hardware rather than continuing to optimize.
- Multiple minor near-coincidences can each look like the trigger of an outage; the actual fragility is usually a structural condition like saturation or absent coalescing.
Action items
- Cache-refresh logic updated to coalesce concurrent refreshes into a single query at a time.
- RDS upgraded from m4.xlarge to m5.2xlarge during an emergency maintenance window.
- Documented miscellaneous stumbling points and on-call processes that caused confusion during the outage.
- Continued investment in instrumenting the system during outages to learn from each event.