SEV-1public access

Two partial API outages from RDS CPU saturation and runaway cache-refresh queries

Honeycomb · Source

Started: Oct 4, 2018
Duration: 7d 7h 38m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: API customers globally during two ~1-hour windows plus a 2-minute maintenance interruption
Services: mysql-rds, api, ingest

cachecapacity shortfalldatabasethundering herd

Join the waitlist

Aftermath helps you ship structured post-mortems like this one for your own incidents. Encore keeps narrative, timeline, lessons, and action items in one place so the document stays useful after the incident is closed. Join the waitlist on the homepage when you want that workflow for your organization.

Join the waitlist

Summary

On October 4, 2018, Honeycomb suffered a partial API outage as RDS MySQL stalled at roughly 90% CPU. A second, less severe incident hit on October 11. In both cases the proximate trigger was a small event coinciding with a baseline CPU level that had drifted up to 30 to 40% over months, leaving no headroom. The deeper cause was cache-refresh queries (rate limit, sampling, blacklist) that fanned out into the thousands of concurrent queries instead of running one at a time. The team shipped a cache fix to coalesce refreshes and upgraded the RDS instance from m4.xlarge to m5.2xlarge during a brief maintenance window on October 12, leaving service stable.

Impact

Two partial API outages: roughly 54 minutes on October 4 (21:02 to 21:56 UTC) and 62 minutes on October 11 (15:00 to 16:02 UTC), plus a planned ~2-minute interruption on October 12 (4:38 to 4:40 UTC) for the RDS upgrade. During the partial outages, concurrent in-flight requests spiked into the thousands instead of the normal 1 to 3.

Root cause

Production RDS CPU baseline had drifted up to 30 to 40% as usage grew, leaving no headroom for normal variability.

Cache-refresh queries for rate limit, sampling, and blacklist data were not coalesced: each request that found the cache stale would issue its own DB query, allowing thousands of identical queries to pile up and amplify any pressure on RDS.

Several conveniently-timed minor events (a SQL surgery, a deploy with a small migration, a user creating about 5,000 teams) each contributed marginally to tipping the database over but did not individually explain the failure.

Resolution

A cache fix was deployed so that bounce information (rate limit, sampling, blacklist) was refreshed by a single query at a time rather than fanning out across all in-flight requests, dropping concurrent in-flight refresh queries from thousands to one. An emergency maintenance window then upgraded the production RDS instance from m4.xlarge to m5.2xlarge with about two minutes of total interruption.

Timeline

00:00DETECT
RDS CPU baseline has been drifting upward for some time, idling around 30 to 40% with little headroom.
mysql-rds
21:02DETECT
Partial API outage begins. RDS CPU saturates near 90%; concurrent in-flight requests spike into the thousands.
mysql-rds
21:30INVEST
Team identifies several conveniently-timed minor events (SQL surgery, deploy with migration, 5,000-team creation) but none cleanly explains the spike.
mysql-rds
21:56RESOLV
First incident self-resolves; service returns to normal but root cause is not yet fixed.
mysql-rds
22:00MITIG
Initial remediation work begins to investigate cache and query patterns.
api
15:00DETECT
A similar but less severe partial outage recurs at roughly the same time of day.
mysql-rds
15:30INVEST
Concurrent in-flight requests again spike into the thousands; the team identifies cache-refresh fan-out as the dominant factor.
api
16:02RESOLV
Second incident self-resolves; the cache-refresh coalescing fix is finalized for deployment.
api
20:00MITIG
Cache-refresh coalescing fix deployed; concurrent refresh queries collapse from thousands to one at a time.
api
04:38MITIG
Emergency maintenance window begins to upgrade RDS from m4.xlarge to m5.2xlarge.
mysql-rds
04:40RESOLV
RDS upgrade complete; service fully restored on the larger instance.
mysql-rds

Attribution

Honeycomb

By Engineering

Published Oct 12, 2018

View original source

Lessons

A high steady-state CPU baseline is a reliability bug in waiting; once you cross 30 to 40% baseline, ordinary variance starts producing outages.
Cache misses on a hot path must be coalesced or single-flighted; otherwise every concurrent request becomes its own database query during exactly the moments the database can least afford it.
When you are at the volume where engineer-hours cost more than instance upgrades for a particular database problem, the right answer is to spend the money on hardware rather than continuing to optimize.
Multiple minor near-coincidences can each look like the trigger of an outage; the actual fragility is usually a structural condition like saturation or absent coalescing.

Action items

Cache-refresh logic updated to coalesce concurrent refreshes into a single query at a time.
RDS upgraded from m4.xlarge to m5.2xlarge during an emergency maintenance window.
Documented miscellaneous stumbling points and on-call processes that caused confusion during the outage.
Continued investment in instrumenting the system during outages to learn from each event.