SEV-0 · public access

Sudden RDS MySQL performance collapse causes near-total Honeycomb outage

Honeycomb · Source

Started: May 3, 2018
Duration: 1d
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: global; nearly complete service outage
Services: mysql-rds, api, ingest
Tags: cascading failure, connection pool exhaustion, data loss, database, degraded performance

Summary

On May 3, 2018, the production RDS MySQL instance backing Honeycomb's API experienced a sudden and dramatic performance collapse, with P95 query time jumping from 11 ms to over 1000 ms in roughly 20 seconds while write throughput dropped from 780 ops per second to 5. The application stack reacted by saturating the connection pool with retries, and the service was almost entirely unavailable for approximately 24 hours. Only about 15% of incoming events were successfully stored during the outage, though customer-side buffering allowed many to be replayed afterward. The team initially worried that a Go application bug might be hammering the database, but later confirmed the root issue was at the database layer.

Impact

Honeycomb's API was almost entirely unavailable for about 24 hours. Around 15% of events were stored during the outage; the rest were rejected, though clients that buffered to disk were able to resubmit successfully once the service recovered.

Root cause

The RDS MySQL instance suffered a sudden, severe storage or compute degradation: P95 query time rose from 11 ms to over 1000 ms while write throughput collapsed from roughly 780 to 5 operations per second within about 20 seconds.

When database connections began failing, the application stack and Go libraries retried aggressively, which exhausted the connection pool and produced MySQL Error 1040 ("Too many connections"), amplifying the original degradation.
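The amplification described above is what bounded retries are meant to prevent. The following is a minimal sketch, not Honeycomb's actual code: it caps the number of attempts and spaces them with jittered exponential backoff, so a struggling database receives a fixed, small number of calls per request. The names `retryBounded`, `backoffCeiling`, `attemptsAgainstDeadDB`, and `errDBUnavailable` are hypothetical, as are the delay values.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errDBUnavailable is a hypothetical stand-in for a failed database call.
var errDBUnavailable = errors.New("database unavailable")

// backoffCeiling returns the upper bound on the wait that follows failed
// attempt number `attempt` (0-based): base * 2^attempt, capped at maxDelay.
func backoffCeiling(attempt int, base, maxDelay time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

// retryBounded runs op at most maxAttempts times. Between attempts it sleeps
// for a random duration in [0, ceiling] so that many clients do not retry in
// lockstep. It returns the number of attempts made and the last error.
func retryBounded(ctx context.Context, maxAttempts int, base, maxDelay time.Duration, op func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return attempt + 1, nil
		}
		if attempt == maxAttempts-1 {
			break // budget spent: surface the error instead of retrying forever
		}
		ceiling := backoffCeiling(attempt, base, maxDelay)
		delay := time.Duration(rand.Int63n(int64(ceiling) + 1))
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return attempt + 1, ctx.Err()
		}
	}
	return maxAttempts, err
}

// attemptsAgainstDeadDB reports how many calls an always-failing operation
// receives under the given attempt budget.
func attemptsAgainstDeadDB(maxAttempts int) int {
	calls := 0
	retryBounded(context.Background(), maxAttempts, time.Millisecond, 4*time.Millisecond, func() error {
		calls++
		return errDBUnavailable
	})
	return calls
}

func main() {
	fmt.Println("calls made against a dead database:", attemptsAgainstDeadDB(3))
	fmt.Println("wait ceiling after the 4th failure:", backoffCeiling(3, 10*time.Millisecond, 50*time.Millisecond))
}
```

In a real service this would typically be paired with a hard cap on the pool itself, for example via `SetMaxOpenConns` on a `database/sql` handle, so that retries queue inside the application rather than opening new connections.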

Service-side recovery was slowed by uncertainty about whether the underlying cause was infrastructure or an application bug, leading to extra time spent reproducing failure modes safely before changes were made.

Resolution

After confirming through controlled reproduction that running out of connections was a symptom rather than the cause, the team focused on RDS-level recovery and brought the service back over the following hours. Customer telemetry that had been buffered client-side was successfully replayed once the API was healthy again.
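The buffer-and-replay behavior that limited data loss can be illustrated with a small sketch. This is an assumption-laden simplification, not the behavior of any Honeycomb SDK: it buffers in memory rather than on disk, and `Event`, `BufferedSender`, `NewBufferedSender`, and `errUnavailable` are hypothetical names. Events that cannot be delivered are held in order, the oldest are dropped once a limit is reached, and `Replay` resubmits them when the API is healthy again.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable is a hypothetical error returned while the API is down.
var errUnavailable = errors.New("api unavailable")

// Event is a hypothetical telemetry event.
type Event struct {
	Payload string
}

// BufferedSender holds undelivered events in memory, up to limit entries.
type BufferedSender struct {
	send    func(Event) error
	pending []Event
	limit   int
	Dropped int // events discarded because the buffer was full
}

func NewBufferedSender(limit int, send func(Event) error) *BufferedSender {
	return &BufferedSender{send: send, limit: limit}
}

// Send delivers the event immediately when nothing is queued ahead of it;
// otherwise, or on failure, the event is buffered to preserve ordering.
func (b *BufferedSender) Send(e Event) {
	if len(b.pending) == 0 {
		if err := b.send(e); err == nil {
			return
		}
	}
	if len(b.pending) >= b.limit && len(b.pending) > 0 {
		b.pending = b.pending[1:] // drop the oldest event to make room
		b.Dropped++
	}
	if b.limit > 0 {
		b.pending = append(b.pending, e)
	} else {
		b.Dropped++ // a zero-sized buffer cannot hold anything
	}
}

// Replay resubmits buffered events in order and stops at the first failure.
// It returns the number of events delivered.
func (b *BufferedSender) Replay() int {
	sent := 0
	for len(b.pending) > 0 {
		if err := b.send(b.pending[0]); err != nil {
			break
		}
		b.pending = b.pending[1:]
		sent++
	}
	return sent
}

func main() {
	healthy := false
	s := NewBufferedSender(100, func(e Event) error {
		if !healthy {
			return errUnavailable
		}
		fmt.Println("stored:", e.Payload)
		return nil
	})
	s.Send(Event{Payload: "during-outage-1"})
	s.Send(Event{Payload: "during-outage-2"})
	healthy = true // the API recovers
	fmt.Println("replayed:", s.Replay())
}
```

A disk-backed buffer, as the clients in this incident used, survives process restarts; the in-memory version above only shows the ordering and replay logic.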

Lessons

  • When a database goes from healthy to catastrophically slow within seconds, the application's retry behavior tends to convert a degradation into an outage; sane connection limits and circuit breakers matter more than perfect retry logic.
  • It is tempting to blame the network or the database first. Even when the database does turn out to be at fault, as it did here, resisting that bias and reproducing the failure mode in a controlled environment yields a more reliable diagnosis.
  • Customer-side buffering can dramatically reduce data loss for telemetry pipelines and is worth making a first-class part of the SDK contract.
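The circuit breaker mentioned in the first lesson can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism Honeycomb adopted: `Breaker`, `NewBreaker`, `ErrOpen`, and `errDBDown` are hypothetical names, and the threshold and cooldown values are arbitrary. After a run of consecutive failures the breaker fails fast without touching the database, then lets calls through again once the cooldown has elapsed.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned when the breaker rejects a call without running it.
var ErrOpen = errors.New("circuit open: failing fast")

// errDBDown is a hypothetical stand-in for a failed database call.
var errDBDown = errors.New("database down")

// Breaker opens after `threshold` consecutive failures and stays open
// for `cooldown`, during which calls are rejected immediately.
type Breaker struct {
	mu        sync.Mutex
	threshold int
	cooldown  time.Duration
	failures  int
	open      bool
	openedAt  time.Time
	now       func() time.Time // injectable clock
}

func NewBreaker(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown, now: time.Now}
}

// Do runs op unless the breaker is open and still cooling down.
// Simplification: once the cooldown elapses, every caller may probe the
// dependency; a stricter breaker would admit a single trial call.
func (b *Breaker) Do(op func() error) error {
	b.mu.Lock()
	if b.open && b.now().Sub(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen
	}
	b.mu.Unlock()

	err := op()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.open = true
			b.openedAt = b.now() // (re)start the cooldown window
		}
		return err
	}
	// A success closes the breaker and clears the failure count.
	b.failures = 0
	b.open = false
	return nil
}

func main() {
	b := NewBreaker(3, 30*time.Second)
	for i := 1; i <= 5; i++ {
		err := b.Do(func() error { return errDBDown })
		fmt.Printf("call %d: %v\n", i, err)
	}
}
```

The design choice here is that rejected calls cost nothing at the database, which is the property that keeps a degradation from becoming connection exhaustion.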

Action items

  • Multiple improvements to make outages of this kind less likely to cause data loss, and to recover faster if they recur.
  • Better instrumentation around connection pool behavior and Go client retry patterns.
  • Customer-facing apology and follow-up communication acknowledging data stewardship responsibilities.
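As one possible shape for the connection pool instrumentation item, Go's standard `database/sql` package exposes pool counters through `DB.Stats()`, which returns a `sql.DBStats` value. The sketch below derives a few signals from such a value. `PoolSnapshot`, `summarize`, and `exampleStats` are hypothetical names, and the sample numbers are invented for illustration; nothing here is taken from Honeycomb's telemetry.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// PoolSnapshot is a hypothetical summary of pool health derived from sql.DBStats.
type PoolSnapshot struct {
	Utilization float64       // InUse / MaxOpenConnections; 0 when the pool is unlimited
	Waits       int64         // cumulative number of times a caller waited for a connection
	AvgWait     time.Duration // cumulative wait time divided by Waits
}

// summarize converts raw pool counters into the snapshot above.
// WaitCount and WaitDuration are cumulative since the pool was created,
// so a real exporter would usually report deltas between samples.
func summarize(s sql.DBStats) PoolSnapshot {
	snap := PoolSnapshot{Waits: s.WaitCount}
	if s.MaxOpenConnections > 0 {
		snap.Utilization = float64(s.InUse) / float64(s.MaxOpenConnections)
	}
	if s.WaitCount > 0 {
		snap.AvgWait = s.WaitDuration / time.Duration(s.WaitCount)
	}
	return snap
}

// exampleStats returns invented counters; in a service these would come
// from calling Stats() on an open *sql.DB at a regular interval.
func exampleStats() sql.DBStats {
	return sql.DBStats{
		MaxOpenConnections: 40,
		OpenConnections:    32,
		InUse:              30,
		Idle:               2,
		WaitCount:          4,
		WaitDuration:       2 * time.Second,
	}
}

func main() {
	snap := summarize(exampleStats())
	fmt.Printf("utilization=%.2f waits=%d avg_wait=%s\n",
		snap.Utilization, snap.Waits, snap.AvgWait)
}
```

Rising utilization together with a growing wait count is the kind of signal that would show pool exhaustion building before Error 1040 appears.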