SEV-0 · public access

Sudden RDS MySQL performance collapse causes near-total Honeycomb outage

Honeycomb

Started: May 3, 2018
Duration: 1d
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: global; nearly complete service outage
Services: mysql-rds, api, ingest

cascading failure · connection pool exhaustion · data loss · database · degraded performance

Summary

On May 3, 2018, the production RDS MySQL instance backing Honeycomb's API experienced a sudden and dramatic performance collapse, with P95 query time jumping from 11 ms to over 1000 ms in roughly 20 seconds while write throughput dropped from 780 ops per second to 5. The application stack reacted by saturating the connection pool with retries, and the service was almost entirely unavailable for approximately 24 hours. Only about 15% of incoming events were successfully stored during the outage, though customer-side buffering allowed many to be replayed afterward. The team initially worried that a Go application bug might be hammering the database, but later confirmed the root issue was at the database layer.

Impact

Honeycomb's API was almost entirely unavailable for about 24 hours. Only about 15% of incoming events were stored during the outage; the rest were rejected, though clients that buffered to disk were able to resubmit them once the service recovered.

Root cause

The RDS MySQL instance suffered a sudden, severe storage or compute degradation: P95 query time rose from 11 ms to over 1000 ms while write throughput collapsed from roughly 780 to 5 operations per second within about 20 seconds.

When database connections began failing, the application stack and Go libraries retried aggressively in a way that exhausted the connection pool and produced Error 1040 (too many connections), amplifying the original degradation.
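
To see why a slowdown of this size can exhaust a connection pool, a rough estimate with Little's law helps: queries in flight are approximately arrival rate times time per query. The sketch below plugs in the figures reported for this incident. It treats the P95 latency as if it were the mean and assumes the offered load stays at 780 operations per second; both are simplifying assumptions, not statements from the original report, and the function name is illustrative.

```go
package main

import "fmt"

// connectionsNeeded applies Little's law (L = lambda * W): the average number
// of queries in flight equals the arrival rate times the time each query takes.
func connectionsNeeded(opsPerSec, latencyMs float64) float64 {
	return opsPerSec * latencyMs / 1000.0
}

func main() {
	// Figures from the incident. Using P95 in place of mean latency is an
	// assumption made only to get an order-of-magnitude estimate.
	healthy := connectionsNeeded(780, 11)
	degraded := connectionsNeeded(780, 1000)

	fmt.Printf("healthy:  ~%.1f connections in flight\n", healthy)
	fmt.Printf("degraded: ~%.0f connections in flight\n", degraded)
}
```

Under those assumptions, demand grows from roughly 9 concurrent connections to roughly 780 before any retries are counted. Any pool or server limit below that figure would be reached quickly, which is consistent with the Error 1040 responses described above.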

Service-side recovery was slowed by uncertainty about whether the underlying cause was infrastructure or an application bug, leading to extra time spent reproducing failure modes safely before changes were made.

Resolution

After confirming through controlled reproduction that running out of connections was a symptom rather than the cause, the team focused on RDS-level recovery and brought the service back over the following hours. Customer telemetry that had been buffered client-side was successfully replayed once the API was healthy again.
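
The client-side buffering mentioned above can be sketched as a bounded queue that holds events while the API is down and replays them in order once sends succeed again. This is a minimal in-memory illustration, not Honeycomb's SDK: the report describes clients buffering to disk, and the names `Buffer`, `Add`, `Flush`, and `errUnavailable` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable stands in for a failed API call (hypothetical).
var errUnavailable = errors.New("api unavailable")

// Buffer holds undelivered events up to a fixed capacity.
type Buffer struct {
	events   []string
	capacity int
	Dropped  int // events discarded because the buffer was full
}

func NewBuffer(capacity int) *Buffer {
	return &Buffer{capacity: capacity}
}

// Len reports how many events are waiting to be sent.
func (b *Buffer) Len() int { return len(b.events) }

// Add stores an event. When the buffer is full, the oldest event is
// dropped so memory use stays bounded during a long outage.
func (b *Buffer) Add(ev string) {
	if b.capacity <= 0 {
		b.Dropped++
		return
	}
	if len(b.events) >= b.capacity {
		b.events = b.events[1:]
		b.Dropped++
	}
	b.events = append(b.events, ev)
}

// Flush sends buffered events oldest first and stops at the first error,
// leaving the unsent events in place. It returns the number sent.
func (b *Buffer) Flush(send func(string) error) int {
	sent := 0
	for len(b.events) > 0 {
		if err := send(b.events[0]); err != nil {
			break
		}
		b.events = b.events[1:]
		sent++
	}
	return sent
}

func main() {
	buf := NewBuffer(1000)
	apiUp := false
	send := func(ev string) error {
		if !apiUp {
			return errUnavailable
		}
		return nil
	}

	for i := 0; i < 3; i++ {
		buf.Add(fmt.Sprintf("event-%d", i))
	}
	fmt.Println("sent during outage:", buf.Flush(send))
	apiUp = true
	fmt.Println("sent after recovery:", buf.Flush(send))
}
```

A bounded buffer trades completeness for safety: during an outage as long as this one, a small capacity would drop the oldest events, so the size and the choice of memory versus disk matter in practice.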

Timeline

00:39 DETECT
RDS MySQL P95 query time goes from 11 ms to over 1000 ms in roughly 20 seconds; write throughput collapses from 780/s to 5/s.
mysql-rds
00:42 DETECT
API services begin failing as connection pools saturate; customers see widespread errors.
api
01:00 INVEST
Initial hypothesis that a Go library bug is hammering the database during connection failures is investigated alongside infrastructure theories.
api
03:00 INVEST
Team decides to reproduce the Error 1040 connection exhaustion in a controlled dogfood environment to isolate cause from symptom.
mysql-rds
08:00 INVEST
Reproduction confirms connection exhaustion is downstream of the database slowdown rather than its trigger; focus shifts to RDS itself.
mysql-rds
14:00 MITIG
RDS-level mitigations and recovery actions begin; partial service restoration starts to land.
mysql-rds
20:00 MITIG
API begins accepting traffic again at reduced capacity; about 15% of events have been stored over the outage window.
api
00:39 RESOLV
Service fully restored after roughly 24 hours; client-side buffered events begin replaying successfully.
api

Attribution

Honeycomb

By Engineering

Published May 4, 2018

Lessons

When a database goes from healthy to catastrophically slow within seconds, the application's retry behavior tends to convert a degradation into an outage; sane connection limits and circuit breakers matter more than perfect retry logic.
It is tempting to blame the network or the database first, but resisting that bias and reproducing the failure mode in a controlled environment yields a more reliable diagnosis.
Customer-side buffering can dramatically reduce data loss for telemetry pipelines and is worth making a first-class part of the SDK contract.
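
The circuit breaker idea in the first lesson can be sketched as follows. This is a minimal consecutive-failure breaker written for illustration, not the mechanism Honeycomb adopted; the names `Breaker`, `NewBreaker`, `ErrOpen`, and `errDBSlow`, and the thresholds used, are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned when the breaker rejects a call without trying it.
var ErrOpen = errors.New("circuit breaker open")

// errDBSlow stands in for a timeout or Error 1040 from the database (hypothetical).
var errDBSlow = errors.New("database too slow")

// Breaker opens after maxFailures consecutive failures and rejects calls
// until the cooldown has elapsed.
type Breaker struct {
	mu          sync.Mutex
	maxFailures int
	cooldown    time.Duration
	failures    int
	openedAt    time.Time
}

func NewBreaker(maxFailures, cooldownMillis int) *Breaker {
	return &Breaker{
		maxFailures: maxFailures,
		cooldown:    time.Duration(cooldownMillis) * time.Millisecond,
	}
}

// Do runs fn unless the breaker is open. Once the cooldown has passed, calls
// are let through again; a failure re-opens the breaker at once and a success
// resets the failure count. (This sketch does not limit the number of
// concurrent probe calls after the cooldown.)
func (b *Breaker) Do(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0
	return nil
}

func main() {
	b := NewBreaker(3, 5000)
	for i := 1; i <= 5; i++ {
		err := b.Do(func() error { return errDBSlow })
		fmt.Printf("call %d: %v\n", i, err)
	}
}
```

In this run the first three calls reach the failing function and the last two are rejected immediately with `ErrOpen`, so a struggling database stops receiving new work instead of being hit with retries.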

Action items

Multiple improvements to make outages of this kind less likely to cause data loss, and to recover faster if they recur.
Better instrumentation around connection pool behavior and Go client retry patterns.
Customer-facing apology and follow-up communication acknowledging data stewardship responsibilities.
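
For the connection pool instrumentation item above, one possible starting point in Go is the `sql.DBStats` snapshot returned by `DB.Stats()` in the standard `database/sql` package. The sketch below assumes the service uses `database/sql`, which the report does not state; `PoolReport`, `summarize`, and `fromStats` are hypothetical names, and the example values are invented for illustration.

```go
package main

import (
	"database/sql"
	"fmt"
)

// PoolReport is a small summary suitable for emitting as a metric or event.
type PoolReport struct {
	Utilization float64 // in-use connections divided by the pool limit
	AvgWaitMs   float64 // mean wait for a connection, over the process lifetime
	Saturated   bool    // true when every allowed connection is in use
}

// summarize derives the report from raw counters. A maxOpen of 0 means the
// pool is unlimited, so utilization and saturation are not meaningful.
func summarize(inUse, maxOpen int, waitCount int64, waitTotalMs float64) PoolReport {
	var r PoolReport
	if maxOpen > 0 {
		r.Utilization = float64(inUse) / float64(maxOpen)
		r.Saturated = inUse >= maxOpen
	}
	if waitCount > 0 {
		r.AvgWaitMs = waitTotalMs / float64(waitCount)
	}
	return r
}

// fromStats adapts a database/sql snapshot. WaitCount and WaitDuration are
// cumulative totals, so the average covers the whole life of the pool.
func fromStats(s sql.DBStats) PoolReport {
	return summarize(s.InUse, s.MaxOpenConnections, s.WaitCount,
		float64(s.WaitDuration.Milliseconds()))
}

func main() {
	// In a real service this value would come from db.Stats() on a *sql.DB.
	// A literal is used here so the sketch runs without a database.
	stats := sql.DBStats{
		MaxOpenConnections: 50,
		InUse:              50,
		WaitCount:          200,
		WaitDuration:       4000000000, // 4 seconds, in nanoseconds
	}
	fmt.Printf("%+v\n", fromStats(stats))
}
```

Sampling a report like this on an interval would show utilization and wait time climbing as the database slows, which may help distinguish pool exhaustion as a symptom from an application bug as a cause.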