Single-customer traffic surge saturates Cloudflare-AWS us-east-1 peering
Cloudflare
- Started: Aug 21, 2025
- Duration: 3h 51m
- Users affected: Not disclosed
- Revenue impact: Not disclosed
- Blast radius: Customers with origins in AWS us-east-1, primarily traffic transiting Cloudflare's Ashburn (IAD) edge
- Services: network, peering, ashburn-edge, cdn
Summary
At 16:27 UTC on Aug 21, 2025, a single Cloudflare customer began pulling cached objects from AWS us-east-1 at a rate that doubled total Cloudflare-to-AWS traffic and saturated every direct peering link into us-east-1. AWS tried to relieve the congestion by withdrawing BGP advertisements over the saturated links, which shifted traffic to an offsite peering switch that promptly saturated as well. Two pre-existing infrastructure conditions made the impact worse: one direct peering link was running at half capacity because of a known failure, and the Data Center Interconnect (DCI) to the offsite switch was awaiting a capacity upgrade. Cloudflare and AWS spent more than three hours on manual traffic engineering and rate-limited the customer; congestion fully resolved at 20:18 UTC.
Impact
Traffic between Cloudflare and AWS us-east-1 saw high latency, packet loss, and connection failures for nearly four hours (16:27 to 20:18 UTC). Customers with origins in us-east-1 routed through Cloudflare experienced degraded performance and elevated 5xx errors. Global Cloudflare services and other regions were not affected.
Root cause
A single customer suddenly started pulling cached objects from us-east-1 at a rate that saturated all direct peering connections between Cloudflare and AWS.
AWS withdrew BGP advertisements over the saturated private network interconnects (PNIs) to relieve congestion; this rerouted traffic onto a backup path through an offsite peering switch that was also undersized for the load.
One direct peering link to AWS was already operating at half capacity due to an unresolved hardware failure.
The Data Center Interconnect between Cloudflare's edge routers and the offsite peering switch was scheduled for a capacity upgrade that had not yet been completed.
There was no per-customer network-resource budget, so a single customer's traffic could degrade service for every other customer sharing the same paths.
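The interaction between these causes can be sketched with a toy capacity model. Every number below is hypothetical (the report does not publish link capacities or traffic volumes), and the model assumes that demand spreads evenly over whatever paths are advertised and that withdrawing the direct paths moves all traffic to the offsite switch. It is an illustration of the mechanism, not a reconstruction of the real topology.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    capacity_gbps: float      # hypothetical figure, not from the incident report
    advertised: bool = True   # False models a withdrawn BGP advertisement


def load_ratio(paths, demand_gbps):
    """Offered load divided by the capacity of all advertised paths.

    A ratio above 1.0 means the advertised paths are saturated.
    """
    usable = sum(p.capacity_gbps for p in paths if p.advertised)
    if usable == 0:
        raise ValueError("no advertised path left to carry traffic")
    return demand_gbps / usable


def scenario(degraded_link_gbps, demand_gbps):
    """Return (ratio on direct links, ratio after falling back to the offsite switch)."""
    direct = [Path("pni-a", 100.0), Path("pni-b", degraded_link_gbps)]
    backup = Path("offsite-switch", 40.0, advertised=False)
    before = load_ratio(direct + [backup], demand_gbps)

    # Withdraw the direct paths; traffic falls back to the offsite switch.
    for path in direct:
        path.advertised = False
    backup.advertised = True
    after = load_ratio(direct + [backup], demand_gbps)
    return before, after


# Hypothetical surge: demand doubles from 80 to 160 Gbps.
degraded_before, degraded_after = scenario(degraded_link_gbps=50.0, demand_gbps=160.0)
healthy_before, _ = scenario(degraded_link_gbps=100.0, demand_gbps=160.0)
print(f"half-capacity link: {degraded_before:.2f} direct, {degraded_after:.2f} on backup")
print(f"healthy link:       {healthy_before:.2f} direct")
```

With these made-up figures the direct links are oversubscribed only when one of them is degraded, and the fallback path is oversubscribed several times over in either case, which mirrors the way the withdrawal moved the congestion rather than removing it.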
Resolution
Cloudflare engineers and their AWS counterparts coordinated manual traffic engineering and rate-limited the customer generating the surge to bring congestion down. AWS began reverting the BGP withdrawals at 19:45 UTC, and Cloudflare confirmed BGP normalization at 20:07 UTC. Customer-impacting latency cleared by 20:18 UTC.
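As a quick check on the timeline, the short script below computes how long after the start of the surge each reported milestone occurred. It assumes all four timestamps fall on the same UTC day (Aug 21, 2025); the event labels are paraphrases chosen for this example.

```python
from datetime import datetime

DAY = "2025-08-21"  # assumption: every timestamp is on the same UTC day

events = {
    "surge begins": "16:27",
    "AWS starts reverting BGP withdrawals": "19:45",
    "BGP normalization confirmed": "20:07",
    "latency cleared": "20:18",
}


def at(hhmm):
    """Parse an HH:MM timestamp on the incident day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


start = at(events["surge begins"])
elapsed = {name: at(hhmm) - start for name, hhmm in events.items()}

for name, delta in elapsed.items():
    minutes = int(delta.total_seconds() // 60)
    print(f"{name}: +{minutes // 60}h {minutes % 60:02d}m")
# surge begins: +0h 00m
# AWS starts reverting BGP withdrawals: +3h 18m
# BGP normalization confirmed: +3h 40m
# latency cleared: +3h 51m
```

The final figure matches the reported duration of 3h 51m, and it shows that the BGP reverts only began more than three hours into the incident.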
Lessons
- Without per-customer traffic budgets, a single customer's legitimate traffic surge becomes everyone's outage on shared paths.
- Mitigations applied independently on two networks can compound: when AWS withdrew prefixes to relieve congestion, the rerouted traffic saturated the backup path.
- Pre-existing capacity issues (a half-capacity link and an overdue DCI upgrade) use up the headroom that would otherwise absorb a surge.
- Manual traffic engineering between two networks is slow under pressure; incidents like this surface the gap between human-paced response and machine-paced traffic.
- BGP traffic engineering between peers should be coordinated, not reflexive, because each side's mitigation can land on the other's backup capacity.
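The first lesson can be made concrete by comparing two ways a congested shared link might divide its capacity. The sketch below uses invented customer names and demands, and contrasts demand-proportional sharing (roughly what happens with no budget) with max-min fair sharing, a standard allocation scheme. The report does not say which policy Cloudflare intends to adopt, so this is only one possible reading.

```python
def proportional_share(demands, capacity):
    """No budget: a congested link serves each customer in proportion to demand."""
    total = sum(demands.values())
    if total <= capacity:
        return dict(demands)
    return {customer: demand * capacity / total for customer, demand in demands.items()}


def max_min_fair_share(demands, capacity):
    """Water-filling: small demands are met in full, the rest split what remains."""
    allocation = {}
    remaining = capacity
    pending = sorted(demands.items(), key=lambda item: item[1])
    while pending:
        fair = remaining / len(pending)
        customer, demand = pending[0]
        if demand <= fair:
            # This customer needs less than an equal split, so serve it fully.
            allocation[customer] = demand
            remaining -= demand
            pending.pop(0)
        else:
            # Everyone left wants more than an equal split, so cap them all.
            for name, _ in pending:
                allocation[name] = fair
            break
    return allocation


# Hypothetical demands in Gbps on a 100 Gbps shared link.
demands = {"surging": 100, "a": 10, "b": 10, "c": 10}
unbudgeted = proportional_share(demands, capacity=100)
fair = max_min_fair_share(demands, capacity=100)
print(unbudgeted)
print(fair)
```

Under proportional sharing every customer loses the same fraction of its traffic, so the three small customers are degraded by a surge they did not cause. Under max-min fairness they keep their full 10 Gbps and only the surging customer is capped, at 70 Gbps in this example.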
Action items
- Develop a mechanism to selectively deprioritize a single customer's traffic when it begins to congest the network.
- Expedite Data Center Interconnect upgrades to provide network capacity significantly above current levels.
- Coordinate with AWS so each side's BGP traffic engineering does not conflict with the other's.
- Build a per-customer network resource budget system that prevents one customer's traffic from degrading service for others.
- Automate manual traffic-engineering actions taken during this incident so future congestion events resolve faster.
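One common building block for the per-customer budget and selective deprioritization items is a token bucket per customer. The sketch below is a minimal illustration under that assumption. The class name, the rate and burst parameters, and the "normal" and "low-priority" labels are all hypothetical, and the report does not describe the mechanism Cloudflare actually plans to build. Traffic over budget is demoted rather than dropped, so it still flows when the path has spare capacity.

```python
import time


class CustomerBudget:
    """Token bucket that meters one customer's bytes on a shared path."""

    def __init__(self, rate_bytes_per_s, burst_bytes, clock=time.monotonic):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def classify(self, size_bytes):
        """Return "normal" while within budget, "low-priority" once it is spent."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return "normal"
        return "low-priority"


class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
budget = CustomerBudget(rate_bytes_per_s=1000, burst_bytes=2000, clock=clock)

first = budget.classify(1500)    # fits in the initial burst
second = budget.classify(1500)   # only 500 tokens left, so it is demoted
clock.now += 1.0                 # one second of refill adds 1000 tokens
third = budget.classify(1500)    # 1500 tokens available again
print(first, second, third)
```

A production system would need one bucket per customer per shared path, and a queueing discipline on the routers that actually honors the low-priority marking. Both are outside the scope of this sketch.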