SEV-1 · public access

Single-customer traffic surge saturates Cloudflare-AWS us-east-1 peering

Cloudflare · Source

Started: Aug 21, 2025
Duration: 3h 51m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: Customers with origins in AWS us-east-1, primarily traffic transiting Cloudflare's Ashburn (IAD) edge
Services: network, peering, ashburn-edge, cdn

Tags: bgp misconfiguration, capacity shortfall, customer-facing, network, traffic spike

Summary

At 16:27 UTC, a single Cloudflare customer began pulling cached objects from AWS us-east-1 at a rate that doubled total Cloudflare-to-AWS traffic and saturated all direct peering links into us-east-1. AWS tried to relieve the congestion by withdrawing BGP advertisements over the saturated links, which rerouted traffic to an offsite peering switch that then saturated as well. Two pre-existing infrastructure conditions made the impact worse: one direct peering link was at half capacity due to a known failure, and the Data Center Interconnect to the offsite switch was awaiting a capacity upgrade. After three hours of manual traffic engineering between Cloudflare and AWS, plus rate limiting of the customer, congestion on the affected paths cleared at 19:27 UTC and customer impact ended at 20:18 UTC.

Impact

Traffic between Cloudflare and AWS us-east-1 saw high latency, packet loss, and connection failures for 3 hours 51 minutes (16:27 to 20:18 UTC). Customers with origins in us-east-1 routed through Cloudflare experienced degraded performance and elevated 5xx errors. Global Cloudflare services and other regions were not affected.

Root cause

A single customer suddenly started pulling cached objects from us-east-1 at a rate that saturated all direct peering connections between Cloudflare and AWS.

AWS withdrew BGP advertisements over the saturated private network interconnects (PNIs) to relieve congestion; this rerouted traffic onto a backup path through an offsite peering switch that was also undersized for the load.

One direct peering link to AWS was already operating at half capacity due to an unresolved hardware failure.

The Data Center Interconnect between Cloudflare's edge routers and the offsite peering switch was scheduled for a capacity upgrade that had not yet been completed.

There was no per-customer network-resource budget, so a single customer's traffic could degrade service for every other customer sharing the same paths.
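
The way these factors compound can be illustrated with a toy load-shift model. This is a minimal sketch, not a description of either network: every capacity and traffic figure below is hypothetical (the source discloses none), and `Path`, `withdraw`, and `total_dropped` are names invented for the example. It only shows that moving load off a saturated path onto a smaller backup path can increase total drops.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """A network path with a fixed capacity and some offered load."""
    name: str
    capacity_gbps: float
    offered_gbps: float = 0.0

    def dropped_gbps(self) -> float:
        # Anything offered beyond capacity is treated as dropped.
        return max(0.0, self.offered_gbps - self.capacity_gbps)


def withdraw(src: Path, dst: Path, fraction: float) -> None:
    """Shift `fraction` of src's load to dst, as a prefix withdrawal would."""
    moved = src.offered_gbps * fraction
    src.offered_gbps -= moved
    dst.offered_gbps += moved


def total_dropped(paths) -> float:
    return sum(p.dropped_gbps() for p in paths)


# Hypothetical numbers: direct links degraded from 400 to 350 Gbps by the
# half-capacity link, and a surge that doubled offered load from 200 to 400.
direct = Path("direct PNIs", capacity_gbps=350.0, offered_gbps=400.0)
# Hypothetical backup path, limited by the not-yet-upgraded DCI.
backup = Path("offsite switch via DCI", capacity_gbps=100.0)

before = total_dropped([direct, backup])
withdraw(direct, backup, fraction=0.5)
after = total_dropped([direct, backup])

print(f"dropped before withdrawal: {before} Gbps")
print(f"dropped after withdrawal:  {after} Gbps")
```

Under these assumed figures the direct path stops dropping traffic after the withdrawal, but the backup path drops more than the direct path did before, so the mitigation makes the aggregate worse.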

Resolution

Cloudflare engineers and their AWS counterparts coordinated manual traffic engineering and rate limiting of the source customer to bring congestion down. AWS began reverting the BGP withdrawals at 19:45 UTC and finished normalizing prefix announcements at 20:07 UTC. Customer-impacting latency cleared by 20:18 UTC.

Timeline

16:27 UTC [DETECT, peering] Single-customer traffic surge from AWS us-east-1 begins, doubling total Cloudflare-to-AWS traffic; direct peering links begin saturating.
16:37 UTC [MITIG, bgp] AWS begins withdrawing BGP prefixes from Cloudflare on congested PNI sessions to attempt to relieve pressure.
16:44 UTC [DETECT, ashburn-edge] Cloudflare network team is alerted to internal congestion in Ashburn (IAD).
16:45 UTC [INVEST, bgp] Network team evaluates response options; AWS prefixes are unavailable on uncongested paths because of the BGP withdrawals.
17:22 UTC [DETECT, peering] AWS BGP withdrawals push more traffic onto the offsite peering switch; dropped traffic increases.
17:45 UTC [INVEST, ashburn-edge] Customer-impact incident is formally raised for Ashburn (IAD).
19:05 UTC [MITIG, network] Rate limiting against the source customer reduces traffic and decreases congestion.
19:27 UTC [MITIG, network] Additional traffic-engineering actions by the network team fully resolve congestion on the affected paths.
19:45 UTC [MITIG, bgp] AWS begins reverting BGP prefix withdrawals as requested by Cloudflare.
20:07 UTC [MITIG, bgp] AWS finishes normalizing BGP prefix announcements over IAD PNIs.
20:18 UTC [RESOLV, network] Long tail of latency from prefix renormalization clears; impact ends.
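
The intervals implied by the timeline can be derived mechanically. The sketch below is only an illustration: the timestamps are copied from the entries above, while the short event labels and the helper `minutes_since_start` are assumptions made for the example. It also assumes all events fall on the same UTC day.

```python
from datetime import datetime

# (time in UTC, phase, short label) taken from the timeline entries above.
TIMELINE = [
    ("16:27", "DETECT", "traffic surge begins"),
    ("16:44", "DETECT", "network team alerted"),
    ("17:45", "INVEST", "customer-impact incident raised"),
    ("19:05", "MITIG", "rate limiting takes effect"),
    ("19:27", "MITIG", "congestion resolved on affected paths"),
    ("20:18", "RESOLV", "impact ends"),
]


def minutes_since_start(timeline):
    """Map each event label to whole minutes elapsed since the first entry."""
    start = datetime.strptime(timeline[0][0], "%H:%M")
    return {
        label: int((datetime.strptime(hhmm, "%H:%M") - start).total_seconds() // 60)
        for hhmm, _phase, label in timeline
    }


elapsed = minutes_since_start(TIMELINE)
for label, minutes in elapsed.items():
    print(f"+{minutes // 60}h {minutes % 60:02d}m  {label}")
```

The final entry works out to 3h 51m, matching the stated duration, and the gap between the surge starting and the network team being alerted is 17 minutes.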

Attribution

Cloudflare

By Network

Published Aug 21, 2025

Lessons

Without per-customer traffic budgets, a single customer's legitimate traffic surge becomes everyone's outage on shared paths.
Coordinated mitigations across two networks can compound: when AWS withdrew prefixes to relieve congestion, the rerouted traffic saturated the backup path.
Pre-existing capacity issues (a half-capacity link and an overdue DCI upgrade) consume the headroom that would otherwise absorb a surge and give responders room to act.
Manual traffic engineering between two networks is slow under pressure; incidents like this surface the gap between human-paced response and machine-paced traffic.
BGP traffic engineering between peers should be coordinated, not reflexive, because each side's mitigation can land on the other's backup capacity.

Action items

Develop a mechanism to selectively deprioritize a single customer's traffic when it begins to congest the network.
Expedite Data Center Interconnect upgrades to provide network capacity significantly above current levels.
Coordinate with AWS so each side's BGP traffic engineering does not conflict with the other's.
Build a per-customer network resource budget system that prevents one customer's traffic from degrading service for others.
Automate manual traffic-engineering actions taken during this incident so future congestion events resolve faster.
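
One common way to approach the per-customer budget and selective deprioritization items is a token bucket per customer. The sketch below is an assumption about how such a mechanism could look, not a description of what Cloudflare built or plans to build: `TokenBucket`, `CustomerBudgets`, the customer IDs, and all rates are hypothetical. Traffic within a customer's budget is classified as normal; traffic beyond it is marked for deprioritization rather than dropped outright.

```python
class TokenBucket:
    """Byte-based token bucket with an explicit clock for determinism."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float, now: float):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last_ts = now

    def allow(self, size_bytes: float, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last_ts)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_ts = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False


class CustomerBudgets:
    """Keeps one bucket per customer so one customer cannot spend another's budget."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.buckets = {}

    def classify(self, customer_id: str, size_bytes: float, now: float) -> str:
        if customer_id not in self.buckets:
            self.buckets[customer_id] = TokenBucket(self.rate, self.burst, now)
        within_budget = self.buckets[customer_id].allow(size_bytes, now)
        return "normal" if within_budget else "deprioritized"


# Hypothetical budget: 1000 bytes/s sustained, 2000 bytes of burst.
budgets = CustomerBudgets(rate_bytes_per_s=1000.0, burst_bytes=2000.0)
results = [
    budgets.classify("surging-customer", 1500, now=0.0),  # fits in the burst
    budgets.classify("surging-customer", 1500, now=0.0),  # exceeds the remainder
    budgets.classify("steady-customer", 500, now=0.0),    # separate bucket
    budgets.classify("surging-customer", 1500, now=1.0),  # refilled after 1 s
]
print(results)
```

Marking excess traffic as deprioritized instead of rejecting it leaves the final decision to the queueing layer, which can still carry that traffic when the shared path has spare capacity.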