SEV-1 · public access

BYOIP prefixes withdrawn after Addressing API cleanup task misinterprets empty filter parameter

Cloudflare · Source

Started
Feb 20, 2026
Duration
5h 7m
Users affected
Not disclosed
Revenue impact
Not disclosed
Blast radius
BYOIP customers globally (about 25% of Cloudflare's BYOIP prefixes)
Services
addressing-api, bgp, byoip, magic-transit, spectrum, cdn, 1.1.1.1
bgp misconfiguration, configuration error, deploy, human error, network, rollback

Summary

An automated cleanup sub-task in Cloudflare's Addressing API queried the API with an empty pending_delete parameter, which the server interpreted as a request for all BYOIP prefixes. The task then began systematically deleting the returned prefixes and their service bindings, withdrawing about 1,100 prefixes from BGP. Engineers stopped the runaway sub-task within 50 minutes, but full restoration took just over five hours (until 23:03 UTC) because some customers' service bindings had been stripped from edge servers and required a global configuration rollout to repair.

Impact

About 1,100 BYOIP prefixes were withdrawn from BGP, leaving affected customer services unreachable and triggering BGP path hunting, in which end-user connections looped between networks looking for a route until they timed out. Cloudflare products that depend on BYOIP advertisement (CDN, Spectrum, Magic Transit, Dedicated Egress) failed to attract traffic, and the one.one.one.one site returned HTTP 403 errors.

Root cause

An automated cleanup sub-task issued a GET request with an empty pending_delete query parameter; the API server treated the empty string as if the filter were absent and returned all BYOIP prefixes, which the task then queued for deletion.
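The failure mode can be illustrated with a minimal Python sketch. Only the pending_delete parameter name comes from the incident description; the function names, the in-memory prefix list, and the query-string handling are hypothetical and are not Cloudflare's actual code.

```python
from urllib.parse import parse_qs

# Hypothetical stand-in for the prefix table (documentation address ranges).
PREFIXES = [
    {"cidr": "192.0.2.0/24", "pending_delete": False},
    {"cidr": "198.51.100.0/24", "pending_delete": True},
    {"cidr": "203.0.113.0/24", "pending_delete": False},
]


def list_prefixes_lenient(query_string: str) -> list:
    """Buggy pattern: an empty pending_delete is handled like a missing one."""
    params = parse_qs(query_string, keep_blank_values=True)
    value = params.get("pending_delete", [""])[0]
    if value == "":
        # Empty and absent are indistinguishable here, so no filter is applied.
        return list(PREFIXES)
    return [p for p in PREFIXES if p["pending_delete"]]


def cleanup_targets(query_string: str) -> list:
    """Hypothetical cleanup step: everything the API returns is queued for deletion."""
    return [p["cidr"] for p in list_prefixes_lenient(query_string)]


print(cleanup_targets("pending_delete=true"))  # ['198.51.100.0/24']
print(cleanup_targets("pending_delete="))      # all three prefixes
```

In this sketch the client believes it asked for a filtered list, the server believes no filter was requested, and neither side reports an error.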

The cleanup logic was a new automation built to replace a manual customer-prefix-removal workflow as part of the Code Orange: Fail Small initiative.

Staging environment data and existing tests did not exercise the code path where the task-runner service modifies user data without explicit input, so the bug was not caught before production.

Customer addressing state and operational state share the same authoritative database, so there was no clean snapshot to roll back to and engineers had to manually rebuild service bindings.

Resolution

Engineers terminated the runaway sub-task at 18:46 UTC and disabled its scheduled execution. Most affected customers self-remediated by toggling their prefixes in the dashboard. About 300 prefixes had their service bindings completely removed and required a global edge configuration rollout, which completed at 23:03 UTC.

Lessons

  • API endpoints that accept filter parameters should treat an empty value as a malformed request, not as 'no filter'; permissive defaults on destructive operations are catastrophic.
  • Statically typed schemas for query parameters would have made it harder for client and server to disagree on what an empty string means.
  • Staging fidelity must include not just data shape but also automated task-runner behaviors that mutate user state without explicit input.
  • Sharing one authoritative database for both customer-configured state and operational state makes targeted rollback impossible during incidents.
  • Automated processes that touch BGP at scale need circuit breakers tied to volume and rate-of-change, regardless of whether the change looks valid at the per-record level.
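The first two lessons can be sketched as a typed query parser that rejects a present-but-empty filter. This is a minimal Python illustration under assumed names (PrefixListQuery, parse_prefix_list_query, and InvalidQuery are hypothetical), not the schema Cloudflare adopted.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs


class InvalidQuery(ValueError):
    """Raised when a query parameter is present but malformed."""


@dataclass(frozen=True)
class PrefixListQuery:
    # None means the filter was omitted; True or False means it was given explicitly.
    pending_delete: Optional[bool] = None


_BOOL_VALUES = {"true": True, "false": False}


def parse_prefix_list_query(query_string: str) -> PrefixListQuery:
    """Parse the query string into a typed object, rejecting empty or unknown input."""
    params = parse_qs(query_string, keep_blank_values=True)
    unknown = set(params) - {"pending_delete"}
    if unknown:
        raise InvalidQuery(f"unknown parameters: {sorted(unknown)}")
    if "pending_delete" not in params:
        return PrefixListQuery()
    values = params["pending_delete"]
    if len(values) != 1 or values[0] not in _BOOL_VALUES:
        # An empty string lands here: it is a malformed request, not "no filter".
        raise InvalidQuery(f"pending_delete must be 'true' or 'false', got {values!r}")
    return PrefixListQuery(pending_delete=_BOOL_VALUES[values[0]])
```

With this shape, an omitted filter and an explicit filter produce distinct typed values, and the ambiguous empty case fails loudly before any listing or deletion logic runs.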

Action items

  • Standardize the API schema so empty parameters are no longer interpreted as missing filters and can be validated by tooling.
  • Separate operational state from configured state by snapshotting the database and applying snapshots through the same health-mediated deployment system used for binaries.
  • Add a circuit breaker that monitors the rate and breadth of BGP prefix changes and halts deployment when thresholds are crossed.
  • Extend customer-traffic health monitoring as an additional signal that can trip the circuit breaker before damage propagates.
  • Continue the Code Orange: Fail Small workstreams, prioritizing controlled rollouts for any change that propagates to the network.
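The circuit-breaker action item could take roughly the following shape. This is a hedged Python sketch: the class name, thresholds, and the idea of calling the breaker once per withdrawal are assumptions for illustration, and real thresholds would have to come from observed baseline change rates.

```python
from collections import deque
from dataclasses import dataclass, field


class BreakerTripped(RuntimeError):
    """Raised when a withdrawal would exceed the rate or breadth threshold."""


@dataclass
class WithdrawalBreaker:
    total_prefixes: int            # size of the fleet the fraction is measured against
    max_per_window: int = 10       # rate limit: withdrawals allowed per window
    window_seconds: float = 60.0
    max_fraction: float = 0.01     # breadth limit: share of all prefixes withdrawn
    tripped: bool = False
    _recent: deque = field(default_factory=deque)
    _withdrawn: int = 0

    def allow(self, now: float) -> None:
        """Record one withdrawal at time `now`, or raise if a threshold is crossed."""
        if self.tripped:
            raise BreakerTripped("breaker is open; manual reset required")
        # Forget withdrawals that have aged out of the rate window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        too_fast = len(self._recent) + 1 > self.max_per_window
        too_broad = (self._withdrawn + 1) / self.total_prefixes > self.max_fraction
        if too_fast or too_broad:
            self.tripped = True
            raise BreakerTripped("withdrawal rate or breadth threshold exceeded")
        self._recent.append(now)
        self._withdrawn += 1
```

A caller would invoke allow() before each withdrawal and stop the whole job on BreakerTripped. The breaker deliberately ignores whether each individual change looks valid, since every deletion in this incident was valid at the per-record level.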