Third-party storage outage takes Workers KV offline and cascades through Access, WARP, and Dashboard
Cloudflare
- Started: Jun 12, 2025
- Duration: 2h 36m
- Users affected: Not disclosed
- Revenue impact: Not disclosed
- Blast radius: Global. Workers KV and every Cloudflare service that depended on it (Access, WARP, Gateway, Dashboard, Images, Stream, Workers AI, Turnstile, Pages, D1, Durable Objects, Queues, AI Gateway, and more)
- Services: workers-kv, access, warp, gateway, dashboard, turnstile, stream, images, workers-ai, durable-objects, d1, pages, queues, ai-gateway
Summary
Workers KV's central storage backend, partially backed by a third-party cloud provider, suffered an outage that took Workers KV offline. Because Workers KV is a foundational dependency for many Cloudflare products, the failure cascaded across the platform: Access failed 100 percent of identity-based logins, WARP could not register new clients, the Dashboard could not authenticate via Turnstile or OIDC, and a long list of products including Workers AI, Stream, Pages, D1, and Durable Objects experienced significant errors. Engineers worked in parallel on bypasses and on accelerating an already-planned migration of KV onto more redundant infrastructure. Services began to recover when the third-party storage came back online at 20:23 UTC, and service-level objectives returned to baseline at 20:28 UTC.
Impact
Workers KV saw a 90.22 percent failure rate for the duration of the incident. Cascading failures hit dozens of services: 100 percent of Access identity-based logins failed, all WARP device registrations failed, dashboard logins failed across SSO and OIDC paths, Stream and Workers AI saw near-100 percent error rates, and Pages builds could not complete. Cache, DNS, Magic Transit, and the v4 API were unaffected.
Root cause
Workers KV's coreless runtime relies on a central data store as the source of truth for cold reads and writes; that store is partially backed by a third-party cloud provider that experienced an outage.
Workers KV was mid-migration to more resilient storage infrastructure (including R2) and had a coverage gap during the transition that this incident exposed.
Cloudflare's design principle of building services on its own platform meant many products had a transitive dependency on Workers KV, so the blast radius far exceeded KV itself.
Several products including Access and Gateway are designed to fail closed when policy or identity data cannot be retrieved, which is correct for security but amplified the impact when KV was unavailable.
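The fail-closed behavior described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `is_allowed`, `FailureMode`, `StoreUnavailable`, and the two stand-in stores are hypothetical names, not Cloudflare's code. The point is only that the two modes differ in exactly one branch, the one taken when the backing store cannot be reached.

```python
from enum import Enum
from typing import Callable, Optional


class FailureMode(Enum):
    FAIL_CLOSED = "fail_closed"  # deny when the dependency is unavailable
    FAIL_OPEN = "fail_open"      # allow when the dependency is unavailable


class StoreUnavailable(Exception):
    """Raised by the (hypothetical) policy lookup when the store is down."""


def is_allowed(
    user: str,
    lookup_policy: Callable[[str], Optional[bool]],
    mode: FailureMode,
) -> bool:
    """Return the access decision for `user`.

    `lookup_policy` returns True/False for a known user, None for an unknown
    one, and raises StoreUnavailable when the store cannot be reached.
    """
    try:
        decision = lookup_policy(user)
    except StoreUnavailable:
        # The only branch where the two modes behave differently.
        return mode is FailureMode.FAIL_OPEN
    # Unknown users (None) are denied in both modes.
    return bool(decision)


def healthy_store(user: str) -> Optional[bool]:
    # Stand-in for a reachable policy store (illustrative data only).
    return {"alice": True, "bob": False}.get(user)


def broken_store(user: str) -> Optional[bool]:
    # Stand-in for a policy store whose backend is unavailable.
    raise StoreUnavailable("backing store unreachable")
```

With `broken_store`, a fail-closed caller denies every request, which matches the outcome reported for identity-based logins, while a fail-open caller would admit every request. That is why fail-open is rarely acceptable for authentication paths.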
Resolution
Engineers raised the incident's priority at 18:21 UTC and, at 19:09 UTC, began reducing impact by gracefully degrading Gateway rules that depended on identity or device posture. At 19:32 UTC, Access and Device Posture began dropping identity and device posture requests to shed load until the third-party storage recovered. The third-party service began recovering at 20:23 UTC; Access restored its KV calls at 20:25 UTC; service-level objectives returned to baseline at 20:28 UTC.
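The load shedding described above appears to have been a deliberate operator action. An automated analogue is a circuit breaker, sketched below under stated assumptions: the `CircuitBreaker` class, its thresholds, and the injected clock are hypothetical and are not taken from Cloudflare's systems. After a run of consecutive failures, callers stop sending requests to the struggling dependency until a cooldown has elapsed.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Drop calls to a failing dependency until a cooldown has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so the behavior can be tested
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means "closed"

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, requests are allowed again; a
        # failure at that point re-opens the breaker immediately.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before each call to the dependency, then report the outcome with `record_success()` or `record_failure()`. While the breaker is open, the caller applies its fail-open or fail-closed policy locally and adds no load to the backend.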
Lessons
- Building services on a single shared key-value store creates massive transitive blast radius when that store fails, even if every downstream service is internally well-designed.
- Single-vendor storage dependencies, even with redundant configurations, are a single point of failure if the vendor itself goes down regionally.
- Fail-closed behavior is correct for security-critical paths like authentication, but designers should decide deliberately which products fail open and which fail closed when a dependency is unavailable.
- Thundering herd on recovery is predictable: when KV came back, repopulating caches caused infrastructure rate-limit pressure and a non-negligible tail of errors.
- Mid-migration coverage gaps are a known operational risk; the migration timeline itself becomes a reliability factor.
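One common way to soften the thundering herd mentioned in the lessons above is exponential backoff with full jitter, so that clients repopulating caches do not all retry at the same instant. The sketch below is a generic illustration, not the mechanism Cloudflare used; the function name and default parameters are assumptions.

```python
import random
from typing import List, Optional


def full_jitter_delays(
    attempts: int,
    base_s: float = 0.1,
    cap_s: float = 30.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one randomized retry delay (in seconds) per attempt.

    For attempt i, the delay is drawn uniformly from
    [0, min(cap_s, base_s * 2**i)].
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

Because each client draws its delays independently, retries spread across the whole backoff window instead of arriving in synchronized waves, which lowers peak load on a backend that has only just recovered.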
Action items
- Accelerate Workers KV's migration onto multi-provider storage infrastructure to remove dependency on any single provider.
- Add product-level blast-radius mitigations so individual products are resilient to single-point-of-failure events in shared dependencies.
- Build progressive re-enable tooling for KV namespaces during storage incidents so critical dependencies (Access, WARP) recover first without DDoSing the recovering backend.
- Document and revisit the fail-open vs. fail-closed decision for every product that depends on Workers KV.
- Audit transitive Workers KV dependencies across all Cloudflare products.
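The dependency audit in the last action item amounts to a graph traversal: invert the "depends on" edges, then walk outward from the shared component. The sketch below assumes a hand-written dependency map. The edges in `DEPENDS_ON` are illustrative only and do not describe Cloudflare's real architecture, and `transitive_dependents` is a hypothetical helper name.

```python
from collections import deque
from typing import Dict, List, Set

# HYPOTHETICAL map: service -> services it calls directly.
DEPENDS_ON: Dict[str, List[str]] = {
    "access": ["workers-kv"],
    "turnstile": ["workers-kv"],
    "dashboard": ["access", "turnstile"],
    "warp": ["access"],
    "dns": [],
}


def transitive_dependents(
    target: str, depends_on: Dict[str, List[str]]
) -> Set[str]:
    """Return every service that depends on `target`, directly or indirectly."""
    # Invert the edges: dependency -> services that call it directly.
    dependents: Dict[str, Set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the target.
    seen: Set[str] = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)

    # A dependency cycle can lead back to the target; exclude it.
    seen.discard(target)
    return seen
```

On the illustrative map, `transitive_dependents("workers-kv", DEPENDS_ON)` includes `dashboard` and `warp` even though neither lists `workers-kv` as a direct dependency. This is the kind of hidden coupling such an audit is meant to surface.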