SEV-1 · public access

Third-party storage outage takes Workers KV offline and cascades through Access, WARP, and Dashboard

Cloudflare · Source

Started: Jun 12, 2025
Duration: 2h 36m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: Global: Workers KV and every Cloudflare service that depended on it (Access, WARP, Gateway, Dashboard, Images, Stream, Workers AI, Turnstile, Pages, D1, Durable Objects, Queues, AI Gateway, and more)
Services: workers-kv, access, warp, gateway, dashboard, turnstile, stream, images, workers-ai, durable-objects, d1, pages, queues, ai-gateway

cascading failure, dependency outage, object storage, supply chain, third-party outage, thundering herd

Summary

Workers KV's central storage backend, partially backed by a third-party cloud provider, suffered an outage that took Workers KV offline. Because Workers KV is a foundational dependency for many Cloudflare products, the failure cascaded across the platform: Access failed 100 percent of identity-based logins, WARP could not register new clients, the Dashboard could not authenticate via Turnstile or OIDC, and a long list of products including Workers AI, Stream, Pages, D1, and Durable Objects experienced significant errors. Engineers worked in parallel on bypasses and on accelerating an already-planned migration of KV onto more redundant infrastructure. Services began recovering when the third-party storage came back online at 20:23 UTC, and service-level objectives returned to baseline at 20:28 UTC.

Impact

Workers KV saw a 90.22 percent failure rate for the duration of the incident. Cascading failures hit dozens of services: 100 percent of Access identity-based logins failed, all WARP device registrations failed, dashboard logins failed across SSO and OIDC paths, Stream and Workers AI saw near-100 percent error rates, and Pages builds could not complete. Cache, DNS, Magic Transit, and the v4 API were unaffected.

Root cause

Workers KV's coreless runtime relies on a central data store as the source of truth for cold reads and writes; that store is partially backed by a third-party cloud provider that experienced an outage.

Workers KV was mid-migration to more resilient storage infrastructure (including R2) and had a coverage gap during the transition that this incident exposed.

Cloudflare's design principle of building services on its own platform meant many products had a transitive dependency on Workers KV, so the blast radius far exceeded KV itself.
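The transitive dependency described above can be illustrated with a small graph walk. This is a minimal sketch, not Cloudflare's tooling: the `DEPENDS_ON` map, its edges, and the `transitive_dependents` helper are hypothetical and chosen only to show how a single shared store reaches services that never call it directly.

```python
from collections import deque

# Hypothetical, simplified map: service -> services it calls directly.
# The edges are illustrative assumptions, not Cloudflare's real architecture.
DEPENDS_ON = {
    "dashboard": ["turnstile", "access"],
    "turnstile": ["workers-kv"],
    "access": ["workers-kv"],
    "pages": ["workers-kv"],
    "dns": [],
    "workers-kv": ["central-store"],
    "central-store": [],
}


def transitive_dependents(graph, target):
    """Return every service that depends on `target`, directly or indirectly."""
    # Invert the edges so we can walk from a dependency to its consumers.
    reverse = {}
    for service, deps in graph.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(service)

    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for consumer in reverse.get(current, ()):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen


if __name__ == "__main__":
    print(sorted(transitive_dependents(DEPENDS_ON, "workers-kv")))
```

In this toy graph, `dashboard` never calls `workers-kv` itself yet still appears in the result, which is the shape of blast radius the paragraph describes.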

Several products including Access and Gateway are designed to fail closed when policy or identity data cannot be retrieved, which is correct for security but amplified the impact when KV was unavailable.
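To make the fail-closed trade-off concrete, here is a minimal sketch of an authorization check whose behavior during a store outage is an explicit parameter. All names (`FailureMode`, `StoreUnavailable`, `is_allowed`, and the two stand-in stores) are hypothetical; the source does not describe how Access or Gateway implement this.

```python
from enum import Enum


class FailureMode(Enum):
    FAIL_CLOSED = "fail_closed"  # deny when policy data cannot be read
    FAIL_OPEN = "fail_open"      # allow when policy data cannot be read


class StoreUnavailable(Exception):
    """Raised by the (hypothetical) policy store when it cannot be reached."""


def is_allowed(user, fetch_policy, mode):
    """Decide whether `user` may proceed.

    `fetch_policy` is a callable returning the set of allowed users; it may
    raise StoreUnavailable. `mode` decides the outcome in that case.
    """
    try:
        allowed_users = fetch_policy()
    except StoreUnavailable:
        # The store is down: the configured failure mode decides the outcome.
        return mode is FailureMode.FAIL_OPEN
    return user in allowed_users


def healthy_store():
    return {"alice"}


def broken_store():
    raise StoreUnavailable("backing store unreachable")


if __name__ == "__main__":
    print(is_allowed("alice", broken_store, FailureMode.FAIL_CLOSED))
    print(is_allowed("alice", broken_store, FailureMode.FAIL_OPEN))
```

With `FAIL_CLOSED`, a legitimate user is rejected during the outage, which is the safe choice for authentication and also the reason a storage failure turns into a login failure.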

Resolution

Engineers escalated the incident from P1 to P0 at 18:21 UTC. At 19:09 UTC, Gateway began reducing its dependency on Workers KV by gracefully degrading rules that depended on identity or device posture. At 19:32 UTC, Access and Device Posture began dropping identity and device posture requests to shed load until the third-party storage recovered. The third-party service began recovering at 20:23 UTC; Access restored its KV calls at 20:25 UTC; service-level objectives returned to baseline at 20:28 UTC.
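The source does not say how Gateway degraded its rules, so the following is only one plausible reading, sketched under stated assumptions: each rule carries a hypothetical `needs_identity` flag, and while the identity store is unhealthy the evaluator skips flagged rules instead of failing the whole request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    needs_identity: bool  # hypothetical flag: rule references identity or device posture


def active_rules(rules, identity_store_healthy):
    """Return the rules to evaluate, skipping identity-dependent ones while degraded."""
    if identity_store_healthy:
        return list(rules)
    # Degraded mode: keep only rules that can be evaluated without the store.
    return [rule for rule in rules if not rule.needs_identity]


RULES = [
    Rule("block-malware-domains", needs_identity=False),
    Rule("finance-team-only", needs_identity=True),
    Rule("require-managed-device", needs_identity=True),
]

if __name__ == "__main__":
    print([rule.name for rule in active_rules(RULES, identity_store_healthy=False)])
```

Whether skipping a rule is acceptable depends on what the rule protects, which is the same fail-open versus fail-closed decision applied per rule rather than per product.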

Timeline

17:52 DETECT
WARP team observes new device registrations failing and declares an incident.
warp
18:05 DETECT
Cloudflare Access team is alerted to a rapid increase in error rates; service-level objectives drop below target across multiple services.
access
18:06 INVEST
Multiple service-specific incidents are merged into a single P1 incident as Workers KV unavailability is identified as the shared cause.
workers-kv
18:21 INVEST
Incident upgraded from P1 to P0 as severity becomes clear.
workers-kv
18:43 MITIG
Cloudflare Access begins exploring options to migrate off Workers KV to a different backing datastore.
access
19:09 MITIG
Zero Trust Gateway begins removing dependencies on Workers KV by gracefully degrading rules that referenced identity or device posture.
gateway
19:32 MITIG
Access and Device Posture force-drop identity and device-posture requests to shed load on Workers KV until the third-party service comes back online.
access
20:23 MITIG
Storage infrastructure begins recovering; services begin to recover, though non-negligible error rates and rate limits persist as caches repopulate.
workers-kv
20:25 MITIG
Access and Device Posture restore calls to Workers KV after the third-party service is restored.
access
20:28 RESOLV
Service-level objectives return to pre-incident levels; impact ends.
workers-kv
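As a quick consistency check, the stated duration of 2h 36m matches the gap between the first and last timeline entries. The sketch below assumes both timestamps are UTC on Jun 12, 2025, as the Resolution section implies.

```python
from datetime import datetime

# First and last timeline entries, assumed to be UTC on the same day.
start = datetime(2025, 6, 12, 17, 52)
end = datetime(2025, 6, 12, 20, 28)

duration = end - start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours}h {minutes}m")  # prints "2h 36m"
```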

Attribution

Cloudflare

By Workers KV / Platform

Published Jun 12, 2025

Lessons

Building services on a single shared key-value store creates massive transitive blast radius when that store fails, even if every downstream service is internally well-designed.
Single-vendor storage dependencies, even with redundant configurations, are a single point of failure if the vendor itself goes down regionally.
Fail-closed behavior is correct for security-critical paths like authentication, but designers should consciously decide which products fail open and which fail closed when a dependency is unavailable.
Thundering herd on recovery is predictable: when KV came back, repopulating caches caused infrastructure rate-limit pressure and a non-negligible tail of errors.
Mid-migration coverage gaps are a known operational risk; the migration timeline itself becomes a reliability factor.
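One common way to soften the thundering herd noted in the lessons above is for clients to retry with randomized exponential backoff, so cache refills spread out instead of arriving together. This is a generic sketch of the "full jitter" variant, not something the source says Cloudflare used; `backoff_delays` and its parameters are hypothetical names.

```python
import random


def backoff_delays(attempts, base=0.1, cap=10.0, rng=None):
    """Full-jitter exponential backoff.

    The delay before retry i is drawn uniformly from
    [0, min(cap, base * 2**i)] seconds, so callers that failed at the same
    moment do not all retry at the same moment.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * (2 ** i))) for i in range(attempts)]


if __name__ == "__main__":
    # A fixed seed is used here only to make the demo repeatable.
    for attempt, delay in enumerate(backoff_delays(5, rng=random.Random(0))):
        print(f"retry {attempt}: wait {delay:.3f}s")
```

The cap bounds the worst-case wait, while the jitter decorrelates clients. Neither replaces server-side rate limiting on the recovering backend.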

Action items

Accelerate Workers KV's migration onto multi-provider storage infrastructure to remove dependency on any single provider.
Add product-level blast-radius mitigations so individual products are resilient to single-point-of-failure events in shared dependencies.
Build progressive re-enable tooling for KV namespaces during storage incidents so critical dependencies (Access, WARP) recover first without DDoSing the recovering backend.
Document and revisit the fail-open vs. fail-closed decision for every product that depends on Workers KV.
Audit transitive Workers KV dependencies across all Cloudflare products.
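The progressive re-enable item above could be sketched as a priority-ordered batching plan. This is a hypothetical illustration only: the `reenable_plan` helper, the priority numbers, and the namespace names are assumptions, and a real tool would also gate each batch on backend health.

```python
def reenable_plan(namespaces, batch_size):
    """Group namespaces into batches, most critical first.

    `namespaces` is a list of (name, priority) pairs where a lower priority
    number means more critical. Ties keep their input order because
    Python's sort is stable.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ordered = sorted(namespaces, key=lambda item: item[1])
    return [
        [name for name, _ in ordered[i:i + batch_size]]
        for i in range(0, len(ordered), batch_size)
    ]


# Hypothetical priorities: 0 = critical dependency, 2 = can wait.
NAMESPACES = [
    ("pages", 2),
    ("access", 0),
    ("warp", 0),
    ("stream", 2),
    ("dashboard", 1),
]

if __name__ == "__main__":
    for step, batch in enumerate(reenable_plan(NAMESPACES, batch_size=2), start=1):
        print(f"step {step}: enable {batch}")
```

An operator, or an automated controller, would enable one batch, watch error rates and backend load, and only then move to the next, so critical consumers recover first without a simultaneous flood of cold reads.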