SEV-1 · public access

useEffect dependency bug overwhelms Tenant Service API and breaks dashboard logins

Cloudflare · Source

Started
Sep 12, 2025
Duration
1h 15m
Users affected
Not disclosed
Revenue impact
Not disclosed
Blast radius
Cloudflare Dashboard and APIs that depend on Tenant Service authorization (control plane only, data plane unaffected)
Services
dashboard, tenant-api, control-plane
deploy, frontend, partial outage, rate limit misconfigured, thundering herd

Summary

A new dashboard release at 16:32 UTC included a React useEffect hook whose dependency array contained an object with a non-stable reference, so the hook re-ran far more often than intended and flooded the /organizations endpoint with calls and retries. A Tenant Service deployment that began at 17:50 UTC coincided with this elevated load, and together they overwhelmed the service, which sits in the API authorization path. Because authorization failed, many APIs returned 5xx errors and the Cloudflare Dashboard became unavailable. A subsequent patch to the Tenant Service made things worse and was reverted, after which dashboard availability fully recovered.

Impact

The Cloudflare Dashboard was severely impacted for the entire 75-minute incident, and the Cloudflare API was severely impacted in two windows when the Tenant Service was unhealthy. Users could not log in or perform configuration changes. Data plane services (CDN, DNS, Cache, WAF) were not affected because of the strict separation of concerns between control and data planes.

Root cause

A dashboard release included a React useEffect hook with an unstable object in its dependency array. React compares dependencies by reference, so an object recreated on every state or prop change always looks new, and the effect re-ran its API call many times during a single dashboard render instead of once.
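The reference comparison behind this failure mode can be illustrated without React itself. The sketch below is a simplified model, not React's actual implementation: `depsChanged` and `countEffectRuns` are hypothetical helpers, and `orgId` is an illustrative value. It assumes only that each dependency is compared to its previous value with `Object.is`, which is how React documents the comparison.

```typescript
// Simplified model of how an effect's dependency array is checked:
// each dependency is compared to its previous value with Object.is.
type Deps = ReadonlyArray<unknown>;

function depsChanged(prev: Deps | null, next: Deps): boolean {
  if (prev === null || prev.length !== next.length) return true;
  for (let i = 0; i < next.length; i++) {
    if (!Object.is(prev[i], next[i])) return true;
  }
  return false;
}

// Counts how many times an effect would run across `renders` renders,
// given a function that builds the dependency array on each render.
function countEffectRuns(renders: number, makeDeps: () => Deps): number {
  let prev: Deps | null = null;
  let runs = 0;
  for (let i = 0; i < renders; i++) {
    const next = makeDeps();
    if (depsChanged(prev, next)) runs++;
    prev = next;
  }
  return runs;
}

const orgId = "org-123"; // hypothetical identifier

// Unstable: a fresh object literal is created on every render,
// so the reference never matches the previous one.
const unstableRuns = countEffectRuns(5, () => [{ orgId }]);

// Stable: depend on the primitive value itself.
const stableRuns = countEffectRuns(5, () => [orgId]);

console.log(unstableRuns, stableRuns); // 5 1
```

In a real component the usual remedies are to depend on primitive fields rather than the containing object, or to memoize the object (for example with useMemo) so its reference only changes when its contents do.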

The dashboard's retry policy re-sent the request on every failure, which amplified its own load on the Tenant Service and fed the storm.

A Tenant Service deployment was rolling out concurrently, compounding instability and preventing the service from recovering on its own.

When the Tenant Service was restarted, every dashboard re-authenticated simultaneously, creating a thundering herd that knocked the service down again.
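One common way to keep clients from reconnecting in lockstep is exponential backoff with full jitter. The sketch below is an illustration under stated assumptions, not the dashboard's actual retry code: the function names, the 500 ms base, the 30 s cap, and the retry limit are all hypothetical. The random source is injectable so the delay calculation can be tested deterministically.

```typescript
// Full-jitter exponential backoff: the delay is drawn uniformly from
// [0, min(capMs, baseMs * 2^attempt)), so clients that failed at the
// same moment spread their retries out instead of retrying together.
function backoffDelayMs(
  attempt: number, // 0 for the first retry
  baseMs: number = 500, // assumed base delay
  capMs: number = 30000, // assumed maximum delay
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Retries `request` with jittered backoff and a bounded number of attempts.
async function fetchWithBackoff<T>(
  request: () => Promise<T>,
  maxRetries: number = 5, // assumed limit
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await request();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // give up rather than retry forever
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

The cap bounds the worst-case wait, and the retry limit ensures a persistent outage eventually surfaces as an error to the user instead of generating unbounded background load.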

Tenant Service was not yet on Argo Rollouts, so the bad follow-up deploy was not rolled back automatically; capacity headroom on Tenant Service was also insufficient for unexpected load spikes.

Resolution

Engineers added Tenant Service capacity, installed a global rate limit, and ultimately reverted a degrading patch that had been applied during the incident. Dashboard availability returned to 100 percent at 19:12 UTC.
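The source does not describe how the global rate limit was implemented. As one common approach, a token bucket admits short bursts while capping the sustained request rate. The sketch below is a minimal illustration; the class name and parameters are hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```typescript
// Minimal token bucket: holds up to `capacity` tokens and refills at
// `refillPerSecond`. Each admitted request consumes one token.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request may proceed, false if it should be
  // rejected (for example with HTTP 429).
  tryAcquire(nowMs: number): boolean {
    if (nowMs > this.lastRefillMs) {
      const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
      this.tokens = Math.min(
        this.capacity,
        this.tokens + elapsedSec * this.refillPerSecond,
      );
      this.lastRefillMs = nowMs;
    }
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Usage: allow bursts of 2 requests, refilling 1 token per second.
const bucket = new TokenBucket(2, 1, 0);
const decisions = [
  bucket.tryAcquire(0),
  bucket.tryAcquire(0),
  bucket.tryAcquire(0),
  bucket.tryAcquire(1000),
  bucket.tryAcquire(1000),
];
console.log(decisions); // [ true, true, false, true, false ]
```

A limit enforced in front of a shared authorization service trades a small fraction of rejected requests for keeping the service itself alive, which is usually the better outcome during a retry storm.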

Lessons

  • React useEffect dependency arrays need stable references; in dashboards that fan out to internal services, an unstable dependency can amount to a denial-of-service attack on the backend.
  • Authorization services like Tenant Service have outsized blast radius and need both autoscaling headroom and progressive deployment.
  • Recovery thundering herds are a predictable failure mode; clients reconnecting after an outage need jitter and backoff, especially for control-plane traffic.
  • Mid-incident fixes that bypass review can degrade service further; the bias should be toward rollback and stabilization first.
  • Strict control-plane / data-plane separation paid off here: customers' production traffic was unaffected even though their dashboards were down.

Action items

  • Migrate Tenant Service to Argo Rollouts so that future bad deploys are rolled back automatically when errors are detected.
  • Add jittered random delays to dashboard retries to spread out reconnection storms after outages.
  • Allocate substantially more capacity to Tenant Service and add proactive alerts before capacity limits are hit.
  • Add dashboard request metadata that distinguishes new requests from retries to make future incident triage faster.
  • Audit other dashboard useEffect hooks for unstable dependency arrays.
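
The retry-metadata action item could be sketched as follows. The source does not say what form the metadata will take; the header name `X-Retry-Attempt` and both helper functions are assumptions made purely for illustration.

```typescript
// Hypothetical header name; the source does not specify the metadata format.
const RETRY_HEADER = "X-Retry-Attempt";

// Client side: tag each outgoing request with its attempt number
// (0 = original request, 1 or more = retry). Returns a new object
// and leaves the caller's headers untouched.
function withRetryMetadata(
  headers: Record<string, string>,
  attempt: number,
): Record<string, string> {
  return { ...headers, [RETRY_HEADER]: String(attempt) };
}

// Server or log-analysis side: classify a request as a retry.
// Note: real HTTP header names are case-insensitive; this sketch
// assumes the key is stored exactly as RETRY_HEADER.
function isRetry(headers: Record<string, string>): boolean {
  const value = headers[RETRY_HEADER];
  return value !== undefined && Number(value) > 0;
}

const first = withRetryMetadata({ Accept: "application/json" }, 0);
const second = withRetryMetadata({ Accept: "application/json" }, 2);
console.log(isRetry(first), isRetry(second)); // false true
```

With a marker like this in request logs, responders can separate organic traffic from retry amplification early in triage instead of inferring it from request volume alone.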