Wrangler env flag omitted during R2 credential rotation deploys new keys to dev Worker, breaks production
Cloudflare · Source
- Started
- Mar 21, 2025
- Duration
- 1h 7m
- Users affected
- Not disclosed
- Revenue impact
- Not disclosed
- Blast radius
- Global: R2 customers and dependent services (Cache Reserve, Images, Stream, Logpush, Vectorize, Email Security metrics, Billing invoices)
- Services
- r2, r2-gateway, wrangler, cache-reserve, images, stream, logpush, vectorize
Join the waitlist
Aftermath helps you ship structured post-mortems like this one for your own incidents. Encore keeps narrative, timeline, lessons, and action items in one place so the document stays useful after the incident is closed. Join the waitlist on the homepage when you want that workflow for your organization.
Summary
During a routine R2 credential rotation, an engineer ran `wrangler secret put` and `wrangler deploy` without the `--env production` flag. Both commands default to the default environment, so the new credentials landed on a non-production R2 Gateway Worker while the production Worker continued using the old credentials. When the old credentials were deleted from storage as the final step of rotation, the production Gateway lost its ability to authenticate. Investigation took longer than necessary because there was no observability tying credential ID to the live Gateway Worker, so engineers spent over an hour suspecting credential propagation issues before discovering the wrong-environment deploy.
Impact
For 67 minutes, all R2 write operations failed and approximately 35 percent of read operations failed globally. Operations involving only metadata (head, list) were unaffected. Cascading impact hit Stream (94 percent successful video segment delivery, 100 percent of Stream Live failed), Images (100 percent of uploads failed), Vectorize (all inserts/upserts failed), Logpush (up to 70 minutes of log delivery delay), and customer access to past Cloudflare invoices. There was no data loss or corruption.
Root cause
Both `wrangler secret put` and `wrangler deploy` default to the default environment when `--env` is not specified; this default behavior is hostile to production-critical workflows.
The credential rotation runbook required manual command entry rather than running through hotfix release tooling that enforces environment configuration and other safety checks.
There was no logging tag linking the credential ID actually in use to the production Worker, so engineers could not directly verify which token the Gateway was authenticating with.
The propagation delay between deleting the old storage credentials and observing impact created an additional 60 minutes of confusion, during which the team suspected propagation issues rather than a wrong-environment deploy.
The rotation procedure did not require explicit human verification of the new token's suffix in storage logs before the old token was deleted.
Resolution
After exploring credential propagation as the suspected cause and creating a fresh credential pair (which still landed on the wrong Worker because the missing --env flag was repeated), engineers reviewed production Worker release history and discovered the wrong-environment deploy. Credentials were redeployed to the production Worker with `--env production`, and R2 availability recovered immediately at 22:45 UTC.
Lessons
- CLI tools whose default targets a non-production environment but whose successful invocation looks identical to a successful production deploy are a major operational hazard.
- Observability that ties live state (which token is in use) back to deploy history would have shortened this incident from 67 minutes to a few minutes.
- Credential rotation procedures should be automated through hotfix tooling rather than manual command entry; humans omit flags under time pressure.
- Steady-state propagation delays between control-plane changes and data-plane effects are a known confounder during incident response and should be documented.
- Two-person verification of destructive credential rotation steps is the kind of process that always seems redundant until it's the thing that prevents an outage.
Action items
- Add logging tags that include the suffix of the credential ID the R2 Gateway Worker uses to authenticate with storage infrastructure.
- Require explicit confirmation that the new token suffix matches storage logs before deleting the previous token.
- Move key rotation to hotfix release tooling that enforces environment configuration and other safety checks.
- Update SOPs to explicitly require two humans to validate every step of credential rotation.
- Build a closed-loop health check system that tests new keys, alerts on status, and confirms global propagation before releasing the Gateway Worker.
- Update observability to include views of upstream storage success rates that bypass caching, giving clearer signal during similar incidents.