SEV-1 · public access

Killswitch on a never-tested rule type triggers nil-value Lua exception in legacy proxy during React vulnerability mitigation

Cloudflare · Source

Started: Dec 5, 2025
Duration: 25m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: About 28% of HTTP traffic served by Cloudflare globally; specifically customers on FL1 with the Cloudflare Managed Ruleset enabled
Services: waf, fl1-proxy, cdn, rulesets
configuration change · configuration error · partial outage · regression from deploy

Summary

While rolling out a buffer-size increase to mitigate the React Server Components remote code execution vulnerability (CVE-2025-55182), engineers discovered that an internal WAF testing tool did not support the larger 1MB buffer. Disabling the testing tool through Cloudflare's global configuration system propagated network-wide within seconds and triggered a latent bug in the Lua-based FL1 proxy: a killswitch had never been applied to an 'execute' rule action before, and the post-evaluation code assumed the resulting object would always exist. The nil-value lookup raised an exception and caused FL1 to return HTTP 500 errors for all affected customers for 25 minutes.

Impact

About 28 percent of HTTP traffic served by Cloudflare returned HTTP 500 errors during the incident. Only customers whose web assets were still served by the older FL1 proxy and who had the Cloudflare Managed Ruleset deployed were affected; FL2 customers, China network customers, and customers without that ruleset were not impacted.

Root cause

An internal WAF rule-testing tool did not support the larger 1MB buffer size being rolled out to mitigate CVE-2025-55182, so engineers chose to disable the tool rather than block the security rollout.

The disable was implemented through Cloudflare's global configuration system, which propagates changes to the entire fleet within seconds and does not perform gradual rollouts.

Before this change, FL1's rulesets module had never had a killswitch applied to a rule with the 'execute' action. The killswitch correctly skipped evaluation of the rule, but the post-loop code still accessed rule_result.execute unconditionally, and that field was nil because the rule had never run.
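
The failure mode described above can be sketched in a few lines. This is a Python illustration, not the FL1 Lua source: the names RuleResult, ExecuteInfo, run_rule, finish_buggy and finish_guarded, and the results_index field, are assumptions made for the example. A killswitched rule skips evaluation and never populates execute; the unguarded post-evaluation step then fails on the missing object, while the guarded one checks for it first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecuteInfo:
    # Hypothetical field: index of the sub-ruleset result this rule refers to.
    results_index: int


@dataclass
class RuleResult:
    action: str
    # Only populated when an "execute" rule was actually evaluated.
    execute: Optional[ExecuteInfo] = None


def run_rule(action: str, killswitched: bool) -> RuleResult:
    """Evaluate one rule; a killswitched rule is skipped, so `execute` stays None."""
    result = RuleResult(action=action)
    if action == "execute" and not killswitched:
        result.execute = ExecuteInfo(results_index=0)
    return result


def finish_buggy(result: RuleResult, ruleset_results: list) -> Optional[str]:
    """Post-evaluation step that assumes `execute` exists for every execute rule."""
    if result.action == "execute":
        # Raises AttributeError when the rule was killswitched (execute is None).
        return ruleset_results[result.execute.results_index]
    return None


def finish_guarded(result: RuleResult, ruleset_results: list) -> Optional[str]:
    """Same step, but a skipped rule simply yields no sub-ruleset result."""
    if result.action == "execute" and result.execute is not None:
        return ruleset_results[result.execute.results_index]
    return None
```

In Python the unguarded access surfaces as an AttributeError; the Lua equivalent is the attempt to index a nil value that the proxy logs showed.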

Lua has no static type system that could flag the unguarded field access, so the bug sat undetected for years until this change exercised that code path.

Resilience improvements proposed after the November 18 incident, including health-mediated rollouts and fail-open error handling for configuration changes, had not yet been deployed.
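
As an illustration of what a health-mediated rollout could look like, the sketch below applies a change one stage at a time and rolls everything back if the nodes in the latest stage look unhealthy. It is a simplified assumption-laden example, not Cloudflare's system: staged_rollout, the stage layout, the error_rate callback, and the 1% threshold are all hypothetical, and a real gate would observe health over a time window rather than sampling once.

```python
from typing import Callable, List


def staged_rollout(
    stages: List[List[str]],
    apply_change: Callable[[str], None],
    revert_change: Callable[[str], None],
    error_rate: Callable[[str], float],
    max_error_rate: float = 0.01,
) -> bool:
    """Apply a change stage by stage; revert every applied node if a stage is unhealthy."""
    applied: List[str] = []
    for stage in stages:
        for node in stage:
            apply_change(node)
            applied.append(node)
        # Health gate: inspect the nodes just changed before widening the blast radius.
        if any(error_rate(node) > max_error_rate for node in stage):
            for node in reversed(applied):
                revert_change(node)
            return False
    return True
```

With a gate like this, a change that only breaks one class of node stops at the first stage containing such a node instead of reaching the whole fleet.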

Resolution

Engineers identified the Lua exception in proxy logs and reverted the configuration change at 09:11 UTC. The revert fully propagated by 09:12 UTC, restoring all traffic.

Lessons

  • Killswitch systems need to be tested against every rule action type the rulesets engine supports, not only the common ones; an unexercised killswitch path is just a hidden bug.
  • Skipping gradual rollout for a supporting change because it seems unrelated and has no expected customer impact is exactly how blast radius gets overlooked; past incidents have followed the same path.
  • Latent bugs in dynamically typed code can survive for years; rewriting the same logic in Rust as part of FL2 kept this class of bug out of the new proxy.
  • Security urgency can pressure teams into bypassing safety rails; the system needs to make the safe path the easy path even under time pressure.
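
The first lesson, exercising the killswitch against every action type, lends itself to an exhaustive loop rather than hand-picked cases. The sketch below is a toy model, not the rulesets engine: the Action enum lists only an assumed subset of action types, and evaluate, post_process and check_all_action_types are hypothetical names. The point is that iterating over the enum means a newly added action type is covered automatically.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    # Assumed subset of action types, for illustration only.
    BLOCK = "block"
    LOG = "log"
    SKIP = "skip"
    EXECUTE = "execute"


def evaluate(action: Action, killswitched: bool) -> dict:
    """Toy evaluator: a killswitched rule is skipped and carries no action details."""
    result = {"action": action.value, "applied": not killswitched}
    if action is Action.EXECUTE and not killswitched:
        result["execute"] = {"results_index": 0}
    return result


def post_process(result: dict) -> Optional[int]:
    """Post-evaluation step; must tolerate a missing "execute" entry."""
    execute = result.get("execute")
    return execute["results_index"] if execute is not None else None


def check_all_action_types() -> list:
    """Run every action type with the killswitch off and on; collect any crashes."""
    failures = []
    for action in Action:
        for killswitched in (False, True):
            try:
                post_process(evaluate(action, killswitched))
            except Exception as exc:  # any crash in the data path counts as a failure
                failures.append((action.value, killswitched, repr(exc)))
    return failures
```

A test suite would assert that check_all_action_types() returns an empty list; replacing post_process with an unguarded lookup makes the ("execute", True) combination show up as a failure.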

Action items

  • Add test coverage for killswitch behavior across every supported rule action type, including 'execute'.
  • Apply the Enhanced Rollouts and Versioning workstream from the post-November-18 plan to data used for rapid threat response, including health validation and quick rollback.
  • Replace hard-fail logic across critical data-plane components with fail-open behavior backed by drift prevention.
  • Build streamlined break-glass capabilities so critical operations remain possible during data-plane failures.
  • Lock down all changes to the network until rollback systems and blast-radius mitigations are in place.
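
One possible reading of the fail-open item is "keep serving with the last known-good configuration when a new one cannot be used". The sketch below shows that idea under stated assumptions: FailOpenConfig, its validate callback, and the rejected counter are hypothetical, and the counter stands in only loosely for drift prevention by making rejected updates visible to operators.

```python
from typing import Callable


class FailOpenConfig:
    """Holds the active config and keeps the last known-good one if an update is bad."""

    def __init__(self, initial: dict, validate: Callable[[dict], None]):
        validate(initial)  # the starting configuration must itself be valid
        self._validate = validate
        self.active = initial
        self.rejected = 0  # rejected updates, exposed so drift can be noticed

    def update(self, candidate: dict) -> bool:
        """Switch to `candidate` only if it validates; otherwise keep serving as before."""
        try:
            self._validate(candidate)
        except Exception:
            self.rejected += 1
            return False
        self.active = candidate
        return True
```

The design choice here is that a bad update degrades to "no change" rather than to an error on the request path; whether that is acceptable depends on the component, since a security control that silently keeps an old configuration has its own risks.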