ClickHouse permissions change doubles Bot Management feature file size and panics the core proxy
Cloudflare
- Started: Nov 18, 2025
- Duration: 5h 38m
- Users affected: Not disclosed
- Revenue impact: Not disclosed
- Blast radius: Global (majority of core HTTP traffic through Cloudflare's network, plus dependent services: Workers KV, Access, Turnstile, Dashboard, Email Security)
- Services: bot-management, fl2-proxy, fl-proxy, workers-kv, access, turnstile, dashboard, clickhouse, email-security
Summary
A gradually rolled-out ClickHouse permissions change let user accounts see metadata for the underlying r0 schema in addition to the default schema. A long-standing query in the Bot Management feature-file generator did not filter by database name, so it began returning duplicate rows and producing a feature file roughly twice its expected size. The new FL2 proxy (Rust) preallocates memory for at most 200 features, so when the oversized file arrived, the bot module panicked and the proxy returned HTTP 5xx errors for any traffic depending on bot scoring. The legacy FL proxy did not panic but emitted bot scores of zero, causing false positives for any customer using bot-score-based blocking rules.
Impact
This was Cloudflare's worst outage since 2019. The majority of core HTTP traffic returned 5xx errors for over an hour, with cascading failures across Workers KV, Access (100% authentication failure), Turnstile, the Dashboard (login failures via Turnstile and OIDC), and Email Security. Customers on the legacy FL proxy did not see 5xx errors directly but received false-positive bot blocks because every request was scored as a bot. Cloudflare's status page also happened to go offline at the same time, which briefly led the team to suspect a coordinated attack.
Root cause
A ClickHouse permissions change at 11:05 UTC made implicit access to the underlying r0 tables explicit, so user-level queries against system.tables and system.columns now also returned the r0 schema in addition to the default schema.
The Bot Management feature-file generator used a query against system.columns that filtered by table name but not by database name, so once the change rolled out it began returning duplicate column rows from r0.
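The effect of the missing database filter can be illustrated with a small in-memory model. This is a minimal sketch, not the generator's actual code: the `ColumnRow` type, the table name `features`, and the column names are hypothetical, and plain Rust filtering stands in for the real query against system.columns.

```rust
/// One row of column metadata, loosely modeled on ClickHouse's `system.columns`.
/// The struct and its field values are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
struct ColumnRow {
    database: &'static str,
    table: &'static str,
    name: &'static str,
}

/// Hypothetical metadata visible after the permissions change: the same
/// table now appears under both the `default` and `r0` databases.
fn visible_columns() -> Vec<ColumnRow> {
    let mut rows = Vec::new();
    for database in ["default", "r0"] {
        for name in ["feature_a", "feature_b", "feature_c"] {
            rows.push(ColumnRow { database, table: "features", name });
        }
    }
    rows
}

/// Filters by table name only, as the write-up describes the original query.
/// Every column is returned once per database that exposes the table.
fn columns_by_table<'a>(rows: &'a [ColumnRow], table: &str) -> Vec<&'a ColumnRow> {
    rows.iter().filter(|r| r.table == table).collect()
}

/// Filters by database and table, so a wider access scope cannot add rows.
fn columns_by_database_and_table<'a>(
    rows: &'a [ColumnRow],
    database: &str,
    table: &str,
) -> Vec<&'a ColumnRow> {
    rows.iter()
        .filter(|r| r.database == database && r.table == table)
        .collect()
}
```

With this sample data, `columns_by_table` returns six rows while `columns_by_database_and_table` scoped to `default` returns three, which mirrors the roughly doubled feature file.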
The feature file roughly doubled in size, exceeding the 200-feature memory preallocation cap in the FL2 proxy bot module; the Rust unwrap() on the size check turned the failed Result into a panic.
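The difference between unwrapping the size check and handling its error might look like the sketch below. Only the 200-feature limit comes from the write-up; the one-name-per-line file format, the `LoadError` type, and all function names are assumptions made for illustration, not the FL2 proxy's real code.

```rust
/// Cap on the number of features the module preallocates for (200, per the write-up).
const MAX_FEATURES: usize = 200;

/// Hypothetical error type for a rejected feature file.
#[derive(Debug, PartialEq)]
enum LoadError {
    TooManyFeatures { found: usize, limit: usize },
}

/// Parses a feature file with one feature name per line (an assumed format)
/// and rejects it if it exceeds the cap.
fn load_features(file: &str) -> Result<Vec<String>, LoadError> {
    let names: Vec<String> = file
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();
    if names.len() > MAX_FEATURES {
        return Err(LoadError::TooManyFeatures {
            found: names.len(),
            limit: MAX_FEATURES,
        });
    }
    Ok(names)
}

/// Mirrors the failure mode described above: `unwrap()` turns the `Err`
/// into a panic that takes the calling thread down.
fn load_or_panic(file: &str) -> Vec<String> {
    load_features(file).unwrap()
}

/// Handles the error explicitly and keeps the current feature set
/// when the new file is rejected.
fn load_or_keep(file: &str, current: Vec<String>) -> Vec<String> {
    match load_features(file) {
        Ok(features) => features,
        Err(err) => {
            eprintln!("rejected feature file: {err:?}");
            current
        }
    }
}
```

The cap itself is unchanged in both variants; what differs is whether an oversized file ends the process or is logged and ignored.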
A new feature file was generated every five minutes, and whether it was good or bad depended on which ClickHouse node ran the query. The resulting pattern of repeated recovery and failure initially led the team to suspect a hyperscale DDoS attack rather than an internal bug.
Recovery was slowed by debugging and observability code that consumed large amounts of CPU enriching uncaught errors, increasing latency across the affected proxy.
Resolution
After identifying Bot Management as the source, engineers stopped automatic generation and propagation of new feature files at 14:24 UTC, manually inserted a known-good feature file, and forced a restart of the core proxy. Most traffic was flowing normally by 14:30 UTC. Earlier, at 13:05 UTC, Workers KV had been switched to bypass the core proxy, which stabilized Access. Remaining services were restarted and load-balanced over the next few hours, with all systems back to normal at 17:06 UTC.
Lessons
- Cloudflare-generated configuration files arriving in the data plane should be treated as untrusted input with the same rigor as user-supplied data; a hard cap with no fallback is just a future panic.
- Coreless services that read shared configuration files inherit a single point of failure through the file generation pipeline, even when the runtime path is fully distributed.
- Symptoms that fluctuate (recover, fail, recover) suggest staged rollouts or partial state, not necessarily an attack; teams need an explicit hypothesis check before assuming adversarial cause.
- Status-page hosting is supposed to be independent of production, and during this incident it was, but a coincident status-page outage can derail diagnosis by suggesting a coordinated attack.
- Database queries that omit a filter (here, the database name) will silently change behavior the moment access scope expands; SQL is not type-safe against schema-shape changes.
Action items
- Harden the ingestion of Cloudflare-generated configuration files in the same way as user-generated input, including bounds checks that fail open rather than panic.
- Add more global kill switches for individual proxy modules, including Bot Management.
- Eliminate the ability for core dumps and error reports to overwhelm system resources during cascading failures.
- Review failure modes for error conditions across all core proxy modules and replace unwrap()-style panics with explicit handling.
- Re-evaluate ClickHouse query patterns across the codebase for queries that depend on the previous (narrower) access scope.
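Two of the action items above, a per-module kill switch and keeping a last-known-good configuration, can be combined in one small structure. The sketch below is hypothetical throughout: `BotModule`, its methods, and the toy signal-matching logic are assumptions for illustration and do not describe Cloudflare's proxy.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical per-module state: a kill switch plus the last accepted feature set.
struct BotModule {
    enabled: AtomicBool,
    features: Vec<String>,
}

impl BotModule {
    fn new(features: Vec<String>) -> Self {
        Self { enabled: AtomicBool::new(true), features }
    }

    /// Kill switch: operators flip this to take the module out of the request path.
    fn set_enabled(&self, on: bool) {
        self.enabled.store(on, Ordering::SeqCst);
    }

    /// Accepts an update only if it is within the limit. On rejection the
    /// last-known-good feature set stays in place and the reason is returned.
    fn apply_update(&mut self, candidate: Vec<String>, limit: usize) -> Result<(), String> {
        if candidate.len() > limit {
            return Err(format!(
                "{} features exceeds limit {}",
                candidate.len(),
                limit
            ));
        }
        self.features = candidate;
        Ok(())
    }

    /// Counts request signals that match a known feature (a toy stand-in for scoring).
    /// Returns `None` when the module is disabled, so the caller can skip
    /// bot scoring (fail open) instead of returning an error.
    fn matched_features(&self, request_signals: &[&str]) -> Option<usize> {
        if !self.enabled.load(Ordering::SeqCst) {
            return None;
        }
        let count = request_signals
            .iter()
            .filter(|signal| self.features.iter().any(|f| f.as_str() == **signal))
            .count();
        Some(count)
    }
}
```

Returning `Option` forces the caller to decide what a disabled module means for a request, which keeps the fail-open policy visible at the call site rather than buried in the module.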