SEV-0 · public access

ClickHouse permissions change doubles Bot Management feature file size and panics the core proxy

Cloudflare · Source

Started: Nov 18, 2025
Duration: 5h 38m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: Global: majority of core HTTP traffic through Cloudflare's network, plus dependent services (Workers KV, Access, Turnstile, Dashboard, Email Security)
Services: bot-management, fl2-proxy, fl-proxy, workers-kv, access, turnstile, dashboard, clickhouse, email-security

Tags: configuration change, database, full outage, thundering herd

Summary

A gradually rolled-out ClickHouse permissions change let user accounts see metadata for the underlying r0 schema in addition to the default schema. A long-standing query in the Bot Management feature-file generator did not filter by database name, so it began returning duplicate rows and produced a feature file roughly twice its expected size. The new FL2 proxy (written in Rust) preallocates memory for at most 200 features, so when the oversized file arrived, the bot module panicked and the proxy returned HTTP 5xx errors for any traffic that depended on bot scoring. The legacy FL proxy did not panic but emitted bot scores of zero, causing false positives for customers with bot-score-based blocking rules.

Impact

This was Cloudflare's worst outage since 2019. The majority of core HTTP traffic returned 5xx errors for over an hour, with cascading failures across Workers KV, Access (100% authentication failure), Turnstile, the Dashboard (login failures via Turnstile and OIDC), and Email Security. Customers on the legacy FL proxy did not see 5xx errors directly but received false-positive bot blocks because every request was scored as a bot. By coincidence, Cloudflare's status page also went offline, which briefly led the team to suspect a coordinated attack.

Root cause

A ClickHouse permissions change deployed at 11:05 UTC made previously implicit access to the underlying r0 tables explicit, so user-level queries against system.tables and system.columns began returning the r0 schema in addition to the default schema.

The Bot Management feature-file generator used a query against system.columns that filtered by table name but not by database name, so once the change rolled out it began returning duplicate column rows from r0.
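
The effect of the missing database filter can be illustrated with a small in-memory model. This is a minimal sketch, not the generator's actual code: the ColumnRow struct only mirrors the general shape of system.columns metadata, and the table name bot_features and the feature names are hypothetical.

```rust
use std::collections::BTreeSet;

/// One row of column metadata, loosely modelled on `system.columns`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ColumnRow {
    pub database: String,
    pub table: String,
    pub name: String,
}

/// Filters by table name only, which is the pattern described above.
/// Once a second database exposes the same table, every column appears twice.
pub fn features_by_table_only(rows: &[ColumnRow], table: &str) -> Vec<String> {
    rows.iter()
        .filter(|r| r.table == table)
        .map(|r| r.name.clone())
        .collect()
}

/// Filters by database and table, then de-duplicates as a second line of defense.
pub fn features_scoped(rows: &[ColumnRow], database: &str, table: &str) -> Vec<String> {
    let unique: BTreeSet<String> = rows
        .iter()
        .filter(|r| r.database == database && r.table == table)
        .map(|r| r.name.clone())
        .collect();
    unique.into_iter().collect()
}

/// Hypothetical metadata as visible after the permissions change:
/// the same columns are listed under both `default` and `r0`.
pub fn sample_rows() -> Vec<ColumnRow> {
    let mut rows = Vec::new();
    for db in ["default", "r0"] {
        for col in ["feature_a", "feature_b", "feature_c"] {
            rows.push(ColumnRow {
                database: db.to_string(),
                table: "bot_features".to_string(),
                name: col.to_string(),
            });
        }
    }
    rows
}
```

With the sample rows, the table-only filter yields six names while the scoped filter yields three, which is the doubling the root cause describes.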

The feature file roughly doubled in size, exceeding the 200-feature memory preallocation cap in the FL2 proxy bot module; the Rust unwrap() on the size check turned the failed Result into a panic.
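
One way to avoid the panic is to return the size violation as an error and keep serving with the last known-good feature set. The sketch below assumes a hypothetical file format (one feature name per line) and hypothetical names (FeatureSet, parse_features, reload); only the 200-feature cap comes from the write-up.

```rust
/// Cap from the write-up: the bot module preallocates for 200 features.
pub const MAX_FEATURES: usize = 200;

#[derive(Debug, PartialEq, Eq)]
pub enum LoadError {
    TooManyFeatures { got: usize, max: usize },
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FeatureSet {
    pub names: Vec<String>,
}

/// Parses a feature file (assumed format: one feature name per line)
/// and rejects it when it exceeds the cap, instead of panicking.
pub fn parse_features(raw: &str) -> Result<FeatureSet, LoadError> {
    let names: Vec<String> = raw
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();
    if names.len() > MAX_FEATURES {
        return Err(LoadError::TooManyFeatures {
            got: names.len(),
            max: MAX_FEATURES,
        });
    }
    Ok(FeatureSet { names })
}

/// Applies a new file if it is valid; otherwise keeps the current set and
/// hands the error back so the caller can log or alert on it.
pub fn reload(current: FeatureSet, raw: &str) -> (FeatureSet, Option<LoadError>) {
    match parse_features(raw) {
        Ok(new_set) => (new_set, None),
        Err(err) => (current, Some(err)),
    }
}
```

The design choice here is that a rejected file degrades freshness rather than availability: the proxy keeps scoring with older features while the error is surfaced.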

The fluctuating recovery pattern (good vs bad files generated every five minutes depending on which ClickHouse node ran the query) initially led the team to suspect a hyperscale DDoS attack rather than an internal bug.

Recovery was slowed by debugging and observability code that consumed large amounts of CPU enriching uncaught errors, increasing latency across the affected proxy.
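
A common way to keep error enrichment from competing with request handling is to give it a fixed budget per time window. The sketch below is an assumption about how such a limit could look, not a description of Cloudflare's fix; EnrichmentBudget, report_error, and the "[enriched]" marker are hypothetical, and the time window is passed in explicitly to keep the example deterministic.

```rust
/// Allows at most `limit` expensive enrichments per time window.
pub struct EnrichmentBudget {
    limit: u32,
    window: u64,
    used: u32,
}

impl EnrichmentBudget {
    pub fn new(limit: u32) -> Self {
        EnrichmentBudget { limit, window: 0, used: 0 }
    }

    /// `now_window` is a caller-supplied window index, for example
    /// whole seconds since process start. A new index resets the budget.
    pub fn try_acquire(&mut self, now_window: u64) -> bool {
        if now_window != self.window {
            self.window = now_window;
            self.used = 0;
        }
        if self.used < self.limit {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

/// Reports an error, attaching the costly debugging context only while
/// the budget for the current window has not been spent.
pub fn report_error(budget: &mut EnrichmentBudget, now_window: u64, message: &str) -> String {
    if budget.try_acquire(now_window) {
        // Stand-in for expensive work such as collecting extra debug context.
        format!("error: {} [enriched]", message)
    } else {
        format!("error: {}", message)
    }
}
```

During an error storm, most reports then take the cheap path, so observability cost stays bounded regardless of the error rate.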

Resolution

After identifying Bot Management as the source, engineers stopped automatic generation and propagation of new feature files at 14:24 UTC, manually inserted a known-good feature file, and forced a restart of the core proxy. Most traffic was flowing normally by 14:30 UTC. Earlier, at 13:05 UTC, Workers KV and Access had been switched to bypass the core proxy, which reduced the impact on dependent services. The remaining services were restarted and load-balanced over the next few hours, and all systems were back to normal at 17:06 UTC.

Timeline

11:05 UTC · TRIGGER
ClickHouse access control change is deployed, gradually granting users explicit access to underlying r0 tables.
clickhouse
11:28 UTC · DETECT
Deployment reaches customer environments; first 5xx errors observed on customer HTTP traffic as oversized feature files reach FL2 proxies.
bot-management
11:31 UTC · DETECT
Automated test detects the issue; manual investigation begins one minute later.
fl2-proxy
11:35 UTC · INVEST
Incident call is created; team initially focuses on degraded Workers KV response rates.
workers-kv
13:05 UTC · MITIG
Workers KV and Access bypass the core proxy by falling back to a prior version, reducing impact on dependent services.
workers-kv
13:37 UTC · INVEST
Team becomes confident that the Bot Management feature file is the trigger; multiple workstreams begin in parallel, including a known-good file restore.
bot-management
14:24 UTC · MITIG
Automatic generation and propagation of new feature files is stopped; the test of restoring the previous file completes successfully.
bot-management
14:30 UTC · MITIG
Correct feature file is deployed globally; most services begin operating normally.
bot-management
17:06 UTC · RESOLV
All downstream services restarted and fully operational; impact ends.
workers-kv

Attribution

Cloudflare

By Bot Management / Core Proxy

Published Nov 18, 2025

Lessons

Cloudflare-generated configuration files arriving in the data plane should be treated as untrusted input with the same rigor as user-supplied data; a hard cap with no fallback is just a future panic.
Coreless services that read shared configuration files inherit a single point of failure through the file generation pipeline, even when the runtime path is fully distributed.
Symptoms that fluctuate (recover, fail, recover) suggest staged rollouts or partial state, not necessarily an attack; teams need an explicit hypothesis check before assuming adversarial cause.
The status page is hosted independently of production, and its failure during this incident was unrelated; even so, a coincident status-page outage can derail diagnosis by suggesting a coordinated attack.
Database queries that omit a filter (here, the database name) will silently change behavior the moment access scope expands; SQL is not type-safe against schema-shape changes.
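
Treating generated configuration as untrusted input can also start on the producer side, before a file is propagated. The following is a sketch under stated assumptions: the Rejection variants, the validate_before_publish function, and the 50% growth threshold are hypothetical choices for illustration, not checks taken from the write-up.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq, Eq)]
pub enum Rejection {
    Empty,
    OverCap { got: usize, cap: usize },
    Duplicate(String),
    SuddenGrowth { previous: usize, got: usize },
}

/// Sanity-checks a candidate feature list before it is published.
/// `previous_len` is the size of the last published list; `cap` is the
/// consumer-side limit (200 in the incident described above).
pub fn validate_before_publish(
    candidate: &[&str],
    previous_len: usize,
    cap: usize,
) -> Result<(), Rejection> {
    if candidate.is_empty() {
        return Err(Rejection::Empty);
    }
    if candidate.len() > cap {
        return Err(Rejection::OverCap { got: candidate.len(), cap });
    }
    // Duplicate names are the signature of the missing database filter.
    let mut seen = HashSet::new();
    for name in candidate {
        if !seen.insert(*name) {
            return Err(Rejection::Duplicate((*name).to_string()));
        }
    }
    // Assumed threshold: reject lists more than 50% larger than the last one.
    if previous_len > 0 && candidate.len() * 2 > previous_len * 3 {
        return Err(Rejection::SuddenGrowth {
            previous: previous_len,
            got: candidate.len(),
        });
    }
    Ok(())
}
```

A doubled file made of repeated names fails the duplicate check even when it is still under the cap, so the problem is caught at generation time rather than in the data plane.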

Action items

Harden the ingestion of Cloudflare-generated configuration files in the same way as user-generated input, including bounds checks that fail open rather than panic.
Add more global kill switches for individual proxy modules, including Bot Management.
Eliminate the ability for core dumps and error reports to overwhelm system resources during cascading failures.
Review failure modes for error conditions across all core proxy modules and replace unwrap()-style panics with explicit handling.
Re-evaluate ClickHouse query patterns across the codebase for queries that depend on the previous (narrower) access scope.
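
A per-module kill switch, as called for above, can be as small as an atomic flag checked on the request path. This is a minimal sketch with hypothetical names (KillSwitch, bot_score); how such a flag would be distributed across a global fleet is out of scope here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// A runtime flag that lets operators bypass one proxy module.
pub struct KillSwitch {
    enabled: AtomicBool,
}

impl KillSwitch {
    /// Modules start enabled.
    pub const fn new() -> Self {
        KillSwitch { enabled: AtomicBool::new(true) }
    }

    pub fn disable(&self) {
        self.enabled.store(false, Ordering::Release);
    }

    pub fn enable(&self) {
        self.enabled.store(true, Ordering::Release);
    }

    pub fn is_enabled(&self) -> bool {
        self.enabled.load(Ordering::Acquire)
    }
}

/// Hypothetical scoring hook. When the module is switched off it returns
/// `None`, so callers can skip bot rules instead of treating a missing
/// score as a score of zero.
pub fn bot_score(switch: &KillSwitch, compute: impl Fn() -> u8) -> Option<u8> {
    if switch.is_enabled() {
        Some(compute())
    } else {
        None
    }
}
```

Returning an Option rather than a default score keeps "module disabled" distinguishable from "request looks like a bot", which is the confusion that produced false positives on the legacy proxy.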