SEV-0 · public access

DR exercise cleanup destroys Kafka brokers, leaves partitions leaderless, triggers two-week EU evacuation

Honeycomb · Source

Started: Dec 5, 2025
Duration: 12d
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: EU region: full event ingestion downtime for several hours, then degraded mode (Activity Log only) for two weeks
Services: kafka, retriever, shepherd, ingest, tiered-storage, activity-log

cascading failure · data loss · failover · queue or stream

Summary

On December 5, 2025, Honeycomb ran an annual disaster recovery exercise in the EU production region simulating an availability-zone failure. During the cleanup phase that follows AZ-failure tests, the runbook called for deliberately destroying Kafka brokers. In the production cluster's larger topology, this destroyed brokers across multiple availability zones in a combination that left several partitions leaderless and several internal metadata topics damaged. Recovery extended to December 17 and culminated in a full evacuation to a brand-new Kafka cluster, with deep code changes in Retriever to support offset resets and a coordinated migration involving over half a dozen teams. Activity Log data from December 5 to December 9 was lost.

Impact

All EU event ingestion endpoints were down for multiple hours on December 5, 2025. Most of the subsequent two weeks were spent in a degraded mode in which only Activity Log data was affected. Roughly 2.5% of the cluster's data was permanently lost on a single fully corrupted partition. Activity Log data from December 5 (~6 p.m. ET) to December 9 (~6 p.m. ET) was lost; data after December 9 was salvaged via a snapshot-and-replicate trick.

Root cause

The DR exercise runbook included a cleanup step that purposely destroys Kafka brokers; in production's larger topology, this could remove all in-sync brokers for some partitions, while in the smaller pre-prod clusters with three nodes the same operation could not produce that outcome.

Pre-production exercises had been run four times in clusters too small to reproduce the failure mode, so the structural risk remained hidden during validation.
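
The topology argument above can be made concrete with a small check. This is a minimal sketch, not Honeycomb's tooling: the function name, the partition names, and both cluster layouts are hypothetical, chosen only to show why a three-node cluster can pass a test that a larger cluster fails.

```python
from typing import Dict, Iterable, List, Set


def partitions_without_isr(
    isr_by_partition: Dict[str, Set[int]],
    destroyed_brokers: Iterable[int],
) -> List[str]:
    """Return partitions whose whole in-sync replica set sits on destroyed brokers."""
    destroyed = set(destroyed_brokers)
    return sorted(
        partition
        for partition, isr in isr_by_partition.items()
        if isr <= destroyed  # every in-sync replica would be gone
    )


# Hypothetical 3-broker pre-prod layout: every partition is in sync on all
# three brokers, so destroying two brokers still leaves one replica everywhere.
small = {"events-0": {1, 2, 3}, "events-1": {1, 2, 3}}

# Hypothetical 6-broker production-like layout: replicas are spread out, so
# some partitions keep all in-sync replicas on the brokers being destroyed.
large = {
    "events-0": {1, 2, 3},
    "events-1": {2, 3},
    "events-2": {4, 5, 6},
    "events-3": {1, 5},
}

print(partitions_without_isr(small, [2, 3]))  # []
print(partitions_without_isr(large, [2, 3]))  # ['events-1']
```

Running a check of this kind against the real replica assignment before a destructive cleanup step would be one way to surface the risk that the smaller clusters could not reproduce.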

When partitions went leaderless, only dirty (unclean) leader elections were possible, and these can roll back offsets. Retriever, by design, refuses to process data when offsets unexpectedly reset, to avoid corrupting stored data; this halted ingest for the affected partitions.
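
The refuse-on-reset behavior described above can be illustrated with a small guard. This is a hypothetical sketch of the general technique, not Retriever's actual code: the class and method names are invented, and the explicit `reset` method is an assumed operator-initiated escape hatch.

```python
from dataclasses import dataclass, field
from typing import Dict


class OffsetRegressionError(RuntimeError):
    """Raised when a partition's offset moves backwards unexpectedly."""


@dataclass
class OffsetGuard:
    # Next offset expected per partition (hypothetical in-memory tracking).
    next_offset: Dict[int, int] = field(default_factory=dict)

    def check(self, partition: int, offset: int) -> None:
        """Accept an offset only if it does not go backwards."""
        expected = self.next_offset.get(partition, 0)
        if offset < expected:
            raise OffsetRegressionError(
                f"partition {partition}: got offset {offset}, expected >= {expected}"
            )
        self.next_offset[partition] = offset + 1

    def reset(self, partition: int, new_offset: int = 0) -> None:
        """Explicit, operator-initiated reset of the tracked position."""
        self.next_offset[partition] = new_offset


guard = OffsetGuard()
guard.check(0, 100)
guard.check(0, 101)
try:
    guard.check(0, 40)  # offsets rolled back, e.g. after an unclean election
except OffsetRegressionError as err:
    print(f"refusing to process: {err}")
```

The trade-off is visible in the sketch: the guard protects stored data from being overwritten by replayed offsets, but without a deliberate reset path the consumer stays stopped until someone intervenes.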

Internal Kafka metadata topics (`_confluent-tier-state`, `__consumer_offsets`) were damaged simultaneously, blocking tiered storage offload and causing disk usage on brokers to climb dangerously fast.

An attempt to delete and recreate damaged Activity Log topics put the Kafka cluster into a state where administrative operations (describe, reassign, modify) all timed out, leaving the cluster effectively unmanageable.

Documentation fragmentation: prior RFCs and high-level workaround plans for similar conditions existed but were either unknown to most responders or undiscoverable through internal search.

Bumpable-roadmap dynamics: long-running internal projects that would have eased these conditions had been deferred over time in favor of clearer, more urgent customer-facing work, even when both were judged important.

Resolution

Over twelve days, the team executed a full evacuation to a brand-new Kafka cluster. Retriever was modified to support resetting its offset tracking, feature flags were added to swing readers and writers between clusters, and a dress rehearsal in EU pre-prod validated the migration. The actual production migration ran in roughly two hours with at most one hour of delayed signal. Earlier intermediate fixes included repairing damaged metadata topic partitions to unstick SLO processing, turning off Confluent Tiered Storage to immediately free disk, marking damaged Retriever partitions read-only, and using an AWS replica-snapshot trick to extend MySQL binlog retention long enough to salvage most of the Activity Log data.
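
One way to picture the feature flags that swing readers and writers between clusters is a per-environment lookup. This is a minimal sketch under stated assumptions: the flag keys, cluster names, and bootstrap addresses are all hypothetical, and the source does not describe how Honeycomb's flags are actually structured.

```python
from dataclasses import dataclass
from typing import Mapping

# Hypothetical cluster names and bootstrap addresses, for illustration only.
BOOTSTRAP_SERVERS = {
    "old": "kafka-old.example.internal:9092",
    "new": "kafka-new.example.internal:9092",
}


@dataclass(frozen=True)
class ClusterRoute:
    read_from: str
    write_to: str


def resolve_route(flags: Mapping[str, str], environment: str) -> ClusterRoute:
    """Pick read and write clusters for one environment from its flags.

    Flags are keyed per environment so that flipping one environment does
    not affect any other. Missing flags fall back to the old cluster.
    """
    read = flags.get(f"{environment}.kafka.read_cluster", "old")
    write = flags.get(f"{environment}.kafka.write_cluster", "old")
    for name in (read, write):
        if name not in BOOTSTRAP_SERVERS:
            raise ValueError(f"unknown Kafka cluster name: {name!r}")
    return ClusterRoute(BOOTSTRAP_SERVERS[read], BOOTSTRAP_SERVERS[write])


# Swing only the pre-prod writers to the new cluster; production is untouched.
flags = {"eu-preprod.kafka.write_cluster": "new"}
print(resolve_route(flags, "eu-preprod"))
print(resolve_route(flags, "eu-prod"))
```

Keeping the read and write flags separate lets each side move independently, which is what makes a dress rehearsal in one environment possible without touching another.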

Timeline

17:00 MITIG
Annual DR exercise simulating an AZ failure runs successfully in the EU production region. Cleanup phase begins per runbook.
kafka
18:00 DETECT
Cleanup step destroys Kafka brokers. In the production topology, multiple partitions go leaderless and metadata topics are damaged. Ingest fails for many EU teams.
kafka
19:00 INVEST
Roughly an hour in, the team identifies which customers and environments need partition reassignment. Shepherd is in a crashloop spiral due to memory buffering for unwritable partitions.
shepherd
20:00 MITIG
Damaged partitions marked read-only on the producer side, fixing Shepherd's crashloop. Reassignments restore baseline ingest.
shepherd
23:00 MITIG
Kafka autobalancer turned off; manual reassignment plan started for the night. About one third of partitions impacted; one partition has full data loss (~2.5% of cluster).
kafka
08:00 DETECT
Saturday morning: Kafka rebalancing has not progressed and disk usage is climbing 5% per hour. A 95%-disk shutdown plan is prepared; ~5 hours of runway remain.
kafka
13:00 MITIG
Two damaged metadata topic partitions (`_confluent-tier-state`, `__consumer_offsets`) repaired. SLO processing immediately catches up on ~18h of late data.
kafka
15:00 MITIG
With ~5% disk free, ingest is turned off at the ALB layer. Confluent Tiered Storage is then disabled to unstick disk cleanup; disk space is instantly restored. Ingest reopened; customer impact ends.
kafka
12:00 INVEST
Monday: most response now focused on extending MySQL binlog retention for Activity Log replication and salvaging the broken Retriever partition.
mysql-rds
15:00 MITIG
Wednesday: an attempt to delete and recreate Activity Log topic partitions times out. Suddenly, Kafka brokers can no longer perform any administrative operations beyond listing topics. Cluster is unmanageable.
kafka
18:00 INVEST
Two parallel workstreams started: one to attempt a low-risk cluster salvage via ZooKeeper surgery; one to plan a full evacuation to a fresh cluster. Directors aligned across teams.
kafka
12:00 MITIG
Multiple evacuation plans drafted with five fallback scenarios. Storage team works on enabling Retriever to handle offset resets safely.
retriever
18:00 MITIG
Friday: Retriever offset-reset mechanism works in test. Platform team makes feature flags ready and verifies a second Kafka cluster can run alongside the first.
retriever
18:00 MITIG
Monday: full dress rehearsal of the migration in an EU pre-prod cluster involving over half a dozen teams. Plan completes in roughly four hours with manageable friction.
kafka
16:00 MITIG
Tuesday: production migration runs for about two hours with at most one hour of delayed signal. New cluster live, old cluster cleaned up.
kafka
18:00 RESOLV
Migration complete. Activity Log data from Dec 9 onward salvaged via a snapshot-and-replicate trick; data from Dec 5 to Dec 9 lost.
kafka
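
The Saturday 08:00 entry in the timeline rests on simple runway arithmetic: disk climbing 5% per hour against a 95% shutdown threshold, with about 5 hours left. A minimal sketch of that calculation follows. The function name is hypothetical, and the 70% starting point is back-calculated from the timeline's figures, not a number reported in the source.

```python
def hours_until_threshold(
    used_pct: float,
    growth_pct_per_hour: float,
    threshold_pct: float = 95.0,
) -> float:
    """Estimate hours until disk usage reaches the shutdown threshold.

    Assumes linear growth, which is a simplification.
    """
    if growth_pct_per_hour <= 0:
        return float("inf")  # usage is flat or shrinking; no deadline
    return max(0.0, (threshold_pct - used_pct) / growth_pct_per_hour)


# Assumed 70% used, inferred from "5% per hour" and "~5 hours of runway".
print(hours_until_threshold(used_pct=70.0, growth_pct_per_hour=5.0))  # 5.0
```

An estimate like this is only as good as the linear-growth assumption, so it is best treated as a deadline for having a shutdown plan ready, not as a prediction.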

Attribution

Honeycomb

By Engineering

Published Dec 17, 2025

Lessons

Pre-production validation of disaster exercises must include topology-equivalent clusters; if pre-prod's structure is fundamentally different from prod's, a clean run there proves nothing about the operation in prod.
Documents that exist but cannot be found are not documents that exist; the test of organizational memory is search, not authorship.
Bumpable work tends to get bumped indefinitely; making this dynamic visible at the leadership layer is a structural problem, not a discipline problem.
Deep systems knowledge often lives with people who no longer own the affected components; team boundaries silo the social side faster than they silo the technical side, producing drift that surfaces only during incidents.
Retriever's design choice to refuse data on unexpected offset resets prevents corruption but converts a Kafka recovery into a Retriever rebuild; either-or designs of this kind deserve revisiting at higher load.
Communicating internal severity accurately to customers is hard when most of the impact lands on a beta feature; over-disclosure can confuse customers about what is actually breaking for them.
Long-tail incidents take a sustained human toll that outlasts the technical recovery; rotating responders, calling in directors to align organizational effort, and sustaining the response are themselves engineering activities.

Action items

Full evacuation completed to a brand-new Kafka cluster in the EU region.
Retriever updated to support offset reset as a one-time operation, allowing safer cluster migrations in the future.
Feature flags added so Kafka readers and writers can swing between clusters without interfering with other clusters or environments.
Tiered storage state cleared and rebuilt; old S3 data deleted before turning storage tiers back on.
Activity Log data partially salvaged using AWS support's snapshot-and-replicate trick; data from Dec 5-9 lost permanently.
Investing in better documentation discoverability, since multiple existing documents about similar failure modes were not found in time.
Planning to make team ownership maps and cross-cutting expertise more visible to address sociotechnical drift.
Reviewing the Eisenhower-matrix dynamics that lead bumpable internal work to be deferred indefinitely.