DR exercise cleanup destroys Kafka brokers, leaves partitions leaderless, triggers two-week EU evacuation
Honeycomb
- Started: Dec 5, 2025
- Duration: 12d
- Users affected: Not disclosed
- Revenue impact: Not disclosed
- Blast radius: EU region. Full event ingestion downtime for several hours, then degraded mode (Activity Log only) for two weeks
- Services: kafka, retriever, shepherd, ingest, tiered-storage, activity-log
Summary
On December 5, 2025, Honeycomb ran an annual disaster recovery exercise in the EU production region, simulating an availability-zone failure. The cleanup phase that follows AZ-failure tests calls for deliberately destroying Kafka brokers. In the production cluster's larger topology, that step killed brokers across multiple availability zones in a combination that left several partitions leaderless and damaged several internal metadata topics. Recovery ran until December 17 and ended with a full evacuation to a brand-new Kafka cluster, which required deep code changes in Retriever to support offset resets and a coordinated migration involving more than half a dozen teams. Activity Log data from December 5 to December 9 was lost.
Impact
All EU event ingestion endpoints were down for multiple hours on December 5, 2025. Most of the following two weeks were spent in a degraded mode in which only Activity Log data was affected. Roughly 2.5% of the cluster's data, held on a single fully corrupted partition, was permanently lost. Activity Log data from December 5 (~6 p.m. ET) to December 9 (~6 p.m. ET) was lost; data after December 9 was salvaged via a snapshot-and-replicate trick.
Root cause
The DR exercise runbook included a cleanup step that deliberately destroys Kafka brokers. In production's larger topology, this step could remove every in-sync replica for some partitions; in the smaller, three-node pre-prod clusters, the same operation could not produce that outcome.
Pre-production exercises had been run four times in clusters too small to reproduce the failure mode, so the structural risk remained hidden during validation.
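The topology dependence can be illustrated with a small check. This is a minimal sketch with hypothetical layouts: the broker IDs, partition names, in-sync sets, and destroyed sets below are invented for illustration and are not Honeycomb's actual configuration. It only shows that the same kind of cleanup step can be harmless on one layout and leave a partition with no in-sync replica on another.

```python
from typing import Dict, List, Set


def partitions_losing_all_isr(
    isr: Dict[str, Set[int]], destroyed: Set[int]
) -> List[str]:
    """Return partitions whose entire in-sync replica set was destroyed."""
    return sorted(
        partition
        for partition, in_sync in isr.items()
        if in_sync and in_sync <= destroyed
    )


# Hypothetical three-broker cluster: every partition is in sync on all brokers.
small_cluster = {
    "events-0": {1, 2, 3},
    "events-1": {1, 2, 3},
}

# Hypothetical six-broker cluster: in-sync sets cover different broker subsets.
large_cluster = {
    "events-0": {1, 2, 3},
    "events-1": {2, 3, 4},
    "events-2": {4, 5, 6},
    "events-3": {1, 5, 6},
}

# Destroying two of three brokers still leaves one in-sync replica everywhere.
print(partitions_losing_all_isr(small_cluster, {1, 2}))     # []

# Destroying three of six brokers wipes out every replica of one partition.
print(partitions_losing_all_isr(large_cluster, {4, 5, 6}))  # ['events-2']
```

Running a check of this shape against the real replica assignment before a destructive step is one way to see whether a smaller cluster's clean result carries over to a larger one.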
When partitions went leaderless, only dirty leader elections (what Kafka calls unclean leader elections) were possible, and these can roll back offsets. Retriever, by design, refuses to process data when offsets unexpectedly reset, to avoid corrupting stored data; this took ingest down hard for affected partitions.
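The refuse-on-reset behavior described above can be sketched as a guard that remembers the next expected offset per partition. This is an illustrative assumption about the shape of such a check, not Retriever's actual code: `OffsetGuard`, `OffsetRegression`, and `reset` are hypothetical names, with `reset` standing in for the explicit offset reset that the resolution later added.

```python
from typing import Dict


class OffsetRegression(Exception):
    """Raised when a partition's offset moves backwards unexpectedly."""


class OffsetGuard:
    """Hypothetical guard: tracks the next expected offset per partition."""

    def __init__(self) -> None:
        self._next_expected: Dict[int, int] = {}

    def check(self, partition: int, offset: int) -> None:
        """Accept an offset, or refuse it if it is lower than expected."""
        expected = self._next_expected.get(partition)
        if expected is not None and offset < expected:
            # Refusing here protects stored data but halts the partition.
            raise OffsetRegression(
                f"partition {partition}: got offset {offset}, "
                f"expected at least {expected}"
            )
        self._next_expected[partition] = offset + 1

    def reset(self, partition: int) -> None:
        """Explicit operator action: forget tracking for one partition."""
        self._next_expected.pop(partition, None)


guard = OffsetGuard()
guard.check(0, 100)
guard.check(0, 101)
try:
    guard.check(0, 40)  # offsets rolled back, e.g. after a dirty election
except OffsetRegression as err:
    print(f"refused: {err}")

guard.reset(0)       # deliberate reset by an operator
guard.check(0, 40)   # accepted again after the reset
```

The trade-off is visible in the sketch: the guard never ingests data at a rolled-back offset, but without an explicit reset path the partition stays stopped.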
Internal Kafka metadata topics (`_confluent-tier-state`, `__consumer_offsets`) were damaged simultaneously, blocking tiered storage offload and causing disk usage on brokers to climb dangerously fast.
An attempt to delete and recreate damaged Activity Log topics put the Kafka cluster into a state where administrative operations (describe, reassign, modify) all timed out, leaving the cluster effectively unmanageable.
Documentation fragmentation: prior RFCs and high-level workaround plans for similar conditions existed but were either unknown to most responders or undiscoverable through internal search.
Bumpable-roadmap dynamics: long-running internal projects that would have eased these conditions had been deferred over time in favor of clearer, more urgent customer-facing work, even when both were judged important.
Resolution
Over twelve days, the team executed a full evacuation to a brand-new Kafka cluster. Retriever was modified to support resetting its offset tracking, feature flags were added to swing readers and writers between clusters, and a dress rehearsal in EU pre-prod validated the migration. The actual production migration ran in roughly two hours with at most one hour of delayed signal. Earlier intermediate fixes included repairing damaged metadata topic partitions to unstick SLO processing, turning off Confluent Tiered Storage to immediately free disk, marking damaged Retriever partitions read-only, and using an AWS replica-snapshot trick to extend MySQL binlog retention long enough to salvage most of the Activity Log data.
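The flag-driven swing between clusters might look roughly like the sketch below. Everything in it is assumed for illustration: the environment names, the `old`/`new` cluster labels, the `example.internal` addresses, and the helper functions are hypothetical and do not describe Honeycomb's actual flag system. It shows readers and writers flagged independently per environment, so that changing one environment leaves the others untouched.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class ClusterFlags:
    """Hypothetical per-environment flags selecting a Kafka cluster."""
    writer_cluster: str
    reader_cluster: str


# Hypothetical bootstrap addresses for the two clusters.
BOOTSTRAP: Dict[str, str] = {
    "old": "kafka-old.example.internal:9092",
    "new": "kafka-new.example.internal:9092",
}


def swing(
    flags: Dict[str, ClusterFlags], env: str, role: str, cluster: str
) -> Dict[str, ClusterFlags]:
    """Return a copy of the flags with one role in one environment moved."""
    if cluster not in BOOTSTRAP:
        raise ValueError(f"unknown cluster: {cluster}")
    if role not in ("writer", "reader"):
        raise ValueError(f"unknown role: {role}")
    updated = dict(flags)
    updated[env] = replace(flags[env], **{f"{role}_cluster": cluster})
    return updated


def bootstrap_for(flags: Dict[str, ClusterFlags], env: str, role: str) -> str:
    """Resolve the bootstrap address a reader or writer should use."""
    return BOOTSTRAP[getattr(flags[env], f"{role}_cluster")]


flags = {
    "eu-prod": ClusterFlags(writer_cluster="old", reader_cluster="old"),
    "us-prod": ClusterFlags(writer_cluster="old", reader_cluster="old"),
}

# Move only the EU writers; EU readers and the other environment stay put.
flags = swing(flags, "eu-prod", "writer", "new")
print(bootstrap_for(flags, "eu-prod", "writer"))  # kafka-new.example.internal:9092
print(bootstrap_for(flags, "eu-prod", "reader"))  # kafka-old.example.internal:9092
print(bootstrap_for(flags, "us-prod", "writer"))  # kafka-old.example.internal:9092
```

Keeping the reader and writer flags separate lets each side move at its own time, and scoping them per environment is what allows a rehearsal in one environment without touching another.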
Lessons
- Pre-production validation of disaster exercises must include topology-equivalent clusters; if pre-prod's structure is fundamentally different from prod's, a clean run there proves nothing about the operation in prod.
- Documents that exist but cannot be found are not documents that exist; the test of organizational memory is search, not authorship.
- Bumpable work tends to get bumped indefinitely; making this dynamic visible at the leadership layer is a structural problem, not a discipline problem.
- Deep systems knowledge often lives with people who no longer own the affected components; team boundaries silo the social side faster than they silo the technical side, producing drift that surfaces only during incidents.
- Retriever's design choice to refuse data on unexpected offset resets prevents corruption but converts a Kafka recovery into a Retriever rebuild; either-or designs of this kind deserve revisiting at higher load.
- Communicating internal severity accurately to customers is hard when most of the impact lands on a beta feature; over-disclosure can confuse customers about what is actually breaking for them.
- Long-tail incidents take a sustained human toll that outlasts the technical recovery; rotating responders, calling in directors to align organizational effort, and sustaining the response are themselves engineering activities.
Action items
- Full evacuation completed to a brand-new Kafka cluster in the EU region.
- Retriever updated to support offset reset as a one-time operation, allowing safer cluster migrations in the future.
- Feature flags added so Kafka readers and writers can swing between clusters without interfering with other clusters or environments.
- Tiered storage state cleared and rebuilt; old S3 data deleted before turning storage tiers back on.
- Activity Log data partially salvaged using AWS support's snapshot-and-replicate trick; data from Dec 5-9 lost permanently.
- Investing in better documentation discoverability, since multiple existing documents about similar failure modes were not found in time.
- Planning to make team ownership maps and cross-cutting expertise more visible to address sociotechnical drift.
- Reviewing the Eisenhower-matrix dynamics that lead bumpable internal work to be deferred indefinitely.