Kafka 0.10 controller bug after ZooKeeper network partition causes write loss across four partitions
Honeycomb · Source
- Started
- Oct 17, 2017
- Duration
- 6h
- Users affected
- Not disclosed
- Revenue impact
- Not disclosed
- Blast radius
- 33% of customers actively sending data; 4 of N Kafka partitions
- Services
- kafka, zookeeper, ingest, retriever
Join the waitlist
Aftermath helps you ship structured post-mortems like this one for your own incidents. Encore keeps narrative, timeline, lessons, and action items in one place so the document stays useful after the incident is closed. Join the waitlist on the homepage when you want that workflow for your organization.
Summary
A ZooKeeper network partition the night before silently left the Kafka cluster in a fragile state. Around 6 a.m. PDT the next morning, the Kafka controller did something that exposed the latent damage and end-to-end checks began failing on four partitions. Engineers spent hours debugging a split-brain condition between restarted and un-restarted brokers, with offsets on data nodes drifting ahead of acknowledged offsets because Kafka kept accepting writes ZooKeeper had not acknowledged. Recovery required restarting all brokers and manually resetting offsets on data nodes. The underlying issue was attributed to Kafka bugs fixed in 0.10.2.1.
Impact
About a third of customers actively sending telemetry experienced partial write loss between 6:03 and 10:45 a.m. PDT, with most losing under half their writes. A majority of users also experienced roughly 30 minutes of partial read unavailability between 10:50 and 11:20 a.m. PDT.
Root cause
A ZooKeeper network partition the previous night left the cluster in a degraded state that did not surface symptoms until the controller thread acted on it the next morning.
Known Kafka bugs in the version Honeycomb was running (resolved in 0.10.2.1) caused brokers to accept writes that ZooKeeper had not acknowledged, allowing data-node offsets to advance past acknowledged offsets.
Restarting brokers without preserving quorum awareness produced a split-brain: restarted nodes only saw other restarted nodes, while unrestarted nodes only saw each other.
The team was operating Kafka at first-incident depth of expertise, so initial recovery moves were overly cautious and explored several red herrings down ZooKeeper paths.
Resolution
Engineers eventually restarted all Kafka nodes, then manually reset offsets on the data nodes and restarted or bootstrapped them so they aligned with what Kafka had actually durably committed. Service was fully restored by approximately 12:03 p.m. PDT.
Lessons
- A network partition in a coordination service can leave a downstream system in a latent broken state for hours before symptoms surface, decoupling the trigger from the visible incident.
- When recovering a stateful cluster under pressure, every restart must preserve quorum visibility or the cluster can fracture into a split brain that is harder to resolve than the original problem.
- First-time experience with a class of failure inflates time-to-resolution; investing in deliberate runbooks and exercises for unfamiliar failure modes pays back during real incidents.
- Acknowledged offsets and durably-written offsets can diverge silently when the broker layer disagrees with the coordination layer, and assuming they match is a trap during recovery.
Action items
- Upgrade the Kafka cluster to a version that includes the relevant controller fixes (0.10.2.1 or later).
- Improve instrumentation for Kafka and ZooKeeper inside the team's own observability environment.
- Get the Retriever Kafka partition assignment exposed via EC2 instance tags; write shell helpers to print Retriever-to-partition mappings.
- Add instrumentation for failed writes to Kafka, including consumption of the Sarama response queue.
- Change how data nodes handle invariants so that an offset ahead of the broker no longer requires manual intervention.
- Document the bash snippets and command lines used during recovery into a Retriever and Kafka production runbook.