SEV-1public access

Kafka 0.10 controller bug after ZooKeeper network partition causes write loss across four partitions

Honeycomb · Source

Started: Oct 17, 2017
Duration: 6h
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: 33% of customers actively sending data; 4 of N Kafka partitions
Services: kafka, zookeeper, ingest, retriever

cascading failuredata lossnetworkqueue or stream

Join the waitlist

Aftermath helps you ship structured post-mortems like this one for your own incidents. Encore keeps narrative, timeline, lessons, and action items in one place so the document stays useful after the incident is closed. Join the waitlist on the homepage when you want that workflow for your organization.

Join the waitlist

Summary

A ZooKeeper network partition the night before silently left the Kafka cluster in a fragile state. Around 6 a.m. PDT the next morning, the Kafka controller did something that exposed the latent damage and end-to-end checks began failing on four partitions. Engineers spent hours debugging a split-brain condition between restarted and un-restarted brokers, with offsets on data nodes drifting ahead of acknowledged offsets because Kafka kept accepting writes ZooKeeper had not acknowledged. Recovery required restarting all brokers and manually resetting offsets on data nodes. The underlying issue was attributed to Kafka bugs fixed in 0.10.2.1.

Impact

About a third of customers actively sending telemetry experienced partial write loss between 6:03 and 10:45 a.m. PDT, with most losing under half their writes. A majority of users also experienced roughly 30 minutes of partial read unavailability between 10:50 and 11:20 a.m. PDT.

Root cause

A ZooKeeper network partition the previous night left the cluster in a degraded state that did not surface symptoms until the controller thread acted on it the next morning.

Known Kafka bugs in the version Honeycomb was running (resolved in 0.10.2.1) caused brokers to accept writes that ZooKeeper had not acknowledged, allowing data-node offsets to advance past acknowledged offsets.

Restarting brokers without preserving quorum awareness produced a split-brain: restarted nodes only saw other restarted nodes, while unrestarted nodes only saw each other.

The team was operating Kafka at first-incident depth of expertise, so initial recovery moves were overly cautious and explored several red herrings down ZooKeeper paths.

Resolution

Engineers eventually restarted all Kafka nodes, then manually reset offsets on the data nodes and restarted or bootstrapped them so they aligned with what Kafka had actually durably committed. Service was fully restored by approximately 12:03 p.m. PDT.

Timeline

23:00MITIG
ZooKeeper cluster experiences a network partition overnight. No symptoms surface yet, but the cluster is left in a fragile state.
zookeeper
13:03DETECT
Around 6:03 a.m. PDT, partial write loss begins on four Kafka partitions (5, 6, 32, 33).
kafka
13:15DETECT
PagerDuty alerts at 6:15 a.m. PDT that end-to-end production checks are failing on those partitions.
kafka
13:30INVEST
Engineers find a Kafka node with effectively no inbound network traffic and confirm it is still listed in-sync for the failing partitions.
kafka
15:00INVEST
An attempt to bring up a new broker fails: it can only see the bad brokers, not the good ones. Restarting nodes makes things worse.
kafka
16:15INVEST
Around 9:15 a.m. PDT engineers realize a split brain has formed between restarted and un-restarted nodes.
kafka
17:45MITIG
Decision made to restart all Kafka nodes to break the split brain.
kafka
17:50DETECT
Between 10:45 and 11:20 a.m. PDT, broader ingest disruption is visible as the world is restarted; reads also degrade.
kafka
18:30MITIG
Data-node offsets are found to be ahead of acknowledged Kafka offsets. Engineers manually reset offsets on data nodes and restart or bootstrap them.
retriever
19:03RESOLV
By 12:03 p.m. PDT all partitions are healthy and writes are restored.
kafka

Attribution

Honeycomb

By Engineering

Published Oct 17, 2017

View original source

Lessons

A network partition in a coordination service can leave a downstream system in a latent broken state for hours before symptoms surface, decoupling the trigger from the visible incident.
When recovering a stateful cluster under pressure, every restart must preserve quorum visibility or the cluster can fracture into a split brain that is harder to resolve than the original problem.
First-time experience with a class of failure inflates time-to-resolution; investing in deliberate runbooks and exercises for unfamiliar failure modes pays back during real incidents.
Acknowledged offsets and durably-written offsets can diverge silently when the broker layer disagrees with the coordination layer, and assuming they match is a trap during recovery.

Action items

Upgrade the Kafka cluster to a version that includes the relevant controller fixes (0.10.2.1 or later).
Improve instrumentation for Kafka and ZooKeeper inside the team's own observability environment.
Get the Retriever Kafka partition assignment exposed via EC2 instance tags; write shell helpers to print Retriever-to-partition mappings.
Add instrumentation for failed writes to Kafka, including consumption of the Sarama response queue.
Change how data nodes handle invariants so that an offset ahead of the broker no longer requires manual intervention.
Document the bash snippets and command lines used during recovery into a Retriever and Kafka production runbook.