
42 Things You Can Stop Doing Once ZooKeeper Is Gone from Apache Kafka

Soon, Apache Kafka® will no longer need ZooKeeper! With KIP-500, Kafka will include its own built-in consensus layer, removing the ZooKeeper dependency altogether. The next big milestone in this effort is coming in Apache Kafka 2.8.0, where you will have early access to the new code, the ability to spin up a development version of Kafka without ZooKeeper, and the opportunity to play with the Raft implementation as the distributed consensus algorithm.

The blog post Apache Kafka Needs No Keeper: Removing the Apache ZooKeeper Dependency discusses the problems with external metadata management, the main architectural changes, and how ZooKeeper removal improves Kafka. Ultimately, removing ZooKeeper simplifies overall infrastructure design and operational workflows for your Kafka deployments. We’ve compiled a list of concrete benefits that result from this simplification, with a particular focus on things you will be able to STOP doing. It turns out that there are a lot of things you will be able to stop doing, and we think you won’t miss them.

Once ZooKeeper is removed as a dependency from Kafka, your life gets easier in a few different areas:

Administration

ZooKeeper is an entirely separate system from Kafka, with its own deployment patterns, configuration file syntax, and management tools. If you remove ZooKeeper from Kafka, you no longer have to administer a separate service. Better still, with KIP-500 you can optionally deploy the controller and broker in the same JVM, which simplifies administration further. You can now stop:

  • #1: Learning about and operating yet another distributed system
  • #2: Administering additional servers, VMs, or containers for the ZooKeeper servers
  • #3: Having a separate security configuration for ZooKeeper that differs from the rest of the Kafka cluster
  • #4: Wondering whether it is spelled ZooKeeper or Zookeeper
  • #5: Sighing every time you see it spelled Zookeeper
  • #6: Working with systemctl for yet another Linux service (in contrast, with KIP-500, a controller and broker can optionally run in the same JVM)
  • #7: Maintaining version control on yet another properties file (same condition as above)
  • #8: Sharing the ZooKeeper ensemble between Kafka and non-Kafka services
  • #9: Redesigning topic and key patterns because of Kafka cluster partition limits (in contrast, with KIP-500, Kafka clusters support millions of partitions)
  • #10: Tuning broker timeouts to ZooKeeper—you can now forget about zookeeper.connection.timeout.ms and zookeeper.session.timeout.ms
  • #11: Questioning why you can’t run just Kafka
  • #12: Reading ZooKeeper release notes to learn about new feature availability when you upgrade
  • #13: Updating the ZooKeeper configuration when feature behavior changes, such as needing to explicitly allow Four Letter Words
  • #14: Performing quarterly rolling restarts of the ZooKeeper ensemble as part of patch management best practices
  • #15: Talking about how much you can’t wait for KIP-500

Capacity planning and disk

Storage is the main consideration for ZooKeeper deployments, and without ZooKeeper, you don’t have to deal with ZooKeeper capacity planning, disk issues, and snapshots. You can now stop:

  • #16: Going through the sizing exercise for each ZooKeeper server
  • #17: Co-locating ZooKeeper service on the same server running a Kafka broker. This is generally not recommended unless there is really low load, but we know some people still try!
  • #18: Figuring out the number of servers that should be in a ZooKeeper ensemble, to balance read capacity and write capacity (some Kafka clusters have just as many ZooKeeper nodes as Kafka nodes)
  • #19: Buying solid state drives (SSDs), which are recommended for ZooKeeper servers because of latency sensitivity
  • #20: Discovering during initial installation that the ZooKeeper servers do not have the necessary volume mounts
  • #21: Sharing one directory path between the ZooKeeper transaction log and snapshot directories
  • #22: Setting policy for purging old ZooKeeper data—you can now forget about autopurge.purgeInterval and autopurge.snapRetainCount
  • #23: Migrating ZooKeeper snapshots to newer drives with larger capacity
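
One input to the sizing exercise in #18 is majority-quorum arithmetic: an ensemble can only make progress while a strict majority of its servers is available. The short Python sketch below works through that arithmetic. The function names are illustrative and are not part of ZooKeeper or Kafka, and read/write capacity (the other half of #18) is not modeled here.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can fail while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (1, 3, 4, 5, 7):
        print(f"{n} servers: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Note that four servers tolerate no more failures than three, which is why odd ensemble sizes are the usual choice.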

Performance

One of the key changes with KIP-500 is improved control plane traffic. Without KIP-500, broker operations require reading metadata for all topics and partitions from ZooKeeper, and this can take a long time in a large cluster. With KIP-500 though, brokers store metadata locally in a log and read only the latest set of changes from the controller (similar to how Kafka consumers can read the very end of the log, not the entire log), improving operations from O(N) to O(1). Therefore, these control plane operations have significantly better performance, so you can now stop:

  • #24: Twiddling your thumbs after a controller broker fails, waiting for a new controller to be elected and to rebuild its state from ZooKeeper (in contrast, with KIP-500, a standby controller can be elected and it will already have state)
  • #25: Waiting around if brokers need to restart and then read full state (in contrast, with KIP-500, brokers persist their metadata caches across process restarts)
  • #26: Incurring O(N) cost for topic creation (in contrast, with KIP-500, topic creation no longer requires getting the full list of topics from the ZooKeeper metadata; the cost is just O(1) to add an entry to the metadata events log)
  • #27: Incurring O(N) cost for topic deletion
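
To make the O(N) versus O(1) contrast concrete, here is a toy Python model of a metadata events log. It is a sketch of the idea only, not Kafka's implementation: the class names MetadataLog and Broker, the catch_up method, and the event tuples are all hypothetical. The point it illustrates is that a broker which remembers its offset reads only the events it has not yet seen, rather than reloading every topic.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataLog:
    """Append-only list of metadata events (stand-in for the controller's log)."""
    events: list = field(default_factory=list)

    def append(self, event: tuple) -> None:
        # Creating or deleting a topic is a single append, independent of log size.
        self.events.append(event)

    def read_from(self, offset: int) -> list:
        return self.events[offset:]


@dataclass
class Broker:
    """Hypothetical broker that keeps a local metadata cache and its log offset."""
    topics: set = field(default_factory=set)
    next_offset: int = 0  # assumed to be persisted, so a restart resumes here

    def catch_up(self, log: MetadataLog) -> int:
        """Apply only the events not seen yet and return how many were read."""
        new_events = log.read_from(self.next_offset)
        for kind, topic in new_events:
            if kind == "create":
                self.topics.add(topic)
            elif kind == "delete":
                self.topics.discard(topic)
        self.next_offset += len(new_events)
        return len(new_events)


if __name__ == "__main__":
    log = MetadataLog()
    for i in range(1000):
        log.append(("create", f"topic-{i}"))
    broker = Broker()
    print("initial catch-up read", broker.catch_up(log), "events")
    log.append(("create", "one-more-topic"))
    print("next catch-up read", broker.catch_up(log), "events")
```

In this model the first catch-up reads all 1,000 events, while the second reads only the single new one, regardless of how many topics already exist.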

Monitoring

Any service in your mission-critical deployment must be monitored, and if you are using ZooKeeper, it must be monitored like every other service in your Kafka deployment. So if you remove ZooKeeper, you can stop:

  • #28: Getting lost in busy Grafana/Kibana/Datadog monitoring dashboards that show a bunch of ZooKeeper metrics
  • #29: Figuring out what alert levels to set on which ZooKeeper JMX metrics—you can now forget about NumAliveConnections, OutstandingRequests, AvgRequestLatency, MaxRequestLatency, HeapMemoryUsage, etc.
  • #30: Figuring out what alert levels to set on which Kafka JMX metrics related to ZooKeeper—you can now forget about ZooKeeperDisconnectsPerSec, ZooKeeperExpiresPerSec, ZooKeeperReadOnlyConnectsPerSec, ZooKeeperSyncConnectsPerSec, ZooKeeperAuthFailuresPerSec, ZooKeeperSaslAuthenticationsPerSec, etc.
  • #31: Answering late night pages when ZooKeeper servers go down
  • #32: Getting alerts when the disk utilization on ZooKeeper servers goes above a configured threshold
  • #33: Setting up log management for an extra log file
  • #34: Reading ZooKeeper transactions log and snapshots with unique formatters to investigate issues—you can now forget about org.apache.zookeeper.server.LogFormatter and org.apache.zookeeper.server.SnapshotFormatter

Troubleshooting

If issues emerge in your Kafka deployment, ZooKeeper creates an added element that requires investigation. Without ZooKeeper, troubleshooting can focus on the core components, so you can now stop:

  • #35: Having the /var/log/messages file fill up from verbose logs during network outages
  • #36: Troubleshooting IP connectivity issues between the ZooKeeper ensemble and Kafka clients, if you haven’t yet migrated your clients and tools away from ZooKeeper
  • #37: Dealing with problems related to divergent state between the controller and ZooKeeper (in contrast, with KIP-500, rather than pushing notifications to brokers, brokers consume all events, in order, from the metadata events log)
  • #38: Troubleshooting issues when it turns out that the Kafka cluster is configured to connect to the wrong ZooKeeper ensemble (in contrast, with KIP-500, this situation is less likely to arise when the broker and controller are co-located)
  • #39: Encountering and resolving surprise production-impacting issues when upgrading ZooKeeper
  • #40: Sweating when ZooKeeper doesn’t start after an upgrade, such as in the case of missing snapshot files
  • #41: Trying to remember where in the hierarchical tree structure a certain znode exists
  • #42: Digging through runbooks for the command to find out which ZooKeeper server is the leader when your host doesn’t have nc installed, perhaps due to enterprise policy (hint: echo srvr | (exec 3<>/dev/tcp/zk-host/2181; cat >&3; cat <&3; exec 3<&-) | grep -i mode)
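
For comparison, the hint in #42 can also be written in Python using only the standard library. This is a minimal sketch under a few assumptions: the helper names four_letter_word and parse_mode are made up for illustration, the srvr command is assumed to be allowed on the server (see #13), and the reply is assumed to contain a "Mode:" line, which is what the grep in the shell hint looks for.

```python
import socket


def four_letter_word(host: str, port: int, command: str = "srvr",
                     timeout: float = 5.0) -> str:
    """Send a ZooKeeper four-letter-word command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply: str):
    """Return the value of the 'Mode:' line, or None if there is none."""
    for line in reply.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "mode":
            return value.strip()
    return None
```

With a reachable server, parse_mode(four_letter_word("zk-host", 2181)) returns the value reported on the Mode line; zk-host and 2181 are the placeholder host and port from the hint above.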

Next steps

Even though KIP-500 isn’t fully implemented yet, right now you can swing your tools from getting metadata from ZooKeeper over to getting metadata from the brokers instead, as described in the blog post Preparing Your Clients and Tools for KIP-500: ZooKeeper Removal from Apache Kafka.


|  | With ZooKeeper | Without ZooKeeper |
| --- | --- | --- |
| Configuring clients and services | zookeeper.connect=zookeeper:2181 | bootstrap.servers=broker:9092 |
| Configuring Schema Registry | kafkastore.connection.url=zookeeper:2181 | kafkastore.bootstrap.servers=broker:9092 |
| Kafka administrative tools | kafka-topics --zookeeper zookeeper:2181 ... | kafka-topics --bootstrap-server broker:9092 … --command-config <properties to connect to brokers> |
| REST Proxy API | v1 | v2 or v3 |
| Getting the Kafka cluster ID | zookeeper-shell zookeeper:2181 get /cluster/id | kafka-metadata-quorum or view metadata.properties or confluent cluster describe --url http://broker:8090 --output json |
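
One way to find client and service configurations that still point at ZooKeeper is to scan their properties files for the keys in the comparison above. The Python sketch below is a hypothetical helper, not a Kafka or Confluent tool. It assumes simple key=value lines and covers only the two configuration keys listed above, so extend the mapping for your own tools.

```python
# Keys taken from the comparison above, mapped to their broker-based replacements.
ZOOKEEPER_KEYS = {
    "zookeeper.connect": "bootstrap.servers",
    "kafkastore.connection.url": "kafkastore.bootstrap.servers",
}


def find_zookeeper_settings(properties_text: str) -> list:
    """Return (line_number, key, suggested_key) for each ZooKeeper-based setting."""
    findings = []
    for number, raw in enumerate(properties_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key = line.split("=", 1)[0].strip()
        if key in ZOOKEEPER_KEYS:
            findings.append((number, key, ZOOKEEPER_KEYS[key]))
    return findings


if __name__ == "__main__":
    example = "# client config\nzookeeper.connect=zookeeper:2181\ngroup.id=demo\n"
    for number, key, suggestion in find_zookeeper_settings(example):
        print(f"line {number}: {key} -> consider {suggestion}")
```

This is meant for clients and services only. Until KIP-500 is fully implemented, the brokers themselves still need their ZooKeeper connection settings.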

And then try out the early access code coming in the next major Kafka release. Stay tuned for the Apache Kafka 2.8.0 release blog post for more details.

Yeva Byzek is an integration architect at Confluent designing solutions and building demos for developers and operators of Apache Kafka. She has many years of experience validating and optimizing end-to-end solutions for distributed software systems and networks.
