Project Metamorphosis: Unveiling the next-gen event streaming platform.Learn More
Confluent Platform

Enterprise Streaming Multi-Datacenter Replication using Apache Kafka

One of the most common pain points we hear is around managing the flow and placement of data between datacenters. Almost every Apache Kafka user eventually ends up with clusters in multiple datacenters, whether their own datacenters or different regions in public clouds. There are many reasons you might need data to reside in Kafka clusters spread across multiple datacenters.

  • Geolocalization improves latency and responsiveness for users. However, not all data can be partitioned between datacenters; some may need to be available across all datacenters.
  • Backups for disaster recovery are a must for any mission critical data. To protect against natural disasters these backups must be geographically distributed, so stretched clusters are not an option.
  • Analytics applications spanning your entire dataset will need data from multiple datacenters aggregated into a single global cluster.
  • Sometimes you’ll need to migrate data to a new datacenter. This might be to decommission your datacenter or as part of a migration to the cloud.

These different use cases lead to a number of different deployment patterns for Kafka, but they all share a common requirement: data needs to be copied from one Kafka cluster to another (and sometimes in both directions).

enterprise streaming multi-datacenter replication

Although Apache Kafka ships with a tool called MirrorMaker that allows you to copy data between two Kafka clusters, it is pretty barebones. We regularly see users encounter gaps that make it more difficult to execute a multi-datacenter plan than it should be. For example, MirrorMaker will copy each message from the source cluster to destination cluster, but doesn’t keep anything else about the two clusters in sync: topics can end up with different numbers of partitions, different replication factors, and different topic-level settings. Trying to manually keep all these settings in sync is error prone and can lead downstream applications, or even replication itself, to fail if the systems fall out of sync. It also has a complex configuration spread across multiple files, requires keeping those configs in sync across instances, and is yet another service to deploy and monitor. We have consistently seen the need for an easier, more complete solution to working with Kafka across multiple datacenters.

Today, I’m happy to introduce Multi-Datacenter Replication, a new capability within Confluent Platform that makes it easy to replicate your data between Kafka clusters across multiple datacenters. Multi-Datacenter Replication fully automates this process, whether you’re aggregating data for analytics, need a backup, or are migrating from on prem to the cloud.

Confluent Platform now comes with these important features out of the box:

  • Dynamic topic creation in the destination cluster with matching partition counts, replication factors, and topic configuration overrides.
  • Automatic resizing of topics when new partitions are added in the source cluster.
  • Automatic reconfiguration of topics when topic configuration changes in the source cluster.
  • Topic selection using whitelists, blacklists, and regular expressions. Limit the data you replicate to just the set of data you need.
  • Support for replicating between secure Kafka clusters.
  • Scalability and fault tolerance baked in by leveraging Kafka’s Connect API
  • Integration with Confluent Control Center: configure, deploy, and monitor Multi-Datacenter Replication with ease.

With these features you get hassle-free replication of data, at scale, without writing a single line of code.

At its core, Multi-Datacenter Replication is implemented via a Kafka source connector that reads input from one Kafka cluster and passes that data to the API to be produced to the destination cluster. This means Multi-Datacenter Replication is enabled by all the benefits you’ve come to expect from the Kafka Connect API: the ability to scale elastically, fault tolerance, and a shared, easy to monitor infrastructure for data integration. This also means you can monitor Multi-Datacenter Replication in Confluent Control Center with the same ease as you would monitor any other connector.

As an example of how you can immediately start leveraging Multi-Datacenter Replication, let’s look at an aggregation use case. Suppose you have an application running in multiple data centers. In each DC you have a Kafka cluster that your application reports data about user actions to. You can process the resulting data within each datacenter independently, but to get a global view of what your users are doing you would need to combine the data sets. For some simple types of analysis you might be able to combine the results from each datacenter, but in many cases, you’ll need to get all the data in one place and then run your analysis.

MDC implementation

The image shows how you would implement this pattern with Multi-Datacenter Replication:

  1. Provision the destination Kafka cluster with a cluster of Kafka Connect workers. (You can test with a standalone worker, but distributed mode is recommended for production environments for scalability and fault tolerance).
  2. On the Connect cluster you provisioned, configure one source connector for each source Kafka cluster. For each connector, specify:
    1. Connection information for the source Kafka cluster, including its Zookeeper.
    2. Connection information for the destination Zookeeper. (Kafka connections are defined at the worker configuration level.)
    3. The set of topics you want copied, as a regex, blacklist, or whitelist.
    4. How topics should be renamed. By default topic names will be maintained exactly, but when data from multiple clusters is being aggregated you’ll want to add a prefix or suffix to identify the source datacenter, keeping the data in the aggregate cluster isolated and organized. This is especially important if you need consumers in the destination cluster to only read data from a specific data-center.
  3. Start your connector and watch your data start flowing into your destination DC.

Here you can see how simple setting up Multi-Datacenter Replication from Confluent Control Center is:

multi datacenter replication replicator-source

The connector should start creating topics in the destination cluster immediately and then start replicating messages. If you create new topics or modify settings on existing topics, you should see those reflected shortly in the destination cluster without any manual intervention.

As you can see, after some initial setup, Multi-Datacenter Replication can be mostly hands-off while ensuring your data is correctly replicated. If you want to see how easy it is for yourself, download Confluent Platform to try it out and check out the documentation for more details, configuration options, and examples. While Multi-Datacenter Replication is easy to implement, planning for a Multi-datacenter architecture can be more complex and involve many decisions, trade-offs and possible pitfalls. For this reason, we created the Multi-datacenter services engagement for you to consult with an expert and plan out your Multi-datacenter deployment.

Did you like this blog post? Share it now

Subscribe to the Confluent blog

More Articles Like This

Highly Available, Fault-Tolerant Pull Queries in ksqlDB

One of the most critical aspects of any scale-out database is its availability to serve queries during partial failures. Business-critical applications require some measure of resilience to be able to […]

Real-Time Data Replication with ksqlDB

Data can originate in a number of different sources—transactional databases, mobile applications, external integrations, one-time scripts, etc.—but eventually it has to be synchronized to a central data warehouse for analysis […]

15 Things Every Apache Kafka Engineer Should Know About Confluent Replicator

Single-cluster deployments of Apache Kafka® are rare. Most medium to large deployments employ more than one Kafka cluster, and even the smallest use cases include development, testing, and production clusters. […]

Sign Up Now



By clicking “sign up” above you understand we will process your personal information in accordance with our プライバシーポリシー

上記の「新規登録」をクリックすることにより、お客様は以下に同意するものとします。 サービス利用規約 Confluent からのマーケティングメールの随時受信にも同意するものとします。また、当社がお客様の個人情報を以下に従い処理することを理解されたものとみなします: プライバシーポリシー

単一の Kafka Broker の場合には永遠に無料

商用版の機能を単一の Kafka Broker で無期限で使用できるソフトウェアです。2番目の Broker を追加すると、30日間の商用版試用期間が自動で開始します。この制限を単一の Broker へ戻すことでリセットすることはできません。

Manual Deployment
  • tar
  • zip
  • deb
  • rpm
  • docker
  • kubernetes
  • ansible

By clicking "download free" above you understand we will process your personal information in accordance with our プライバシーポリシー

以下の「ダウンロード」をクリックすることにより、お客様は以下に同意するものとします。 Confluent ライセンス契約 Confluent からのマーケティングメールの随時受信にも同意するものとします。また、お客様の個人データが以下に従い処理することにも同意するものとします: プライバシーポリシー

このウェブサイトでは、ユーザーエクスペリエンスの向上に加え、ウェブサイトのパフォーマンスとトラフィック分析のため、Cookie を使用しています。また、サイトの使用に関する情報をソーシャルメディア、広告、分析のパートナーと共有しています。