To effectively use ksqlDB, the streaming database for Apache Kafka®, you should of course be familiar with its features and syntax. However, a deeper understanding of what goes on underneath ksqlDB’s surface can help you make better decisions as you design your streaming architectures. You should ideally know something about ksqlDB’s basic internals, where it fits into the Kafka ecosystem, and the functionality of its advanced features. In this complete introduction, we’ll provide an in-depth overview of how stateless and stateful operations work, how joins work, and how ksqlDB supports complex features like elastic scaling, fault tolerance, and high availability. Get an introduction to each of these topics below, and for an in-depth treatment, you can watch the free course Inside ksqlDB on Confluent Developer.
Implementing ksqlDB lets you significantly reduce the complexity of your streaming pipelines. Because ksqlDB includes primitives for connecting to external data sources and for processing your data, you no longer need external components to perform those functions. And ksqlDB also supports materialized views, which create data sets that can be queried directly by your application just like a database table. Scaling, securing, monitoring, debugging, and operating are all made easier through the architectural simplifications brought about by ksqlDB.
Streams in ksqlDB are backed by Kafka topics. When you issue a command to create a ksqlDB stream, it communicates with your Kafka brokers to create the underlying Kafka topic that you have specified—if it doesn’t exist. After you have created a stream, you put data into its rows using simple SQL statements, each of which corresponds to the invocation of a Kafka client producer call (each row being a Kafka record). Once you have your data, it is straightforward to transform it using one or multiple persistent queries, which execute Kafka Streams topologies under the hood.
When operations require state, things get a bit more complex under ksqlDB’s hood. ksqlDB utilizes materialized views, tables which maintain running, aggregate calculations that are incrementally adjusted as new data points arrive. Materialized views are stored in a RocksDB database for fast access, and queries against them are extremely fast (ksqlDB ensures that a key’s rows appear in a single partition). Because materialized views are reductive—that is, they only keep the aggregation—the full history of changes made to a view also get stored in a changelog Kafka topic, which can be replayed later to restore state should a materialized view go lost. ksqlDB’s two kinds of queries, pull and push, can both fetch materialized view data, but the former queries and terminates in a traditional relational manner, while the latter stays alive to capture streaming changes.
Stream processing applications tend to be based around multiple independent event streams, thus joins are essential. In ksqlDB, you can join streams to streams, streams to tables, and tables to tables. (In other words, you can join data at rest with data in motion.) Watch below for animated explanations of stream-table joins and table-table joins.
Enriching a stream with another set of data is a common ksqlDB task, and one you can easily practice executing on Confluent Cloud. In this hands-on exercise, you can begin by learning how to set up and populate an empty stream and table. Then run a stream-table join query and persist the enriched data to a new stream. Proceed by opening an instance of Confluent Cloud, where you can apply the promo code
KSQLDB101 to receive $101 of free usage.
When you add servers to your ksqlDB cluster, ksqlDB automatically rebalances and reassigns the right processing work to the right servers at the right time. This happens dynamically, safely, and automatically. In a stateless scenario, partitions are divided among the available servers, so two servers process twice as fast as one, and eight process eight times as fast as one. With a stateful operation, such as a materialized table, the work is similarly divided among servers but state needs to be sharded: The same row with the same key needs to always go to the same partition—and therefore to the same server in the cluster. This reshuffling happens automatically. In addition, the backing data for each piece of state is written to a changelog, so it can be replayed to a new server if a node fails.
If you add servers to your cluster and configure them for high availability, they can immediately be switched in if a node fails. This works because high availability servers proactively and aggressively play in changelogs to their local stores on an ongoing basis, meaning that the changelogs don’t have to be replayed from the beginning when a node needs to be restored. However, this system is only eventually consistent, so you are able to bound the staleness of pull queries against replica servers.
To learn even more into ksqlDB’s internals, make sure to check out the full course at Confluent Developer: Inside ksqlDB. You may also be interested in listening to an episode of Streaming Audio, where Tim Berglund speaks with ksqlDB’s Principal Product Manager, Michael Drogalis.
Michael Drogalis is Confluent’s stream processing product lead, where he works on the direction and strategy behind all things compute related. Before joining Confluent, Michael served as the CEO of Distributed Masonry, a software startup that built a streaming-native data warehouse. He is also the author of several popular open source projects, most notably the Onyx Platform.