When I have a small software project that I want to share with the world, I don’t write my own version control system with a web UI. I don’t even try to run similar software on a computer in someone’s datacenter. I don’t write a document analyzing the pros and cons of each decision.
Instead, I just create a repository on GitHub.
To be fair, GitHub is free for open source projects, so this is an easy call. And what if GitHub cost me $5 a month? I’m paying this much for my to-do list management software. Five dollars a month is quite easy for an engineer to afford even out of her own pocket.
Cloud services can have a wide price range. Apache Kafka® is free, and Confluent Cloud is very cheap for small use cases, about $1 a month to produce, store, and consume a GB of data. As your usage scales and your requirements become more sophisticated, your cost will scale too.
This is quite typical for managed databases as a service—they can be very low cost for casual use and very expensive when you use them in anger. This is what usage-based billing is all about, and it is one of the biggest cloud benefits.
With large-scale use cases, it is quite natural to seriously consider just running your own Kafka (or, for that matter, MinIO instead of Amazon S3). And of course, you should seriously consider a decision of this magnitude. You just need to consider it rationally, and I have noticed that many software engineers and engineering managers do not always do this.
Even at the low end of the scale, where a managed service is ridiculously inexpensive, I see engineers run their own Kafka and not even consider a managed service. When I ask why, the responses usually include: the joy of running Kafka themselves, the career opportunities it creates, and perhaps most commonly, a sense of futility—“my manager would never approve this expense.”
When you talk to engineering managers, the responses vary. Sometimes they trust their team’s ability to deliver quality service more than they trust a service provider. But quite often, the managers themselves don’t know how to calculate the trade-offs involved and how to justify the necessary budget. If you are an engineering manager, then you have years of practice going to your manager and saying, “This year, my team is also running Kafka, and we are spending 20 hours a week on maintenance. I need an additional headcount.”
What you likely have less experience with is going to your manager and saying, “This year, my team is building a real-time inventory management system. This requires an event streaming platform. We decided to use Confluent Cloud, and I need a budget of $7,000 a month for our use case.” Engineering organizations are built to hire engineers. Managers are incentivized and get promoted for building large teams, and no one seems to know how to convert this budget into managed services.
As an industry, we have not learned how to make great decisions about the use of managed services. It is time we upped our game.
If you are going to run Kafka on AWS, you’ll need to pay for EC2 machines to run your brokers. If you are using a Kubernetes service like EKS, you pay for nodes and for the service itself (the Kubernetes masters). Most relevant EC2 instance types are EBS-only, and Kubernetes only supports EBS as a first-class disk option, which means you need to pay for an EBS root volume in addition to the EBS data volume. Don’t forget that until KIP-500 is merged, Kafka is not just brokers: we need to run Apache ZooKeeper™ too, adding three or five nodes and their storage to the calculation. We run Kafka behind a load balancer (acting partially as a NAT layer), and since each broker needs to be addressed individually, you’ll need to pay for the “bootstrap” route and a route for each broker.
All these are fixed costs that you pay without sending a single byte to Kafka.
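To make the list above concrete, here is a minimal sketch that adds up the fixed monthly bill for a four-broker cluster with three ZooKeeper nodes. Every price in it is a hypothetical placeholder, not a real AWS rate, and the `FixedCostItem` class is my own illustration; substitute the numbers from your own bill.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average number of hours in a month


@dataclass
class FixedCostItem:
    name: str
    count: int
    hourly_price: float = 0.0   # per unit, hypothetical
    monthly_price: float = 0.0  # per unit, hypothetical (e.g. a volume)

    def monthly_cost(self) -> float:
        per_unit = self.hourly_price * HOURS_PER_MONTH + self.monthly_price
        return self.count * per_unit


def fixed_monthly_cost(items) -> float:
    """Sum the costs you pay before sending a single byte."""
    return sum(item.monthly_cost() for item in items)


items = [
    FixedCostItem("broker instances", count=4, hourly_price=0.50),
    FixedCostItem("ZooKeeper instances", count=3, hourly_price=0.10),
    FixedCostItem("Kubernetes control plane", count=1, hourly_price=0.10),
    # One root and one data volume for each of the seven nodes.
    FixedCostItem("EBS volumes", count=14, monthly_price=20.0),
    # One bootstrap route plus one route per broker.
    FixedCostItem("load balancer routes", count=5, hourly_price=0.02),
]

total = fixed_monthly_cost(items)
print(f"Fixed cost per month: ${total:,.2f}")
```

Listing each item separately is deliberate: it makes it harder to forget the ZooKeeper nodes or the per-broker routes.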
On top of this, there are network costs. Getting data into EC2 costs money, and depending on your network setup (VPC, private link, or public internet), you may need to pay both when sending and receiving data. If you replicate data between zones or regions, make sure you account for those costs too. And if you are routing traffic through ELBs, you will pay extra for this traffic. Don’t forget to account for both ingress and egress, and keep in mind that with Kafka, you typically read 3–5 times as much as you write.
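A rough sketch of the network math, assuming hypothetical per-GB prices and assuming that every follower replica lives in a different zone from the leader. The function name and all parameters are illustrative only; check your own topology before trusting the result.

```python
def monthly_network_cost(write_gb, read_multiplier, replication_factor,
                         ingress_price, egress_price, cross_zone_price):
    """Estimate monthly network charges in dollars.

    All prices are per GB and hypothetical. Assumes each written GB is
    copied to (replication_factor - 1) followers in other zones.
    """
    read_gb = write_gb * read_multiplier
    replicated_gb = write_gb * (replication_factor - 1)
    return (write_gb * ingress_price
            + read_gb * egress_price
            + replicated_gb * cross_zone_price)


# 1,000 GB written per month, read 4 times, replication factor 3.
cost = monthly_network_cost(
    write_gb=1000, read_multiplier=4, replication_factor=3,
    ingress_price=0.01, egress_price=0.02, cross_zone_price=0.02,
)
print(f"Network cost per month: ${cost:,.2f}")
```

Note how reads and replication dominate the total here even though the write volume is the number most teams estimate first.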
Now we are running the software, ingesting data, storing it, and reading it. We’re almost done. 🙂 You need to monitor Kafka, right? Make sure you account for monitoring (Kafka has many important metrics)—either with a service or self-hosted, and you’ll need a way to collect logs and search them as well. These can end up being the most expensive parts of the system, especially if you have many partitions, which increases the number of metrics significantly.
Engineering time is harder to put a number on than a vendor bill. But that doesn’t mean you should avoid considering these costs, or you will end up paying them later whether you want to or not.
It starts with capacity planning. Ideally, you start with some idea of what workload you will run on the cluster—MB/s ingress and egress, number of partitions, number of concurrent connections, connection rate, and request rate. Realistically, no new project ever estimates these correctly. This means that you will plan capacity based on some guesses and over-provision to provide a buffer when the guesses inevitably turn out to be wrong. Capacity planning takes time, which is money, and you have to pay for all the over-provisioned capacity, too.
If you still get capacity wrong and under-provision, you’ll pay the price in availability, which also means you and your on-call rotation will get paged, sometimes with rather mysterious issues, leading to significant time spent trying to solve all those problems. Expanding an already loaded cluster is a very challenging problem. An overloaded cluster will not have the spare bandwidth, IO, and CPU that you need in order to move workloads around. Time spent troubleshooting and the downtime involved also have cost implications.
Getting the capacity right involves more than just choosing the number of brokers. At Confluent, we spend significant time choosing the right components and making sure they are all aligned and optimized—the right machine types, right disk types, right disk sizes, broker configuration, zone alignment, load balancers, and a lot more.
Then, there is routine maintenance. Not tons, but definitely enough to keep people busy. You’ll want to upgrade regularly, especially with bugfix releases and security patches. Being on top of latest bug fixes is critical for avoiding disastrous incidents; it is heartbreaking to see customers lose data due to a bug that was fixed a year ago. New releases also open up new capabilities, including better configuration and further monitoring. You’ll want to stay on top of this and make sure you collect the latest metrics—they were added for a reason. It will also take significant time to tune alerts and make sure you know of impending disasters early while not drowning in noise.
Another important aspect of routine maintenance is cluster rebalancing and expansion. You’ll want to watch for early signs of workload imbalance and move partitions around to get to a balanced state. This improves performance and indirectly reduces your costs. More importantly, watching for early signs of overload and proactively expanding the cluster keeps you from trying to expand an already overloaded cluster.
There is a cost to elasticity or lack thereof. One of the more interesting aspects around maintenance is when your cluster has periodic workloads. You may know that you need twice the capacity for Black Friday, or on weekend events, or daily between 5:00 p.m. and 12:00 a.m. Do you have the ability to shrink and expand the cluster at will? Does it happen frequently enough that you need to automate it? Are you running in the cloud where you have some degree of elasticity or in an old-school datacenter where you need to order your entire capacity three months in advance?
In these cases, either you run at maximum capacity 100% of the time, paying the cost of capacity that you don’t always use, or you pay in time and effort for manual expand/shrink operations or the effort of automating it.
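The trade-off can be sketched in broker-hours. This assumes a hypothetical cluster that needs four brokers normally and eight during a daily seven-hour peak (5:00 p.m. to midnight); the function and its numbers are illustrative, and the sketch ignores the engineering effort of resizing.

```python
def provisioned_broker_hours(base_brokers, peak_brokers,
                             peak_hours_per_day, days=30):
    """Return (always_at_peak, elastic) broker-hours for the period."""
    always_at_peak = peak_brokers * 24 * days
    off_peak_hours = 24 - peak_hours_per_day
    elastic = (base_brokers * off_peak_hours
               + peak_brokers * peak_hours_per_day) * days
    return always_at_peak, elastic


always_at_peak, elastic = provisioned_broker_hours(
    base_brokers=4, peak_brokers=8, peak_hours_per_day=7)
unused = always_at_peak - elastic
print(f"Always at peak: {always_at_peak} broker-hours")
print(f"Elastic:        {elastic} broker-hours")
print(f"Paid for but unused: {unused} broker-hours")
```

Multiply the unused broker-hours by your hourly broker price and compare the result with what it would cost to build and maintain the automation.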
“Why is Kafka slow?” tax. Regardless of how well you planned capacity and tuning, someone is bound to ask, “Why is Kafka so slow?” Maybe they expected 20 ms of latency and are seeing 40 ms. Maybe it is usually 20 ms, but a few times a day it spikes to 2,000 ms. Finding the answer and fixing issues can be incredibly time consuming even with a team of experts and nearly impossible if your Kafka team is also the Apache Cassandra team and Elasticsearch team. Pay the price of hiring and training experts, or pay the price of living with slowdowns.
Hybrid cloud tax. Running on multiple clouds brings its own level of challenges. Even with Kubernetes, each cloud requires a capacity planning exercise—not all vcores are made equal; storage and networks also differ. You’ll also need to learn all about their network routing idiosyncrasies and their different defaults for Kubernetes hosts. Load testing is time consuming, and hybrid cloud doubles or triples the cost. If you rely on a managed service that only exists in one cloud, you get to enjoy the benefits of a managed service in one environment but need to pay all the DIY tax in another.
We started by talking about actual costs that you get billed for by your vendors and service providers. Then we talked about costs of the “time is money” variety—those are harder to quantify, but since we have some estimates for engineering salaries, there’s a fairly straightforward formula for converting time into money. But there are some costs that are nearly impossible to put a specific price tag on. This doesn’t mean they are not important. In fact, the opposite is true: they are hard to quantify because in the worst case, the price is the entire company.
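The time-to-money conversion can be sketched as follows. The 20 hours a week comes from the maintenance example earlier; the $200,000 fully loaded annual cost per engineer is a hypothetical figure that you should replace with your own.

```python
WORK_HOURS_PER_YEAR = 2080  # 52 weeks * 40 hours


def annual_cost_of_time(hours_per_week, loaded_annual_cost,
                        weeks_per_year=52):
    """Convert recurring engineering hours into dollars per year."""
    hourly_cost = loaded_annual_cost / WORK_HOURS_PER_YEAR
    return hours_per_week * weeks_per_year * hourly_cost


maintenance_cost = annual_cost_of_time(
    hours_per_week=20, loaded_annual_cost=200_000)
print(f"Maintenance costs about ${maintenance_cost:,.0f} a year")
```

Under these assumptions, 20 hours a week of maintenance is half of an engineer, which is the number to put next to a managed service quote.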
Time to market. Quite a few of the “engineering time” items we mentioned in the previous section have to happen before your application is deployed to production, such as capacity planning, monitoring, and tuning. Any delay, due to lack of experience or just the fact that this is challenging work, delays your product or application from being released to its users. In some cases, the delay doesn’t matter at all, but in others, it gives a competitor a critical advantage.
Engineering happiness. If you build it and deploy it, you also hold the pager for it and are on the hook for all maintenance tasks. This type of maintenance work isn’t what gets engineers excited to go to work in the morning. We can tolerate a bit of paging and even routine maintenance once or twice (provided that we have a clear plan on how to automate it away). But if there is no plan, and if there is too much stuff that isn’t really “engineering” on our plate, we become very unhappy very quickly. This may lead to retention problems and churn, which are pretty easy to quantify. But in the worst scenario, you have engineers who are disengaged and unmotivated in their current position.
Risk. If you have world-class experts running the service, the risk is not huge. It is still there, but with a dedicated team of experts and some over-provisioning, 99.95% uptime isn’t impossible to reach. If your engineers are learning on the job, as is frequently the case, you risk downtime, data loss, security breaches, and compliance issues. The impact can range from apologizing to customers to facing a crippling loss of trust from customers and hefty government fines. When calculating the cost of downtime, don’t forget to account for the loss of business during the downtime, the length of downtime (it takes longer for nonexperts to solve issues), the time engineers spend diagnosing and solving issues, deploying remediations, and recovering (if someone worked until 4:00 a.m. on an incident, don’t expect much productivity the next day), and the frequency of issues.
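A minimal sketch of the downtime calculation, covering only the factors that can be priced: lost business, incident length, engineer time (including recovery), and frequency. All inputs are hypothetical, and the sketch deliberately leaves out loss of trust and fines, which are much harder to estimate.

```python
def annual_downtime_cost(incidents_per_year, hours_per_incident,
                         revenue_loss_per_hour, engineers_per_incident,
                         engineer_hours_each, engineer_hourly_cost):
    """Estimate the yearly cost of downtime in dollars."""
    lost_business = hours_per_incident * revenue_loss_per_hour
    # Engineer hours include diagnosis, remediation, and lost
    # productivity the day after a late-night incident.
    engineer_time = (engineers_per_incident * engineer_hours_each
                     * engineer_hourly_cost)
    return incidents_per_year * (lost_business + engineer_time)


downtime_cost = annual_downtime_cost(
    incidents_per_year=6, hours_per_incident=3,
    revenue_loss_per_hour=5_000, engineers_per_incident=3,
    engineer_hours_each=10, engineer_hourly_cost=100,
)
print(f"Estimated downtime cost: ${downtime_cost:,} a year")
```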
Now that you’ve accounted for everything that DIY costs you, it’s time to compare different options. The alternatives to DIY are hosted or managed offerings.
When comparing different hosted and managed offerings, it is super important to check what is included in the price and what is an “extra.”
Do you pay for brokers? For traffic? For storage? What about traffic over the public internet? Between zones and regions? Do you pay for ZooKeeper nodes too? What about “special” network configurations—VPC peering, private link, and NAT gateways? Are “batteries included” when it comes to capacity planning, upgrades, and elasticity? Or do you still need to invest in engineering effort? If they claim to be elastic, do they balance the load for you, or are you on the hook for the most challenging part? Can you shrink or just expand?
Note that Kafka has relatively “thick” clients, so make sure the vendor of choice has the capability to troubleshoot client issues and take on the dreaded “Kafka is slow” question, rather than just responding with “the server is fine.”
After you’ve calculated exactly how much each option will cost, I recommend provisioning what you think are equivalent-sized clusters in your top options and using the Kafka perf producer and consumer to load them. Run the test for a few hours to make sure you see the sustained capacity and not “burst capacity,” and make sure you also keep an eye on latency. If latency becomes unacceptably high, you’ll want to reduce throughput to keep it within acceptable boundaries.
The last step is to use the total cost and throughput from the very basic benchmark to calculate the price in dollars per MB/s. This is a standard way to compare cost-effectiveness. It’s been used in TPC-C benchmarks to compare databases for the last few decades.
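The calculation itself is a single division. In this sketch the option names, monthly totals, and sustained throughput figures are all hypothetical; use the total cost you calculated and the throughput you measured in your own benchmark.

```python
def cost_per_mbps(total_monthly_cost, sustained_mb_per_sec):
    """Dollars per month for each MB/s of sustained throughput."""
    return total_monthly_cost / sustained_mb_per_sec


# (total monthly cost in dollars, sustained MB/s at acceptable latency)
options = {
    "self-managed": (12_000, 50),
    "managed service": (7_000, 40),
}

results = {name: cost_per_mbps(cost, mbps)
           for name, (cost, mbps) in options.items()}

for name, price in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: ${price:,.2f} per MB/s")
```

Use the throughput you sustained while latency stayed acceptable, not the peak number, or the comparison will flatter whichever option bursts best.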
Now you can compare your options like a pro.
One last thing to remember when comparing providers: Not all SLAs are equal, even if all providers claim 99.95% uptime in their SLAs. There are a lot of important details related to how they measure “availability” and how easy it is to process SLA-breach claims.
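For a sense of scale, this sketch converts an uptime percentage into a downtime allowance. It assumes availability is measured as simple wall-clock time over a 30-day month, which is exactly the kind of detail that differs between providers.

```python
def allowed_downtime_minutes(uptime_percent, days=30):
    """Minutes of downtime permitted by an uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


budget = allowed_downtime_minutes(99.95)
print(f"99.95% uptime allows about {budget:.1f} minutes a month")
```

Two providers with the same percentage can still differ widely if one of them excludes maintenance windows or only counts complete outages.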
It is worth noting that both you and your cloud provider can do quite a bit to reduce the cost per MB/s, so as to run Kafka more efficiently and either get more throughput from the same setup or get the same throughput from a more cost-effective setup.
The easiest way you can improve your throughput without provisioning additional capacity is by sending data to Kafka in a more efficient way. The difference between efficient and inefficient use of Kafka can be more than three times the throughput on the same hardware.
Luckily, the advice on being more cost efficient is exactly the same as performance tuning advice, so it is fairly easy to find. Use the sticky partitioner, tune linger.ms, use fewer clients, and maybe even use fewer partitions to send the same throughput using fewer requests.
All these performance improvements lead to greater efficiency regardless of who runs your Kafka brokers.
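As a starting point, the batching-related producer settings could be sketched like this. The values are hypothetical and not recommendations, `batch.size` is a related setting that I am adding as an assumption, and the `to_properties` helper is only for illustration. My understanding is that recent Apache Kafka Java clients already apply sticky partitioning to records without keys, so it needs no entry here.

```python
# Hypothetical starting values; tune them against your own workload.
producer_config = {
    # Wait up to 20 ms so that more records can share one request.
    "linger.ms": 20,
    # Upper bound in bytes for a per-partition batch (assumption:
    # not named in the text above, but it works together with linger.ms).
    "batch.size": 131072,
}


def to_properties(config):
    """Render a config dict as lines of a Java-style properties file."""
    return "\n".join(f"{key}={value}"
                     for key, value in sorted(config.items()))


properties_text = to_properties(producer_config)
print(properties_text)
```

Rerun your load test after each change, since a larger linger value trades a little latency for fewer, fuller requests.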
However, it is nice when your vendor shares your mindset. Confluent is a bit obsessive about Kafka performance (in a good way). We strive to introduce small performance improvements every few weeks and those add up. Those improvements ship as part of Apache Kafka releases, but we deploy from master to our cloud clusters every two weeks, so you can benefit from those improvements weeks or months before the Apache Kafka version is even released, not to mention the months or years it takes many companies to upgrade.
The key to making good choices regarding managed services boils down to setting aside institutional traditions and dysfunctional incentives and focusing on making an economic decision.
Because we are not used to making economic decisions, we often miss some key costs. I remember the day I realized that 50% of our costs were network, and that for each four-broker cluster, we also paid for three ZooKeeper nodes and three Kubernetes master nodes. That cross-zone traffic would have been less expensive if it were within the same VPC, used ELB, or was a full moon.
You also need to know how to convert engineering time to budget and vice versa. We had a weird SSL issue that caused us to perform many more IOPS than expected. We could solve it by replacing the disks with SSDs, so the storage would be capable of delivering all the IOPS we needed, or we could re-implement part of the Java SSL stack inside Kafka to work around the issue. Without knowing how to compare the cost of SSDs over time to the cost of engineering effort, it’s hard to make the right choice. (Spoiler: Switching to SSDs was an additional $50,000 a year. The bug took two weeks to fix. It turns out that “throw hardware at the problem” isn’t always the right call.)
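Here is that comparison as a minimal sketch. The $50,000 a year and the two weeks come from the story above; the single engineer and the $200,000 fully loaded annual cost are my own hypothetical assumptions.

```python
WEEKS_PER_YEAR = 52


def engineering_fix_cost(weeks, engineers, loaded_annual_cost):
    """One-time cost in dollars of engineers working on a fix."""
    return weeks * engineers * loaded_annual_cost / WEEKS_PER_YEAR


ssd_cost_per_year = 50_000  # recurring, from the example above
fix_cost = engineering_fix_cost(
    weeks=2, engineers=1, loaded_annual_cost=200_000)

print(f"SSDs: ${ssd_cost_per_year:,} every year")
print(f"Fix:  ${fix_cost:,.0f} once")
```

The hardware cost recurs every year while the fix is paid once, so the gap widens the longer the cluster runs.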
The hardest part of all is the part when you, as an engineer, need to make a compelling case to your managers, without sounding like you are whining or trying to get out of doing your job.
And if you’re interested in a free TCO assessment to see how much you can save with Confluent Cloud, let us know and we’d be happy to provide that for you.
To learn more, check out the Cost-Effective page as part of Project Metamorphosis.
Gwen Shapira is an engineering leader at Confluent. She has over 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen is the author of “Kafka—The Definitive Guide” and “Hadoop Application Architectures,” and she is a frequent presenter at industry conferences. Gwen is a PMC member on the Apache Kafka project and a committer on Apache Sqoop. When Gwen isn’t building data pipelines or thinking up new features, you can find her pedaling on her bike exploring the roads and trails of California, and beyond.