Kafka is hard

17 December 2025 — 1602 words — approx 9 minutes

This is the second time in my career that I have worked with Kafka. It's an incredible piece of software that solves a lot of hard problems, but it can also introduce new ones. Today I would like to talk about why I think Kafka is hard and discuss some of the new problems that Kafka can present.

What makes Kafka hard?

Kafka is, conceptually, quite simple. It's a durable, highly available replicated log. Messages (records) are published and fetched from topics, and topics achieve parallelism via partitions (shards). Records are persisted to disk (for durability), and consumers fetch records in order within a partition. Every consumer keeps track of which records it has processed via a per-partition counter called an offset.

For simple use cases where a total order of records doesn't matter, or where records can be produced to any partition, using Kafka can be as easy as producing records at one end and consuming them at the other end.
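
For illustration, here's roughly what that simple case looks like with the confluent-kafka Python client (the broker address, topic and group names are made up):

```python
from confluent_kafka import Consumer, Producer

# Produce a keyed record; Kafka routes it to a partition based on the key.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("events", key=b"user-42", value=b'{"action": "login"}')
producer.flush()

# Consume from the other end; offsets are committed automatically here
# because enable.auto.commit defaults to true.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "example-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    print(msg.partition(), msg.offset(), msg.value())
```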

Where Kafka becomes hard is when you have any of the following requirements:

  1. You need a total order of records for high-volume streams.
  2. You have high-volume streams that you would like to shard over as few partitions as possible (for data locality within your consumers) while keeping partitions as balanced as possible.
  3. You have consumers that need to batch records before they can commit offsets.
  4. Consumption lag must be bounded to avoid service degradation.

High-volume streams

With Kafka, you want to keep partitions as balanced as possible. This helps prevent hotspots, which can lead to consumption lag in a small number of partitions (a painful situation to find yourself in). The obvious answer is round-robin: every producer distributes records over all partitions equally. This works great if you need neither a total order of records nor data locality within your partitions.

However, if you have either of these requirements, round-robin is not a good choice. What you can do instead is choose a partition key for each record and let Kafka decide which partitions own which partition keys (this is typically done via a hash function modulo the number of partitions for the topic).
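
As an illustration, the idea boils down to something like this sketch (I'm using crc32 as a stand-in hash; the Java client actually uses murmur2):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in for the client's key hash: the same key always
    # lands on the same partition.
    return zlib.crc32(key) % num_partitions

assert partition_for(b"hotkey", 3) == partition_for(b"hotkey", 3)
```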

But what if you have a few really hot partition keys? You can divide the keyspace by appending shard numbers to partition keys. For example, you can split the partition key hotkey into two partition keys called hotkey.0 and hotkey.1.
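
A sketch of what producing with sharded keys might look like, with a fixed shard factor of 2 for the sake of the example:

```python
import random
from confluent_kafka import Producer

def sharded_key(key: str, shard_factor: int) -> str:
    # Fan a hot key out over shard_factor sharded keys, e.g.
    # "hotkey" -> "hotkey.0" or "hotkey.1" when shard_factor is 2.
    if shard_factor <= 1:
        return key
    return f"{key}.{random.randrange(shard_factor)}"

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("events", key=sharded_key("hotkey", 2).encode(),
                 value=b'{"action": "login"}')
producer.flush()
```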

How do you know which keys are hot and which keys are not? You have to find a way to track the volume of every partition key and adjust the shard factor for each partition key in your keyspace (hopefully automatically). You can't track this volume in the consumers because variations in consumer throughput will mess up your numbers. You instead need to track this data out of band.
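
One hypothetical shape for that out-of-band tracking: count the bytes produced per key and derive a shard factor from a target per-shard volume. The class and numbers are made up, and a real version would aggregate counts centrally rather than per producer:

```python
from collections import Counter

class KeyVolumeTracker:
    """Tracks produced bytes per partition key (a sketch)."""

    def __init__(self, target_bytes_per_shard: int):
        self.target = target_bytes_per_shard
        self.bytes_by_key = Counter()

    def record(self, key: str, num_bytes: int) -> None:
        self.bytes_by_key[key] += num_bytes

    def shard_factor(self, key: str) -> int:
        # At least one shard; grow the shard count as volume grows.
        return max(1, -(-self.bytes_by_key[key] // self.target))  # ceil div
```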

Once you've solved the problem of how to shard hot partition keys, you need to think about how to distribute the shards over your topic's partitions. For example, suppose your topic has 3 partitions, and hotkey has 3 shards: hotkey.0, hotkey.1 and hotkey.2. With a naive modulo function you may end up with hotkey.0 and hotkey.1 hashing to the same partition, such that one partition is doing two thirds of the topic's total throughput while another partition is doing nothing at all. It would be much better if all partitions shared the throughput equally.
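
One way to sidestep shard collisions is to place shards explicitly rather than hashing the full sharded key: derive a base partition from the unsharded key, then offset by the shard number. A sketch (a real version would also need to handle changes in shard factor):

```python
import zlib
from confluent_kafka import Producer

def partition_for_shard(key: str, shard: int, num_partitions: int) -> int:
    # Spread shards of the same key over consecutive partitions so that
    # hotkey.0 and hotkey.1 can never collide (as long as the shard
    # factor does not exceed the partition count).
    base = zlib.crc32(key.encode()) % num_partitions
    return (base + shard) % num_partitions

# Passing partition= bypasses the client's key-based partitioner.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("events", key=b"hotkey.0", value=b"...",
                 partition=partition_for_shard("hotkey", 0, 3))
producer.flush()
```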

And what if hotkey requires all records to be totally ordered? Now you cannot split up the keyspace, because you have no guarantee that records for hotkey.0 and hotkey.1 will be consumed in order. And if you did guarantee order (by putting both on the same partition), you would defeat the purpose of splitting up the keyspace anyway.

Batching

You may have consumers that need to batch records in memory for a period of time, for example to compress data into a small number of files for archiving. Whatever the reason, consumers that batch records come with several challenges.

First, you want to ensure consumers are not assigned too many partitions. Depending on how your consumers process records, having too many partitions assigned to a consumer can be the difference between minor degradation (reduced throughput, but still making progress) and downtime (cascading failure, no progress can be made). The latter outcome is much more likely if your consumers need to build some kind of in-memory state and you have sized each consumer (in terms of CPU and memory) for a specific amount of work.

For example, suppose you have a topic with 100 partitions and 25 consumers, such that each consumer is expected to consume (on average) 4 partitions. They each buffer 500MB of data in memory per partition, process the data, commit an offset and start again. You estimate that each consumer needs 2GB of memory for the in-memory buffers (4 × 500MB) plus 50% overhead, so you budget 3GB per consumer.

If some of your consumers become unavailable (perhaps the node they are running on has failed), Kafka will re-assign the partitions of the failed consumers to the remaining healthy consumers. However, you have sized each consumer to expect an average of 4 partitions, so unless you increase the memory limit of the surviving consumers, you now risk out-of-memory errors taking down the rest of your consumers. Out of memory is just one example of how re-assigning work that exceeds the total remaining capacity can cause cascading failures.
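
Back-of-the-envelope, using the numbers from the example above:

```python
PARTITIONS = 100
BUFFER_PER_PARTITION_GB = 0.5
OVERHEAD = 1.5  # 50% headroom on top of the buffers

def memory_needed_gb(consumers: int) -> float:
    partitions_each = PARTITIONS / consumers
    return partitions_each * BUFFER_PER_PARTITION_GB * OVERHEAD

print(memory_needed_gb(25))  # 3.0  -> what each consumer was sized for
print(memory_needed_gb(20))  # 3.75 -> losing 5 consumers already exceeds it
```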

Ideally, an overloaded consumer would tell Kafka "I cannot take any more partitions right now, please find someone else". However, such a feature does not exist today. One way around this is to manage partition assignment yourself, but you risk re-implementing many of the features of consumer groups, such as health checks, rebalancing, fetching and committing offsets.
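
For completeness, the Python client does let you bypass the group's assignment with assign(), but as noted, everything the group protocol normally handles becomes your job. A sketch:

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "manual-example",  # still required for committing offsets
    "enable.auto.commit": False,
})

# assign() pins partitions explicitly; no rebalances will ever fire,
# so health checks and failover are now your problem.
consumer.assign([TopicPartition("events", 0), TopicPartition("events", 1)])
```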

Second, when a consumer has a partition revoked due to a rebalance, the consumer must decide what to do with the uncommitted records it holds in memory. It can either attempt to process the data it has buffered so far (even if incomplete) and commit the current offset, or discard all buffered data for the revoked partition and let the new consumer start again from the last committed offset. If rebalances require consumers to discard their buffered records, frequent rebalances may prevent consumers from making progress. And if flushing is CPU or I/O intensive, the consumer may not be able to commit its offset before the partition gets reassigned. Either of these can cause consumption lag.
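
With the confluent-kafka Python client, that decision lives in a rebalance callback. Here's a sketch of the flush-and-commit variant (flush_buffer is a hypothetical helper):

```python
from confluent_kafka import Consumer, TopicPartition

def flush_buffer(topic: str, partition: int) -> int | None:
    # Hypothetical: write out whatever is buffered for this partition and
    # return the offset *after* the last flushed record (None if empty).
    ...

def on_revoke(consumer, partitions):
    # Called before the partitions are handed to another consumer.
    for tp in partitions:
        next_offset = flush_buffer(tp.topic, tp.partition)
        if next_offset is not None:
            # If the flush is too slow, the group may move on without
            # this commit and the new owner will re-read the records.
            consumer.commit(
                offsets=[TopicPartition(tp.topic, tp.partition, next_offset)],
                asynchronous=False,
            )

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "archiver", "enable.auto.commit": False})
consumer.subscribe(["events"], on_revoke=on_revoke)
```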

Consumption lag

Consumption lag is a measure of unprocessed records. It occurs when a consumer cannot keep up with the rate at which records are produced, and unprocessed records start to accumulate in the partition. One of the main challenges is that recovery can be slow, especially if records require a lot of processing and the backlog is large, while at the same time you cannot recover faster by adding more consumers (a partition can only be consumed by one consumer in a group, so the partition count caps your parallelism). If you are not careful, consumption lag can even lead to permanent data loss if it exceeds a topic's retention period.
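
Lag itself is easy to measure: it's the distance between the end of the log (the high watermark) and the committed offset. A sketch with the Python client:

```python
from confluent_kafka import Consumer, TopicPartition

def consumer_lag(consumer: Consumer, topic: str, partition: int) -> int:
    tp = TopicPartition(topic, partition)
    # The high watermark is the offset one past the last record.
    low, high = consumer.get_watermark_offsets(tp, timeout=5.0)
    committed = consumer.committed([tp], timeout=5.0)[0].offset
    if committed < 0:  # nothing committed yet (OFFSET_INVALID)
        committed = low
    return high - committed
```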

If the consumer is CPU bound and its work parallelizes across cores, you can scale it vertically. If you cannot scale a consumer any further, or its work cannot be parallelized, you can try adding an escape hatch where a lagging consumer republishes its records (without processing them) to partitions that are not lagging (see the sketch after this list). However, this approach has its own problems:

  1. If the partition is lagging because the consumer is overwhelmed, republishing shifts records from its lagging partitions to healthy partitions consumed by other consumers. The problem is that a future rebalance may assign one of those healthy partitions back to the overwhelmed consumer, handing it its own offloaded work.
  2. Limits are needed to avoid republishing the same record over and over again to different partitions. For example, you may decide to republish a record up to a maximum of three times before forcing it to be processed.
  3. Republishing records from lagging partitions to healthy partitions may not be suitable if you need exactly once semantics.
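
Here's a sketch of the republishing escape hatch with the hop limit from point 2, using a record header to carry the hop count (the header name and the choice of target partition are made up):

```python
MAX_HOPS = 3  # cap from point 2 above

def republish(producer, msg, num_partitions: int) -> bool:
    # Read the hop count from a record header (headers are name/bytes pairs).
    hops = int(dict(msg.headers() or {}).get("hops", b"0"))
    if hops >= MAX_HOPS:
        return False  # force local processing instead
    # Naive target choice: the next partition over. A real version would
    # pick a partition that is actually healthy.
    target = (msg.partition() + 1) % num_partitions
    producer.produce(msg.topic(), key=msg.key(), value=msg.value(),
                     partition=target,
                     headers=[("hops", str(hops + 1).encode())])
    return True
```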

If you don't need a total order of records within a partition, you can also try a scheduler/worker pattern where the scheduler consumes records from the partition (without processing them) and creates jobs in the form of offset ranges [start, end]. Workers pick up jobs from the scheduler, consume records between the start and end offsets, do their work and then inform the scheduler that the job is complete. The scheduler commits the end offset of the most recently completed job that does not create gaps (similar to how ACKs work in TCP's sliding window algorithm). This may not be suitable if you need exactly once semantics, as failed jobs will need to be retried by other workers, and stateless schedulers may reschedule jobs that have completed but not yet committed.
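
The interesting part of the scheduler is the commit rule: only advance the committed offset once the completed ranges are contiguous. A sketch of just that bookkeeping (no Kafka calls):

```python
import heapq

class Scheduler:
    """Tracks completed [start, end] offset ranges for one partition and
    yields the highest offset that is safe to commit (no gaps before it)."""

    def __init__(self, committed: int):
        self.committed = committed  # next offset the group expects
        self.done = []              # min-heap of completed ranges

    def complete(self, start: int, end: int) -> int:
        heapq.heappush(self.done, (start, end))
        # Advance over contiguous completed ranges, TCP-ACK style.
        while self.done and self.done[0][0] <= self.committed:
            _start, range_end = heapq.heappop(self.done)
            self.committed = max(self.committed, range_end + 1)
        return self.committed

s = Scheduler(committed=0)
s.complete(10, 19)       # gap before offset 10: still 0
print(s.complete(0, 9))  # 20 -> both ranges are now contiguous
```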

Summary

Kafka is an incredible piece of software that solves a lot of hard problems. However, it can sometimes add new ones. None of the problems we discussed are unsolvable, but they can require out-of-band solutions that take a lot of time and effort to get right. In my experience, the more time you spend with Kafka the better you get at anticipating these problems, but perhaps what actually makes Kafka hard is knowing how to avoid them completely.