Kafka is hard
17 December 2025 — 1554 words — approx 8 minutes
This is the second time in my career that I have worked with Kafka. It is an incredible piece of software that solves a lot of hard problems. However, I find that Kafka sometimes adds new kinds of problems which can make it hard to build software on top of it. Today I would like to share why I think that is and some of the problems I have encountered.
What makes Kafka hard?
Kafka is, conceptually, quite simple. It is a durable, highly-available replicated log. Messages (records) are published and fetched from topics, and topics achieve parallelism via partitions (shards). Records are persisted to disk (for durability) and consumers fetch records in order within a partition. Every consumer keeps track of which records it has processed via a per-partition counter called an offset.
For simple use cases where you have low volume, where a total order of records doesn't matter, or where records can be produced to any partition, using Kafka is as easy as producing records at one end and consuming them at the other end.
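In that simple case, a sketch of both ends might look like this (using the confluent_kafka Python client; the broker address, topic name, and group id are placeholders):

```python
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("events", key="user-42", value="hello")
producer.flush()  # block until the broker has acknowledged the record

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "example-group",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["events"])

while True:
    msg = consumer.poll(1.0)   # fetch the next record, if any
    if msg is None or msg.error():
        continue
    print(msg.partition(), msg.offset(), msg.value())
    consumer.commit(msg)       # record our progress (offset) for this partition
```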
Where Kafka becomes hard is when you have any of the following requirements:
- You need a total order of records for very high volume streams.
- You have a high volume stream that you want to shard over as few partitions as possible (for data locality within your consumers) but keep partitions as balanced as possible.
- You have consumers that need to batch records before they can commit offsets.
- Consumption lag must be bounded to avoid service degradation.
High volume streams
You want to distribute records over partitions as evenly as possible. This helps prevent hotspots which can lead to consumption lag in a small number of partitions (this is a painful situation to find yourself in by the way). The obvious answer here is to use round-robin such that every producer distributes records over all partitions equally, and it works great if you do not need a total order of records or if you do not need to maximize data locality.
However, if you have any of these requirements then round-robin is not a good choice. What you can do instead is choose a partition key for each record and let Kafka decide which partitions own which partition keys (typically via a hash of the key modulo the number of partitions for the topic).
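To make that concrete, here is a rough sketch of the idea. Kafka's own clients hash the key with murmur2; the generic hash below is only to illustrate that the same key always maps to the same partition, which is what gives you per-key ordering:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Same key -> same partition; different keys spread across partitions.
    return zlib.crc32(key) % num_partitions

assert partition_for(b"user-42", 100) == partition_for(b"user-42", 100)
```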
But what happens if you have a few really hot partition keys? You could divide the key space up further by adding shard numbers to the end of the partition key. For example you can split up the partition key hotkey into two partition keys hotkey.0 and hotkey.1.
But how do you know which keys are hot and which keys are not? You have to find a way to track the volume of every partition key and adjust the shard factor for your entire keyspace (hopefully automatically). This is a lot of work! And you can't track this volume in the consumers because variations in consumer throughput will mess up your numbers. You instead need to track this data out of band.
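A minimal sketch of what key salting might look like, assuming the shard factors come from whatever out-of-band volume tracking you build (all names here are hypothetical):

```python
import random

# Hypothetical shard factors maintained by an out-of-band volume tracker:
# hot keys get more shards, everything else defaults to one.
shard_factors = {"hotkey": 2}

def salted_key(key: str) -> str:
    """Append a shard suffix so records for a hot key spread over more partitions."""
    shards = shard_factors.get(key, 1)
    return key if shards == 1 else f"{key}.{random.randrange(shards)}"

print(salted_key("hotkey"))   # "hotkey.0" or "hotkey.1"
print(salted_key("coldkey"))  # "coldkey"
```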
And what if hotkey requires all records to be totally ordered? Now you cannot split up the keyspace because you have no guarantee records for hotkey.0 and hotkey.1 will be consumed in order. And if you did (by putting them on the same partition) then you would defeat the purpose of splitting up the keyspace anyway.
Consumers that batch
You may have consumers that need to batch records in memory for a period of time. For example, to build bloom filters for an index, or to compress data into a small number of files for archiving purposes. Whatever the reason, consumers that batch records come with a number of challenges.
First you want to ensure consumers are not assigned too many partitions. Depending on how your consumers process records, having too many partitions assigned to a consumer can be the difference between minor degradation (reduced throughput, but still making progress) and downtime (cascading failure, no progress can be made). The latter outcome is much more likely if your consumers need to build some kind of in-memory state and you have sized each consumer (in terms of CPU and memory) for a specific amount of work.
For example, suppose you have a topic with 100 partitions and 25 consumers, such that a consumer is expected to consume (on average) 4 partitions each. They each buffer 500MB of data in memory per partition, process the data, commit an offset and start again. You estimate that each consumer needs 2GB of memory for the in-memory buffers (4 × 500MB) plus some overhead (50%).
If some of your consumers become unavailable (perhaps the node they are running on has failed), Kafka will re-assign the partitions of the failed consumers to the remaining healthy consumers. However you have sized each consumer to expect an average of 4 partitions each, so unless you increase the memory limit of the surviving consumers, you now risk out of memory errors taking down the rest of your consumers. Out of memory is just one example of how re-assigning work that exceeds the total remaining capacity can cause cascading failures.
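The arithmetic, for reference (same numbers as the example above):

```python
# Back-of-the-envelope sizing: 100 partitions, 25 consumers,
# 500 MB buffered per partition, 50% overhead on top of the buffers.
PARTITIONS, BUFFER_MB, OVERHEAD = 100, 500, 1.5

def memory_needed_mb(consumers: int) -> float:
    partitions_per_consumer = PARTITIONS / consumers
    return partitions_per_consumer * BUFFER_MB * OVERHEAD

print(memory_needed_mb(25))  # 3000.0 MB: the ~3 GB each consumer was sized for
print(memory_needed_mb(20))  # 3750.0 MB: lose 5 consumers and each survivor needs 25% more
```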
Ideally an overloaded consumer would tell Kafka "I cannot take any more partitions right now, please find someone else". However, such a feature does not exist today. One way around this is to manage partition assignment yourself. However, you risk re-implementing many of the features of consumer groups such as health checks, rebalancing, fetching and committing offsets, etc.
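For a sense of what that trade-off looks like, here is a sketch of manual assignment with the confluent_kafka client; the broker address, topic, and choice of partitions are placeholders:

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "group.id": "manual-assignment",        # still used for storing committed offsets
    "enable.auto.commit": False,
})

# assign() bypasses group rebalancing entirely: this instance only ever reads
# the partitions you hand it, so a rebalance can never overload it, but
# liveness tracking and redistributing partitions are now your problem.
consumer.assign([TopicPartition("events", 0), TopicPartition("events", 1)])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # ... process the record ...
    consumer.commit(msg)
```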
Second, when a consumer has a partition revoked due to a rebalance, it needs to decide what to do with the uncommitted records it holds in memory. It can either attempt to flush the data it has processed so far (even if incomplete) and commit the current offset, or discard all uncommitted data for the revoked partition and let the new consumer start again from the last committed offset. And if flushing is CPU or I/O intensive, the consumer may not be able to commit its offset before the partition gets reassigned.
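As an illustration, one way to implement the discard policy is a revocation callback, sketched below with the confluent_kafka client (topic, group id and buffer structure are assumptions):

```python
from collections import defaultdict
from confluent_kafka import Consumer

buffers = defaultdict(list)  # partition number -> records buffered in memory

def on_revoke(consumer, partitions):
    # Discard uncommitted in-memory work for revoked partitions; the next owner
    # resumes from the last committed offset. A flush-and-commit policy would go
    # here instead, at the risk of not finishing before reassignment.
    for tp in partitions:
        buffers.pop(tp.partition, None)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "batching-consumers",
    "enable.auto.commit": False,
})
consumer.subscribe(["events"], on_revoke=on_revoke)
```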
Consumption lag
Consumption lag is a measure of unprocessed records. It occurs when a consumer cannot keep up with the rate at which records are produced and unprocessed records start to accumulate in the partition. One of the main challenges with consumption lag is that recovery can be slow, especially if records require a lot of processing and the backlog is large, and you cannot simply recover faster by adding more consumers because parallelism is capped at the number of partitions. If you are not careful, consumption lag can even lead to permanent data loss if it exceeds a topic's retention period.
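For reference, per-partition lag is simply the high watermark minus the group's committed offset; a sketch with the confluent_kafka client (topic name and partition count are assumptions):

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "example-group",
})

for p in range(4):  # assuming a 4-partition topic called "events"
    tp = TopicPartition("events", p)
    _low, high = consumer.get_watermark_offsets(tp, timeout=5.0)
    committed = consumer.committed([tp], timeout=5.0)[0]
    lag = high - committed.offset if committed.offset >= 0 else high
    print(f"partition {p}: lag {lag}")
```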
If the consumer is CPU bound and its work can be parallelized across cores, you can scale it vertically by giving it more CPU. If you cannot scale a consumer any further, or its work cannot be parallelized, then you can try adding an escape hatch where a lagging consumer republishes its records (without processing them) to other partitions that are not lagging. However, this approach (sketched after the list below) also has problems:
- If the partition is lagging because the consumer is overwhelmed, it can republish records from its own lagging partitions to healthy partitions consumed by other consumers. One problem with this is that a future rebalance may cause a consumer to give work back to itself.
- Limits are needed to avoid republishing the same record over and over again to different partitions. For example, you may decide to republish a record up to a maximum of three times before forcing it to be processed.
- Republishing records from lagging partitions to healthy partitions may not be suitable if you need exactly once semantics.
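Here is a sketch of what the escape hatch with a hop limit might look like; the header name, the cap of three, and the helper function are all hypothetical:

```python
import random
from confluent_kafka import Producer

MAX_REPUBLISHES = 3  # illustrative cap from the list above
producer = Producer({"bootstrap.servers": "localhost:9092"})

def handle(msg, healthy_partitions, overloaded: bool) -> bool:
    """Return True if the record should be processed locally."""
    headers = dict(msg.headers() or [])
    hops = int(headers.get("republish-count", b"0"))
    if overloaded and healthy_partitions and hops < MAX_REPUBLISHES:
        # Forward the record to an explicit healthy partition (hashing the key
        # again would just send it back to our own lagging partition).
        producer.produce(
            msg.topic(),
            key=msg.key(),
            value=msg.value(),
            partition=random.choice(healthy_partitions),
            headers={"republish-count": str(hops + 1).encode()},
        )
        return False  # forwarded without processing; safe to commit past it
    return True       # not overloaded, or the hop limit was reached
```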
If you don't need a total order of records within a partition then you can also try a scheduler/worker pattern where the scheduler consumes records from the partition (without processing them) and creates jobs in the form of offset ranges [start, end]. Workers pick up jobs from the scheduler, consume records between the start and end offsets, do their work and then inform the scheduler that the job is complete. The scheduler commits the end offset of the most recently completed job that does not create gaps (similar to how ACKs work in TCP's sliding window algorithm). This may not be suitable if you need exactly once semantics, because failed jobs will need to be retried by other workers, and a stateless scheduler may reschedule jobs that have completed but whose offsets have not yet been committed.
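To make the gap-free commit rule concrete, here is a minimal sketch of the scheduler's bookkeeping (names are hypothetical, and ranges are treated as half-open for simplicity):

```python
class Scheduler:
    def __init__(self, committed: int):
        self.committed = committed   # everything below this offset is done
        self.completed = {}          # start offset -> end offset of finished jobs

    def job_done(self, start: int, end: int) -> int:
        """Record a finished [start, end) job and return the offset safe to commit."""
        self.completed[start] = end
        # Advance only over a contiguous prefix of completed jobs, so a slow or
        # failed job never leaves a gap behind the committed offset.
        while self.committed in self.completed:
            self.committed = self.completed.pop(self.committed)
        return self.committed

s = Scheduler(committed=0)
print(s.job_done(100, 200))  # 0: job [0, 100) is still outstanding
print(s.job_done(0, 100))    # 200: prefix is now contiguous up to offset 200
```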
Summary
Kafka is an incredible piece of software which solves a lot of hard problems. However, as we have discussed, it sometimes adds new kinds of problems which can make it hard to build software on top of it. If you can isolate work effectively to individual records, you can drop the requirement for total order and distribute records randomly across partitions to spread load uniformly. Consumers that need to buffer specific amounts of data to do useful work present their own challenges, which in many cases you can overcome if you are prepared to solve them yourself. Kafka is still an excellent piece of software, but like many things, getting the most out of it requires trial, error and experience.