Tuning Truly Global Kafka Pipelines

“Running Kafka pipelines is like flying a jet engine…”—someone commented during the Tuning Kafka Pipelines talk at the Silicon Valley Code Camp, 2017—“… You have to look at the ratios and not rpms like in a car engine….”

I could not agree more. Setting up Kafka is just the beginning. Monitoring and tuning Kafka pipelines for reliability, performance, latency, uptime, and maintenance efficiency is the true challenge of running Kafka in production.

In this talk, I’ve shared my experience and some valuable lessons from tuning the performance of a truly global production Kafka pipeline at LinkedIn.

Furthermore, this was one of the best audiences I’ve seen at SVCC in years: more than 20 questions, very engaged and knowledgeable. Check it out!

1:46 Here’s what a truly global Kafka data pipeline looks like

5:30
Q: Wouldn’t it be more efficient to get data from a follower replica?
A: Theoretically, it is possible to read data from the follower replica as long as the follower is in sync with the leader. However, Kafka does not support that (just yet).

6:15
Q: So the replication is just to ensure resiliency in case broker #2 fails?
A: That’s right. If one of the brokers fails, the follower replica takes over as the leader and consumers can resume consumption from it.

8:56
Q: Why Kafka pipelines?
A: Kafka provides a very large, reliable, durable buffer between various stages in the pipeline. There are many other reasons too.

11:00
Q: What is the smallest Kafka pipeline?
A: One producer, one Kafka broker, one consumer
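
For reference, here’s a minimal sketch of that smallest pipeline using the plain Java client. The broker address, topic name, and group id are placeholders I made up, not anything from the talk:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SmallestPipeline {
    public static void main(String[] args) {
        String broker = "localhost:9092";   // the single broker (placeholder)
        String topic  = "demo-topic";       // placeholder topic name

        // One producer writes a message to the broker...
        Properties p = new Properties();
        p.put("bootstrap.servers", broker);
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>(topic, "key", "hello, pipeline"));
        }

        // ...and one consumer reads it back.
        Properties c = new Properties();
        c.put("bootstrap.servers", broker);
        c.put("group.id", "demo-group");            // placeholder group id
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of(topic));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> r : records) {
                System.out.println(r.value());
            }
        }
    }
}
```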

12:24 Running Kafka is not a trivial thing. Kafka needs a ZooKeeper cluster.

12:38
Q: When is the right time to introduce Kafka?
A: Right now! Kafka should be multi-tenant. Multiple apps should be able to run through the same pipeline.

13:35 A single mirror-maker process runs multiple consumers but only one producer

14:18 No single point of failure. Kafka achieves that by replicating data. Kafka pipelines achieve that by replicating data across multiple datacenters.

15:13 LinkedIn’s crossbar topology of Kafka pipelines. Active-active replication. Supports traffic shifts.

16:27 Durability interacts with latency and throughput. Highest durability comes with acks=all.

18:11 If you want to increase the throughput of the entire Kafka cluster, add more brokers and hosts. If you want to increase the throughput of a specific application, increase the # of partitions. Pipeline throughput depends on a number of things, including colocation. “Remote consume” (the mirror-maker consumes across the WAN and produces locally; see 41:16) achieves the highest pipeline throughput.

20:16 Producer performance tuning. Effects of batching, linger.ms, compression type, acks, max in-flight requests, etc.
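
As a rough sketch of the knobs mentioned here, using the standard Java producer config keys; the values are illustrative starting points, not the settings used at LinkedIn:

```java
import java.util.Properties;

// Illustrative producer settings for the knobs discussed at 20:16.
// Values are example starting points, not LinkedIn's production settings.
public class TunedProducerConfig {
    public static Properties props() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker1:9092");  // placeholder address
        p.put("batch.size", 262144);                 // bigger batches -> better throughput
        p.put("linger.ms", 10);                      // wait briefly so batches fill up
        p.put("compression.type", "lz4");            // trade CPU for network/disk
        p.put("acks", "all");                        // highest durability (see 16:27)
        p.put("max.in.flight.requests.per.connection", 5); // pipelining vs. strict ordering
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return p;
    }
}
```

The general trade-off: batching, lingering, and compression buy throughput at the cost of latency and CPU, while acks and max in-flight requests trade durability and ordering against speed.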

23:30 Consumer performance tuning. More partitions mean more parallelism; tune fetch.message.max.bytes to control how much data each fetch pulls.
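
A corresponding consumer-side sketch, again with illustrative values. fetch.message.max.bytes is the old consumer’s name for this setting; the newer Java consumer expresses the same idea as max.partition.fetch.bytes / fetch.max.bytes:

```java
import java.util.Properties;

// Illustrative consumer settings for the knobs discussed at 23:30.
public class TunedConsumerConfig {
    public static Properties props() {
        Properties c = new Properties();
        c.put("bootstrap.servers", "broker1:9092");          // placeholder address
        c.put("group.id", "tracking-mirror");                // placeholder group id
        c.put("max.partition.fetch.bytes", 4 * 1024 * 1024); // pull bigger chunks per partition
        c.put("fetch.max.bytes", 50 * 1024 * 1024);          // overall cap per fetch response
        c.put("fetch.min.bytes", 64 * 1024);                 // let the broker batch up data
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return c;
    }
}
```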

25:13 Broker performance tuning. Performance depends on the degree of replication, the number of replication threads, and batching. Inter-broker replication over SSL has overhead.
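
The broker-side knobs mentioned here live in each broker’s server.properties; the sketch below lists them as Java Properties only to stay consistent with the other examples, and the values are illustrative:

```java
import java.util.Properties;

// server.properties keys mentioned at 25:13, written as Java Properties
// purely for consistency with the other sketches. Illustrative values only.
public class BrokerTuningSketch {
    public static Properties serverProperties() {
        Properties p = new Properties();
        p.put("num.replica.fetchers", "4");              // more replication threads per broker
        p.put("replica.fetch.max.bytes", "4194304");     // bigger replication fetches (batching)
        p.put("security.inter.broker.protocol", "SSL");  // SSL replication adds CPU overhead
        return p;
    }
}
```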

26:55
Q: Do the number of partitions affect producer performance?
A: I totally screwed this one up. They do. Increasing partitions allows more parallelism and more throughput; the producer can send more batches in flight. What was I thinking?

29:17
Q: How many partitions should be used? How do you determine the # of partitions?
A: There’s a formula. The answer depends on qps (messages per second), average message size, and retention time. These three factors multiplied give you the total on-disk size of a single replica of the topic. However, the size of a partition replica on a single machine should not exceed X GB (say, 25 GB), because moving partitions (when needed) takes a very long time. Second, if a broker crashes and comes back up, it needs to rebuild the index from the stored log, and that recovery time is proportional to the size of the partition on disk.
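
Here’s my reading of that formula as a small worked example; all inputs are made-up numbers, and the 25 GB cap is just the placeholder value from the answer above:

```java
// A minimal sketch of the partition-sizing rule of thumb from 29:17.
// All numbers below are made-up example inputs, not LinkedIn's actual values.
public class PartitionSizing {
    public static void main(String[] args) {
        double messagesPerSec   = 10_000;          // qps
        double avgMessageBytes  = 1_000;           // ~1 KB per message
        double retentionSeconds = 3 * 24 * 3600;   // 3 days of retention
        double maxPartitionGB   = 25;              // cap per partition replica on disk

        // qps * avg size * retention = total on-disk size of one replica of the topic
        double totalBytes = messagesPerSec * avgMessageBytes * retentionSeconds;
        double totalGB = totalBytes / (1024.0 * 1024 * 1024);

        // Keep each partition replica under the cap so partition moves and
        // broker restarts (log/index recovery) stay manageable.
        long minPartitions = (long) Math.ceil(totalGB / maxPartitionGB);

        System.out.printf("Topic size per replica: %.1f GB -> at least %d partitions%n",
                totalGB, minPartitions);
    }
}
```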

30:00 Kafka pipeline latency is typically a few hundred milliseconds. However, the SLA offered to customers is much longer (in minutes), because pipeline maintenance may involve downtime during which a tighter SLA could not be met.

32:53
Q: Do you talk about Zookeeper tuning?
A: High write throughput is a problem for ZooKeeper.

33:00 Performance tuning of truly global production Kafka pipelines. The expected global replication SLA was under 30 minutes for 100 GB of data (a single topic). The pipeline to Asia had a really long tail (3 hrs). Kafka Mirror Maker is a CPU-bound process due to decompression.

38:00
Q: What’s happening on either side of the ramp in the chart?
A: It’s a Hadoop push of 100 GB data in 12 minutes. Hence ramp up and ramp down.

39:00 Production pipeline machine setup: 100 GB of data pushed in 12 minutes, 840 mappers, 4 large brokers, broker replication over SSL, acks=-1, 4 KMM (Kafka Mirror Maker) clusters with 10 KMM processes each.

40:15
Q: Are all the mirror-makers in the source datacenter or not?
A: Only one pipeline (out of 4) has mirror-makers in the destination datacenter.

41:16 Textbook solution for mirror-maker performance tuning: remote consume, local produce. However, that was not practical in our case.

43:09
Q: Why under-replicated partitions in the source cluster?
A: Too many producers (Hadoop mappers in this case).

45:00 True production fire-fighting. A producer bug prevented us from sending 1 MB batches. Mirror-makers used to bail out unpredictably due to a compression estimation bug (KAFKA-3995).

46:45 CPU utilization of 4 mirror-makers in a well-behaved global Kafka pipeline. Monitoring hundreds of such metrics is the most important prerequisite for running a successful Kafka pipeline.

And here comes the best comment…

49:35 “… running Kafka pipelines is like flying with a Jet engine. Look at the ratios. They don’t maintain rpms like car …”  I agree!

50:45 The end!

This talk would not have been possible without continuous help from the Kafka SREs and the Venice team at LinkedIn. See the Kafka and Venice articles on the LinkedIn Engineering Blog.