Kafka in Industrial IoT

Sensors and smart data analysis algorithms are key to any Internet of Things (IoT) system. But they are not everything. The conduit that moves vast volumes of data from sensors to data analysis systems is equally (if not more) important; it is what makes the whole wheel spin. In other words, data movement is the lifeblood of IoT systems.

Industrial IoT

Industrial IoT—the big brother of consumer IoT—is about collecting and analyzing data from machines, buildings, factories, power grids, hospitals, and the like. Like any other IoT system, IIoT systems need technologies to transport data reliably, securely, efficiently, and scalably (if there is such a word). There is a plethora of technologies out there that claim to be the right choice for data movement: MQTT, CoAP, OPC, OPC-UA, DDS, AllJoyn, and IoTivity, to name just a few.

If you step back from this alphabet-soup and try to distill what these technologies are trying to achieve, common architectural patterns begin to emerge. These patterns have been well-documented previously. Specifically, the Industrial Internet Consortium (IIC) has published a reference architecture that describes three architectural patterns.

  • Three-tier architecture pattern
  • Gateway-Mediated Edge Connectivity and Management architecture pattern
  • Layered Databus pattern

I’ll pick the Layered Databus pattern for further discussion here because it’s a more specific instance of the three-tier architecture pattern, I’m more familiar with it, and it strongly resembles the architecture of the applications that use Kafka.

The Layered Databus Pattern

[Figure: the three-layer databus architecture]

What’s a databus anyway? And why does it rhyme so closely with a database?

The reference architecture defines it very generally as

“… a logical connected space that implements a set of common schema and communicates using those set of schema between endpoints.”

Then come the layers. The purpose of the layers is

“[to] federate these systems into a system-of-systems [to] enable complex, internet-scale, potentially-cloud-based, control, monitoring and analytic applications.”

That makes sense. The scope of the “databus” widens as you move up the hierarchy. Each databus layer has its own data model (a fancy word for a collection of schemas/types). The raw data is filtered/reduced/transformed/merged at the boundary where the gateways sit. The higher layers are probably not piping through all the data generated at the lower layers, especially in cases such as wind turbines that generate 1 terabyte of data per week per turbine.

OMG Data Distribution Service (DDS) is a standard API for designing and implementing Industrial IoT systems using this architectural style. It’s a good way to implement a databus. DDS brings many unique aspects (e.g., a declarative QoS model, < 100µs latency, interoperability) into the mix that are way beyond the scope of this article. It’s perhaps sufficient to say that I’ve seen the evolution of the Layered Databus pattern in the IIC reference architecture very closely during my 5+ year tenure at Real-Time Innovations, Inc.

So far so good…

What’s in it for Kafka, anyway?

I submit that one of the best technologies to implement the higher layers of “databus” is Kafka.

Kafkaesque IIoT Architecture

Kafka is, by definition, a “databus”. Kafka is a scalable, distributed, reliable, highly available, persistent, broker-based, publish-subscribe data integration platform for connecting disparate systems together. The publishing and consuming systems are decoupled in time (they don’t have to be up at the same time), space (they can be located in different places), and consumption rate (consumers poll at their own pace). Kafka is schema-agnostic: it does not care about the encoder/decoder technology used to transmit data. Having said that, you are much better off adopting a well-supported serde technology such as Avro or Protobuf and managing the collection and evolution of schemas using some sort of schema-registry service.
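
To make the decoupling concrete, here is a minimal sketch using the plain Java producer API. The topic name, key, and payload are made up for illustration; in practice you would replace the String serializers with an Avro or Protobuf serde backed by a schema registry.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SensorPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // illustrative broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            // The consumer does not need to be online; Kafka retains the records until consumed.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key = sensor id, value = reading; a real system would use Avro/Protobuf here.
                producer.send(new ProducerRecord<>("sensor-readings", "turbine-42", "{\"rpm\": 1800}"));
            }
        }
    }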

Data-Centric Kafka: Log-Compaction

According to the IIC reference architecture, a data-centric pub-sub model is “central to the databus”. A data-centric pub-sub model focuses on

  1. Defining unique instances of data sources
  2. Managing the lifecycle of the instances
  3. Propagating modifications to the instance attributes as instance updates (not just messages)

It’s conceptually similar to a database table where each row has a primary key that determines the “instance”. The main difference is that the column values of rows (independent instances) change frequently and the middleware communicates them as first-class instance updates. The instance lifecycle events (e.g., creation of an instance, deletion of an instance) are first-class and can be observed. Note that data-centric pub-sub in no way requires a database.

Kafka has a built-in abstraction for the data-centric pub-sub communication model: log-compacted topics. In a nominal Kafka topic with a finite retention time, key-value pairs are deleted when their time is up. In a log-compacted topic, on the other hand, only a single value survives for a given key: the latest value for the key survives until a tombstone for that key is written. Nominal retention isn’t applicable anymore; only a tombstone deletes the key from the partition. Consumers can recognize the creation of a new key and the deletion of an existing key via the consumer API. As a result, Kafka is fully equipped for the data-centric communication model.
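
As a sketch of how such a topic comes into being, the Admin API can create a topic with cleanup.policy=compact; the topic name and sizing below are illustrative.

    import java.util.Collections;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Properties;

    public class CreateCompactedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");    // illustrative

            try (AdminClient admin = AdminClient.create(props)) {
                NewTopic topic = new NewTopic("sensor-latest-state", 6, (short) 3)
                        // Compaction keeps only the most recent value per key;
                        // a record with a null value (tombstone) eventually removes the key.
                        .configs(Collections.singletonMap("cleanup.policy", "compact"));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }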

Any consumer that is subscribed to a log-compacted topic from the time the topic is created is able to see all updates to the keys produced into the topic. This is because there’s a finite delay before Kafka selects keys for compaction; a caught-up consumer observes all the changes to all the keys.

A consumer that subscribes long after the creation of the log-compacted topic observes only the final value of keys that have already been compacted. Given enough time, any such consumer catches up to the head of the topic and, from that point on, observes all the updates to the subsequently produced keys.
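
A minimal sketch of such a late subscriber (topic and group names are made up): it reads the compacted topic from the earliest retained offset, keeps the latest value per key in a local map, and treats a null value as the tombstone that removes the key.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class LatestStateConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");          // illustrative
            props.put("group.id", "latest-state-view");
            props.put("auto.offset.reset", "earliest");                // start from the oldest retained record
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            Map<String, String> latest = new HashMap<>();              // local materialized view
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singleton("sensor-latest-state"));
                while (true) {
                    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                        if (rec.value() == null) {
                            latest.remove(rec.key());                  // tombstone: instance deleted
                        } else {
                            latest.put(rec.key(), rec.value());        // instance created or updated
                        }
                    }
                }
            }
        }
    }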

This feature enables many use-cases.

  • The first one is database replication. Large internet companies like LinkedIn have moved away from the native replication technology of, say, Oracle or MySQL and replaced it with incremental data-capture pipelines using Kafka.
  • The second use-case is backing up the local key-value stores of stream-processing jobs. The processing state of Samza/Kafka Streams jobs is periodically check-pointed locally and to log-compacted Kafka topics as a backup.

Data-Centric Kafka All the Way

There are two key aspects that make Kafka a great choice for the layered databus architectural pattern.

  1. Free (as in Freedom) Keys
  2. Kafka Streams for Dataflow-oriented Application Implementation

Free (as in Freedom) Keys

Kafka’s notion of a record (a.k.a. message) is a key-value pair, where the key and the value are orthogonal to each other. The key, even if it’s just a subset of the attributes in the value (e.g., a userId, sensorId, etc.), is separate from the value. The key could contain attributes outside the value altogether. The key structure (type) is in no way statically bound to the value structure (type). The key attributes can grow/shrink independently of the value should the need arise.

This flexibility allows application designers to publish a stream of values to different topics, each with a potentially different key type. This could also be done in a serial pipeline where an application reads a log-compacted topic with a (k1, v1) type and republishes it with a (k2, v1) type. I.e., Kafka accommodates logical repartitioning of the dataspace to suit the target application semantics. (Note: logical repartitioning is not to be confused with physical repartitioning by increasing/decreasing the number of partitions. The former is about domain-level semantics, while the latter is about managing parallelism and scalability.)
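
Here is a rough sketch of that kind of logical repartitioning with Kafka Streams; the topic names and the site-id extraction are invented for illustration. The input is keyed by sensor id and is republished keyed by a site id pulled out of the value.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class RepartitionBySite {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Input is keyed by sensor id; re-key it by a site id embedded in the value.
            KStream<String, String> bySensor = builder.stream("readings-by-sensor");
            KStream<String, String> bySite =
                    bySensor.selectKey((sensorId, value) -> extractSiteId(value));
            bySite.to("readings-by-site");        // same values, different key semantics

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "rekey-by-site");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // illustrative
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            new KafkaStreams(builder.build(), props).start();
        }

        // Hypothetical helper: parse the site id out of the reading payload.
        private static String extractSiteId(String value) {
            return value.split(",")[0];
        }
    }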

Kafka Streams for Dataflow-oriented Application Implementation

The data-centric principles should not be limited to just the communication substrate. If the consuming application is structured around the notion of logically parallel, stateful transformations of data in a pipeline (i.e., a flow) and the infrastructure handles state management around the core flow abstraction, there is much to gain from such an application design. First, the application can scale naturally with the increasing number of simultaneously active keys (e.g., growing number of sensors). It can also scale easily with the increasing data volume and update velocity (e.g., growing resolution/accuracy of the sensors). Scaling for both dimensions is achieved by adding processing nodes and/or threads because the dataflow design admits concurrency very easily. Managing growing scale requires effective strategies for distributed data partitioning and multi-threading on each processing node.

Not coincidentally, dataflow-oriented stream processing libraries have become increasingly popular due to their improved level of abstraction, ease of use, expressiveness, and support for flexible scale-out strategies. Dataflow-oriented design and implementation has been shown to be effective in distributing the workload across multiple processing nodes/cores. There are many options, including Apache Beam, Apache Samza, Apache Spark, Apache Flink, Apache NiFi, and Kafka Streams for Kafka.

Check out how Kafka Streams relates to Global Data Space.

The most touted middleware for implementing the layered databus pattern, DDS, supports dataflow-oriented thinking at the middleware level, but the API makes no attempt to structure applications around the idea of instance lifecycle. Having said that, Rx4DDS is a research effort that attempts to bridge the gap and enforce data-centric design all the way to the application layer.

Once again, Kafka comes to the rescue with Kafka Streams, which includes high-level composable abstractions such as KStream, KTable, and GlobalKTable. By composable, I mean it’s an API DSL with a fluent interface. KStream is just a record stream. KTable and GlobalKTable are more interesting as they provide a table abstraction where only the last update to a key-value pair is visible; they are ideal for consuming log-compacted topics. Application-level logic is implemented as transformations (map), sequencing (flatMap), aggregations (reduce), grouping (groupBy), and joins (products of multiple tables). Application state management (snapshots, failure recovery, at-least-once processing) is delegated to the infrastructure.
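
A minimal sketch of a topology built from these abstractions (topic names and the site-id extraction are illustrative): a KTable view over a compacted topic, a stream-table join for enrichment, and a grouped count per site.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class SiteAggregation {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // KTable over a log-compacted topic: only the latest value per sensor key is visible.
            KTable<String, String> latestPerSensor = builder.table("sensor-latest-state");

            // Raw readings keyed by sensor id.
            KStream<String, String> readings = builder.stream("readings-by-sensor");

            // Stream-table join: enrich each reading with the sensor's latest known state.
            readings.join(latestPerSensor, (reading, latest) -> reading + "|" + latest)
                    .to("enriched-readings");

            // Group by site and count readings per site; the result is itself a KTable.
            KTable<String, Long> readingsPerSite = readings
                    .groupBy((sensorId, value) -> extractSiteId(value))
                    .count();
            readingsPerSite.toStream()
                    .to("readings-per-site", Produced.with(Serdes.String(), Serdes.Long()));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "site-aggregation");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // illustrative
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            new KafkaStreams(builder.build(), props).start();
        }

        // Hypothetical helper: derive a site id from the reading payload.
        private static String extractSiteId(String value) {
            return value.split(",")[0];
        }
    }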

These capabilities are by no means unique to Kafka Streams. The same ideas surface in a number of the previously mentioned stream processing frameworks. The point, therefore, is that Kafka together with a stream processing framework provides an excellent alternative for implementing the layered databus pattern.

Kafka Pipelines for Federation (aka Layers)

It does not take a rocket scientist to realize that the “Layered Databus” is just a geographically distributed Kafka pipeline (or pipelines). Yeah, that’s everyday business for Kafka.

A common purpose of deploying Kafka pipelines at internet companies is tracking. User activity events captured at front-end applications and personal devices are hauled to offline systems (e.g., Hadoop) in near real time. The tracking pipelines commonly involve “aggregate” clusters that collect data from regional clusters deployed world-wide. Kafka mirror makers are used to replicate data from one cluster to another over long-latency links. It’s essentially a federation of Kafka clusters for data propagation in primarily one direction; the other direction is pretty similar.

Here’s how the same architecture can be applied to an abstract IIoT scenario.

[Figure: a Kafka pipeline applied to an IIoT scenario]

This picture shows how Kafka can be used to aggregate data from remote industrial sites.

A gateway publishes data from machines to a local Kafka cluster, where it may be retained for a few days (configurable). The mirror-maker clusters in the data-centers securely replicate the source data into the aggregate clusters. Large volumes of data can be replicated efficiently by using Kafka’s built-in compression. Enterprise apps consume data from the aggregate clusters.
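
As a rough sketch of the edge-side configuration (all names and numbers are illustrative, not a recommendation): the local topic is created with a few days of retention, and the gateway’s producer enables compression so that traffic and replication over the WAN stay efficient.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class EdgeGateway {
        public static void main(String[] args) throws Exception {
            Properties common = new Properties();
            common.put("bootstrap.servers", "edge-kafka:9092");   // local edge cluster, illustrative

            // Retain raw machine data locally for ~3 days before it becomes eligible for deletion.
            try (AdminClient admin = AdminClient.create(common)) {
                NewTopic machineData = new NewTopic("machine-data", 12, (short) 3)
                        .configs(Collections.singletonMap("retention.ms",
                                String.valueOf(3L * 24 * 60 * 60 * 1000)));
                admin.createTopics(Collections.singleton(machineData)).all().get();
            }

            // Compress batches on the wire; downstream replication benefits from compressed data too.
            Properties producerProps = new Properties();
            producerProps.putAll(common);
            producerProps.put("compression.type", "lz4");
            producerProps.put("key.serializer", StringSerializer.class.getName());
            producerProps.put("value.serializer", StringSerializer.class.getName());
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                producer.send(new ProducerRecord<>("machine-data", "machine-7", "temp=81.2"));
            }
        }
    }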

A pipeline in the reverse direction could be used for the control data. Its mirror-maker cluster is separate because of the opposite direction and different latency requirements. The control pipeline in the replica data-center is passive.

Interestingly, a similar architecture is used by British Gas in production.

Kafka offers a number of advantages in this case.

  • Aggregate clusters support seamless failover in the face of regional unavailability.
  • Kafka and mirror-makers support no-loss data pipelines.
  • There’s no need for a separate service for durability, as Kafka itself is a reliable, persistent commit log.
  • With growing scale (more topics, more keys, larger values, faster values, etc.), more Kafka broker and mirror-maker instances can easily be added to meet the throughput requirements.
  • Identity mirroring maintains the original instance lifecycle across the pipeline. Identity mirroring means that the source and destination clusters have the same number of partitions, and keys that belong to partition p in the source cluster also belong to partition p in the destination cluster. The order of records is guaranteed. (See the sketch after this list.)
  • Whitelisting and blacklisting of topics in the mirror maker allow only a subset of data to pass through.
  • Replication is secured using SSL.
  • Kafka is designed for operability and multi-tenancy. Each Kafka cluster can be monitored easily, as Kafka publishes numerous metrics (as JMX MBeans) for visibility, troubleshooting, and alerting during infrastructure problems.
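
For the identity-mirroring point above, the sketch below shows why matching partition counts preserve placement: Kafka’s default partitioner for keyed records hashes the key (murmur2) modulo the number of partitions, so the same key lands in the same partition number on the source and the destination cluster. This assumes the default partitioner on both sides; the key and partition count below are made up.

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.common.utils.Utils;

    public class KeyPlacement {
        // Mirrors the default placement for keyed records: murmur2 hash of the key modulo partition count.
        static int partitionFor(String key, int numPartitions) {
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        public static void main(String[] args) {
            // With the same partition count on the source and aggregate clusters,
            // a given key maps to the same partition number on both sides.
            System.out.println(partitionFor("turbine-42", 12));
        }
    }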

Summary

Kafka (the infrastructure) provides a powerful way to implement the layered databus architecture (a.k.a. system-of-systems) for Industrial IoT. Kafka Streams (the API) provides first-class abstractions for dataflow-oriented stream processing that lend themselves well to the notion of data-centric application design. Together, they allow system architects and other key stakeholders to position their assets (apps and infrastructure) for future growth in unique sensors, data volume/velocity, and analysis dimensions while maintaining a clean dataflow architecture.