Comparison: Kafka vs. Event Hubs connector for consuming streaming data in Databricks in an IoT scenario

Masayuki Ota
2 min read · Jul 28, 2020

Expected reader and outcome from this article

  • Expected reader: software engineers and data engineers who use Azure IoT and Spark technologies
  • Outcome: understand one important difference between the Kafka and Event Hubs connectors

Motivation

When I design and develop applications for manufacturing-industry IoT scenarios, our team sometimes uses Databricks (Spark) to consume streaming data from sensors.

I use Azure IoT Hub to manage devices and receive data on the cloud side. IoT Hub exposes an Event Hubs-compatible endpoint, and Event Hubs is in turn compatible with Apache Kafka. Therefore we have two options for consuming streaming data from IoT Hub on Azure Databricks, and I want to choose a connector based on an understanding of the difference.

Conclusion

One important difference between these connectors is that the Event Hubs connector can consume both the message properties and the message body, while the Kafka connector can consume only the message body. Therefore I prefer the Event Hubs connector over the Kafka connector.

When a developer builds a device application with the Azure IoT Hub Device SDK, the SDK lets them set two kinds of message properties (see the device-side sketch after this list).

  • System property: defined by IoT Hub automatically. It includes the device ID assigned in Azure IoT Hub, which you may want to use for group-by aggregations.
  • Application property: a property the developer can set freely. For example, the developer can set an Alert Boolean property so that the Spark application can branch its processing.
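
As a concrete illustration, here is a minimal device-side sketch using the Azure IoT Hub Device SDK for Python (azure-iot-device). The connection string, payload, and Alert value are placeholders for illustration, not values from my setup:

```python
# Minimal sketch with the Azure IoT Hub Device SDK for Python (azure-iot-device).
# The connection string and payload below are placeholders.
from azure.iot.device import IoTHubDeviceClient, Message

client = IoTHubDeviceClient.create_from_connection_string(
    "HostName=<your-iot-hub>.azure-devices.net;DeviceId=<device-id>;SharedAccessKey=<key>"
)

msg = Message('{"temperature": 25.3}')
msg.content_type = "application/json"
msg.content_encoding = "utf-8"
# Application property set freely by the developer; IoT Hub adds system
# properties (such as the device ID) automatically on the service side.
msg.custom_properties["Alert"] = "true"

client.connect()
client.send_message(msg)
client.disconnect()
```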

Details and working code

Prerequisites

You need an Azure IoT Hub receiving telemetry from a device, and an Azure Databricks workspace with a running cluster.

Consume streaming data with Event Hubs connector

Prepare the expected cluster and install the required library (see the comment lines at the top of the sketch below), then run this in your Databricks notebook:
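
The notebook from the original post is not reproduced here, so this is a minimal sketch of the same idea. It assumes the azure-eventhubs-spark Maven library is installed on the cluster, and the connection string for your IoT Hub's Event Hubs-compatible endpoint is a placeholder:

```python
# Cluster prerequisite (assumption): install the Maven library
# com.microsoft.azure:azure-eventhubs-spark_2.11 (match the version to your cluster).
from pyspark.sql.functions import col

# Event Hubs-compatible connection string of the IoT Hub (placeholder)
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=iothubowner;SharedAccessKey=<key>;EntityPath=<event-hub-compatible-name>"

eh_conf = {
    # The connector's PySpark API expects the connection string to be encrypted
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string),
}

df = (spark.readStream
      .format("eventhubs")
      .options(**eh_conf)
      .load())

# body arrives as binary; properties and systemProperties are map columns
display(df.select(
    col("body").cast("string").alias("body"),
    col("properties"),         # application properties set by the device
    col("systemProperties"),   # includes iothub-connection-device-id
))
```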

You can see the properties (application properties) and systemProperties columns in the streaming data. Of course, you can also consume the message body as the body column.

Consume streaming data with Kafka connector

Again, prepare the expected cluster (see the comment lines at the top of the sketch below), then run this in your Databricks notebook:
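
As before, this is a minimal sketch rather than the original notebook; the namespace, topic, and connection string are placeholders. No extra library should be needed, since the Kafka source ships with Databricks Runtime:

```python
from pyspark.sql.functions import col

# Event Hubs-compatible connection string of the IoT Hub (placeholder)
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=iothubowner;SharedAccessKey=<key>"

# The Event Hubs Kafka endpoint uses SASL_SSL with the PLAIN mechanism;
# the username is the literal string "$ConnectionString". On Databricks the
# Kafka client classes are shaded, hence the kafkashaded prefix.
jaas_config = (
    "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
    f'username="$ConnectionString" password="{connection_string}";'
)

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
      .option("subscribe", "<event-hub-compatible-name>")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.sasl.jaas.config", jaas_config)
      .load())

# The Kafka source exposes key, value, topic, partition, offset, timestamp,
# and timestampType -- but no column for the IoT Hub message properties.
display(df.select(col("value").cast("string").alias("value")))
```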

You can't consume the message properties from the streaming data, but you can consume the message body as the value column.
