ETL Streaming with Kafka

In this blog, we’ll explain how to use Kafka as an ETL tool. Previously at Xoxoday, we used batch processing to populate our backend reports, but with our growing customer base it became clear that our customers needed more than just batch processing. Reporting events in real time, as soon as they happen, enables both our clients and our backend teams to make decisions instantly.

So, first things first: how is Kafka related to databases? An interesting article by Jay Kreps, one of the creators of Kafka, sheds light on why it is okay to store data in Kafka[1]. The idea is to configure Kafka topics not to purge records, retaining the data perpetually just as we would in any data storage application. Further, Kafka is ACID compliant, as Martin Kleppmann explains[2], which means Kafka emulates most of the properties of a database system. Sounds interesting, right? But the story is not over yet. Kafka can work on top of many industry-standard data formats: JSON, CSV, Avro, Protobuf, etc. In a microservice environment where each service plays by its own rules and data formats, this makes it easy to source data from various services and sink it to one common endpoint, e.g. a data warehouse. Kafka also comes with a plethora of plugins that make maintaining, processing and transforming data inside the Kafka ecosystem itself a lot easier.
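As a minimal sketch, the "retain perpetually" behaviour is just a pair of topic-level configuration overrides, applied to whichever topics should keep their full history:

    # Topic-level overrides to keep records indefinitely (a sketch, not our
    # exact setup). retention.ms=-1 disables time-based deletion and
    # retention.bytes=-1 disables size-based deletion.
    retention.ms=-1
    retention.bytes=-1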

Kafka provides high throughput, is easily scalable, and is cheap to run since it mainly relies on ordinary hard disks. There are plenty of articles on the various benefits of Kafka and its use cases[3].

Kafka was already playing an important role in our architecture, so building ETL on top of it was comparatively easy, needing minimal investment in learning and infrastructure setup.

Extraction

For analytics and reporting, we embraced Kafka Connect as our de facto choice for extracting data from various data sources, be it CouchDB, MySQL or CSV flat files. Kafka Connect is a Kafka framework that makes it simple to quickly define connectors that move large collections of data into and out of Kafka[4]. It also provides various configuration options that come in handy for scalability and monitoring[5]. In our environment, we use the Debezium plugin on top of Kafka Connect to source data from various MySQL databases. Change data capture[6] is a design pattern that records the insert, update, and delete activity applied to a database table. Debezium captures the changed data and publishes it in JSON/Avro format to Kafka topics, as per our configuration in Kafka Connect. Fig 1 below shows our extraction data pipeline.
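To give a flavour of the setup, a Debezium MySQL source connector is registered with Kafka Connect through a JSON configuration along these lines. The hostnames, credentials, connector name and table list below are placeholders, and some property names vary between Debezium versions:

    {
      "name": "mysql-sales-connector",
      "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "********",
        "database.server.id": "184054",
        "database.server.name": "salesdb",
        "table.whitelist": "sales.orders,sales.order_items",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.salesdb"
      }
    }

Debezium then publishes each captured table's changes to its own Kafka topic, which the downstream transformation steps read from.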

Fig 1: Kafka connect can read data from multiple sources and publish them to Kafka Topic

Transformation

In a normal data warehouse architecture, we generally store data either in dimension tables or in measurable fact tables. So the basic challenge one may face while designing ETL on top of Kafka is to model the data pipeline differently from the traditional batch-processing approach. Kafka stores data as a sequence of events, which means going back in history to look up a change may seem a bit tricky, but Kafka Streams comes to our rescue.

KStreams and KTables are two important concepts in Kafka Streams. A KStream is an abstraction of a record stream where each record is some event in the real world, while a KTable is like a database table that stores only the latest state of any entity. Changed source data is populated in a topic, which can then be read into a KStream. The stateless records in the KStream can then be enriched using KTables in real time and published back to another topic. Let's look at an example of how a KStream and a KTable can help us enrich the information.

Say our sales representative makes sales in City A in the morning and in City B in the afternoon. We can maintain the sales representative's current location inside a KTable. As the sales happen, the records are populated in a KStream sequentially. To determine whether a sale happened in City A or City B, the events/records can be enriched using the current state in the KTable. The mantra here is to visualize each event in the real world as part of a sequence.
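In KSQL (the tool we describe next), this enrichment can be sketched roughly as follows; the topic, stream, table and column names are illustrative, not our actual schema:

    -- A sketch, not production code. The table keeps only the latest city per
    -- sales rep; the stream carries one record per sale.
    CREATE TABLE rep_location (rep_id VARCHAR, city VARCHAR)
      WITH (KAFKA_TOPIC='rep_location', VALUE_FORMAT='JSON', KEY='rep_id');

    CREATE STREAM sales (sale_id VARCHAR, rep_id VARCHAR, amount DOUBLE)
      WITH (KAFKA_TOPIC='sales', VALUE_FORMAT='JSON');

    -- Each sale is enriched with the rep's city as of the moment the event
    -- arrives, and written to a new output topic.
    CREATE STREAM sales_with_city AS
      SELECT s.sale_id, s.rep_id, s.amount, r.city
      FROM sales s
      LEFT JOIN rep_location r ON s.rep_id = r.rep_id;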

For data transformation, we are using Confluent KSQL[7]. KSQL is a SQL-styled language developed by Confluent for backend SQL developers who are not familiar with Java. KSQL still has some limitations, but none major enough to stop us. For example, working with nested JSON and performing multiple joins in a single query are not yet supported. These limitations can be worked around using tools like jq or languages such as Node.js. In our environment, we used jq to flatten nested JSON so it is ready to be processed by KSQL. In the case of multiple joins, we achieve the desired outcome by applying one join at a time, as sketched below.
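For instance, if the enriched sales stream from the previous example also needs region information keyed by city, a second join can be layered on top of the first; again, all names are illustrative:

    -- One join at a time: the output of the first join becomes the input of
    -- the second. city_region is another illustrative table, keyed by city.
    CREATE TABLE city_region (city VARCHAR, region VARCHAR)
      WITH (KAFKA_TOPIC='city_region', VALUE_FORMAT='JSON', KEY='city');

    CREATE STREAM sales_with_region AS
      SELECT s.sale_id, s.rep_id, s.amount, s.city, c.region
      FROM sales_with_city s
      LEFT JOIN city_region c ON s.city = c.city;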

Fig 2: Transformation of data from input topics to output topics. 

Fig 2 above shows a few of the transformations possible using Kafka Streams. We can join one KStream with another KStream using a windowed join. We can transform nested JSON into delimited output that can be pushed to a relational table. We can also enrich a record in a KStream by joining it with a KTable.
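The JSON-to-delimited step, for example, is just a re-serialization in KSQL; roughly (stream names carried over from the earlier sketches, still illustrative):

    -- Re-serialize the enriched stream as delimited (CSV-style) records so a
    -- downstream sink can load it straight into a relational table.
    CREATE STREAM sales_with_region_csv
      WITH (VALUE_FORMAT='DELIMITED') AS
      SELECT sale_id, rep_id, amount, city, region
      FROM sales_with_region;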

Load

The transformed data available in the output topics, as in Fig 2, is then ready to be consumed and loaded into our denormalized/flat fact tables. In some cases, a change of data may not be sequentially defined, and we may need to modify our data pipeline to use stored procedures to transform such events. For example, in our sales representative case, a department change or designation change may not be sequentially defined. If the report requirement is to update historical data on any change in department or designation, then we may need a database stored procedure to remap the historical data to the new state, along the lines of the sketch below.
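A minimal sketch of that remapping in MySQL, assuming hypothetical sales_fact and sales_rep_dim tables; a real stored procedure would wrap logic along these lines:

    -- Hypothetical tables: sales_fact holds one row per sale, sales_rep_dim
    -- holds the current department per rep. On a department change, rewrite
    -- the department on that rep's historical fact rows.
    UPDATE sales_fact f
    JOIN sales_rep_dim d ON f.rep_id = d.rep_id
    SET f.department = d.department
    WHERE f.department <> d.department;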

Summary

In this article, we tried to explain how closely Kafka is related to a database, and then looked at how Kafka is capable of performing all three parts of the ETL process. If the requirement is to build real-time reporting, Kafka makes perfect sense. With its streaming capabilities, Kafka can ingest huge volumes of data in real time from different sources as events happen, transform them with minimal latency, and publish them to the endpoint at high throughput.

What’s next

In a future blog, we’ll look into the various transformations that can be achieved using KSQL.

References

  1. https://www.confluent.io/blog/okay-store-data-apache-kafka/
  2. https://www.youtube.com/watch?time_continue=3&v=v2RJQELoM6Y
  3. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  4. https://kafka.apache.org/documentation/#connect
  5. https://docs.confluent.io/current/connect/managing/monitoring.html
  6. https://en.wikipedia.org/wiki/Change_data_capture
  7. https://www.confluent.io/product/ksql/