In this blog we are going to discuss how to integrate Apache Kafka with Spark using Python, along with the required configuration.

How does Kafka work?

Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers, in on-premises as well as cloud environments.
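To make the Kafka-to-Spark integration concrete, here is a minimal sketch of consuming a Kafka topic from PySpark Structured Streaming. The broker address (`localhost:9092`), topic name (`events`), and package version are illustrative placeholders, and the job assumes a running Kafka broker:

```python
# Sketch: reading a Kafka topic with PySpark Structured Streaming.
# "localhost:9092" and the topic "events" are placeholder values;
# the spark-sql-kafka package version must match your Spark build.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafka-spark-demo")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")
    .getOrCreate()
)

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers key and value as binary; cast them before use.
messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

query = (
    messages.writeStream
    .format("console")      # print micro-batches to stdout for inspection
    .outputMode("append")
    .start()
)
query.awaitTermination()
```

The `spark.jars.packages` setting is the configuration piece that pulls in the Kafka connector; without it, `format("kafka")` fails at runtime.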



Introduction to Parquet

Apache Parquet is an open source columnar storage format, widely used in Hadoop and Spark, that can efficiently store nested data. It was initially developed by Twitter and Cloudera. Columnar formats are attractive because they enable greater efficiency in terms of both file size and query performance. File sizes are usually smaller than their row-oriented equivalents, since in a columnar format the values from one column are stored next to each other, which usually allows very efficient encoding. …

Karthik Sharma

Data Engineer at Legato
