In this blog we are going to discuss about how to integrate Apache Kafka with Spark using Python and its required configuration.
How Kafka works ?
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers in on-premise as well as cloud environments.
Introduction to Parquet
Apache Parquet is a columnar open source storage format that can efficiently store nested data which is widely used in Hadoop and Spark. Initially developed by Twitter and Cloudera. Columnar formats are attractive since they enable greater efficiency, in terms of both file size and query performance. File sizes are usually smaller than row-oriented equivalents since in a columnar format the values from one column are stored next to each other, which usually allows a very efficient encoding. …