Understanding Parquet and its Optimization opportunities

Row vs Columnar storage format
Hybrid data storage format
Parquet primitive types
Parquet File Format
ParquetOutputFormat properties
Metadata of Parquet files
  1. Dictionary encoding:
  • Spark uses snappy as default.
  • Impala uses snappy as default.
  • Hive uses deflate codec as default.




Data Engineer

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

planning (verb) : decide on and arrange in advance

The Only Guide You Need To Improve Programming Skill

Random Number Generation and Random Variate Generation

30-Day LeetCoding Challenge - 21

Kanban, the alternative path to agility

Unipolar Stepper Motor Driver Arduino

Paralysis Through Analysis — How To Balance Working On Projects vs. Learning Something New

Python Pro Tip: Want to use R/Java/C or Any Language in Python?

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Karthik Sharma

Karthik Sharma

Data Engineer

More from Medium

AWS Glue Streaming checkpoint + Amazon MSK ( Apache Spark, Kafka )

Big Data Processing: Most Time-Consuming Task

How poor provisioning of cloud resources can lead to 10X slower Apache Spark jobs

Read and write to S3 with Spark