Understanding Parquet and its Optimization opportunities

Row vs Columnar storage format
Hybrid data storage format
Parquet primitive types
Parquet File Format
ParquetOutputFormat properties
Metadata of Parquet files
  1. Dictionary encoding:
  • Spark uses Snappy as its default compression codec.
  • Impala uses Snappy as its default compression codec.
  • Hive uses the deflate codec by default.
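
To make the dictionary encoding item above concrete, here is a minimal sketch in plain Python. It only illustrates the idea: each distinct value in a column is stored once in a dictionary, and the column itself becomes a list of small integer indices. The function names `dictionary_encode` and `dictionary_decode` and the sample column are hypothetical, not part of any Parquet library, and the sketch leaves out the run-length and bit-packing step that Parquet applies to the indices.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def dictionary_encode(values: Sequence[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Split a column into (dictionary of distinct values, index per row)."""
    dictionary: List[Hashable] = []        # distinct values, in first-seen order
    positions: Dict[Hashable, int] = {}    # value -> its index in the dictionary
    indices: List[int] = []                # one small integer per row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary: Sequence[Hashable], indices: Sequence[int]) -> List[Hashable]:
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


# Hypothetical low-cardinality column: many rows, few distinct values.
column = ["IN", "US", "IN", "IN", "UK", "US"]
dictionary, indices = dictionary_encode(column)
restored = dictionary_decode(dictionary, indices)
```

The benefit grows with repetition: a column with millions of rows but only a handful of distinct values stores each value once plus compact indices, which is why dictionary encoding suits low-cardinality columns.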

Karthik Sharma

Data Engineer
