Understanding Parquet and Its Optimization Opportunities

Introduction to Parquet

Apache Parquet is an open source columnar storage format that can efficiently store nested data and is widely used in Hadoop and Spark. It was initially developed by Twitter and Cloudera. Columnar formats are attractive because they enable greater efficiency in terms of both file size and query performance. File sizes are usually smaller than row-oriented equivalents since, in a columnar format, the values from one column are stored next to each other, which usually allows very efficient encoding. Query performance improves too, since a query engine can skip over the columns that are not needed to answer a query.

Row vs Columnar storage format

OLTP workloads use a row format, since preference is given to data manipulation, whereas OLAP workloads use a columnar format.

Parquet is referred to as a columnar format in many books, but internally it is more like a hybrid format (a combination of row and columnar layouts).

Hybrid data storage format

Data Formats in Parquet

Parquet defines a small number of primitive types: boolean, int32, int64, int96, float, double, binary, and fixed_len_byte_array.

Parquet primitive types

The data stored in a Parquet file is described by a schema, which has at its root a message containing a group of fields. Each field has a repetition (required, optional, or repeated), a type, and a name.

Here is a simple Parquet schema for a weather record:

message WeatherRecord {
  required int32 year;
  required int32 temperature;
  required binary stationId (UTF8);
}

Notice that there is no primitive string type. Instead, Parquet defines logical types that specify how primitive types should be interpreted, so there is a separation between the serialized representation (the primitive type) and the semantics that are specific to the application (the logical type). Strings are represented as binary primitives with a UTF8 annotation. Complex types in Parquet are created using the group type, which adds a layer of nesting.

Parquet File Format

A Parquet file consists of a header followed by one or more blocks, terminated by a footer. The header contains only a 4-byte magic number, PAR1, that identifies the file as being in Parquet format, and all the file metadata is stored in the footer. The footer’s metadata includes the format version, the schema, any extra key-value pairs, and metadata for every block in the file. The final two fields in the footer are a 4-byte field encoding the length of the footer metadata, and the magic number again (PAR1).

Parquet File Format

Each block in a Parquet file stores a row group, which is made up of column chunks containing the column data for those rows. The data for each column chunk is written in pages. Each page contains values from only one column, making a page a very good candidate for compression since the values are likely to be similar. The first level of compression is achieved through how the values are encoded. There are different encoding techniques (plain encoding, run-length encoding, dictionary encoding, bit packing, delta encoding) which we will discuss in a separate blog.

Parquet Configuration

Parquet file properties are set at write time. The properties listed below are appropriate if you are creating Parquet files from MapReduce, Crunch, Pig, or Hive.

ParquetOutputFormat properties

The Parquet file block size should be no larger than the HDFS block size for the file so that each Parquet block can be read from a single HDFS block (and therefore from a single datanode). It is common to set them to be the same, and indeed both defaults are for 128 MB block sizes.

A page is the smallest unit of storage in a Parquet file, so retrieving an arbitrary row requires that the page containing the row be decompressed and decoded. Thus, for single-row lookups, it is more efficient to have smaller pages, so there are fewer values to read through before reaching the target value.

Tool to check the metadata of a Parquet file

Hadoop distributions ship with a tool called parquet-tools, which helps us fetch the metadata of a Parquet file.

In the above-mentioned command we are trying to get both the metadata and the content of the Parquet file. If we want to fetch the info for a file in an HDFS location, then we need to ensure that the full HDFS path is specified (including the host name) and that the firewall is open. If that does not work, copy the Parquet file to the local filesystem and perform the operations there.

Metadata of Parquet files

The part colored in blue shows the row group number; a Parquet file contains multiple blocks, which in turn contain row groups. The part colored in green shows the row count of that particular row group.

For each row group, we are provided information about each column: first the data type, followed by the compression type (SNAPPY in the above example), the column chunk size SZ (compressed/uncompressed/ratio), the encoding techniques, and statistics (min, max, and distinct values).

Optimization opportunities using Parquet format

1. Dictionary encoding:

Smaller files mean less I/O. Dictionary encoding improves both storage and access. There is a single dictionary per column chunk. Most types are encoded using dictionary encoding by default; however, a plain encoding will be used as a fallback if the dictionary becomes too large. The threshold size at which this happens is referred to as the dictionary page size, and it is the same as the page size by default. Please refer to the Parquet configuration section for more information. One can validate whether a file is dictionary encoded by using parquet-tools.

To keep dictionary encoding effective, decrease the row group size and increase the dictionary page size.

2. Page Compression:

These are the default compression schemes used with the Parquet format:

  • Spark uses snappy as the default.
  • Impala uses snappy as the default.
  • Hive uses the deflate codec as the default.

Using snappy compression will reduce the size of the page and improve read time.

Using the Parquet format allows for better compression, as the data within a column is more homogeneous. The space savings are very noticeable at the scale of a Hadoop cluster. I/O is reduced as we can efficiently scan only a subset of the columns while reading the data. Better compression also reduces the bandwidth required to read the input. Since we store data of the same type in each column, we can use encodings better suited to modern processors' pipelines by making instruction branching more predictable. The Parquet format is mainly used for write-once, read-many applications.

I hope this blog helped you in understanding the Parquet format and its internal functionality. Happy Learning!!!

Data Engineer at Legato
