HDFS Erasure Coding (EC)

  1. Durability: How many simultaneous failures can be tolerated? This is also known as fault tolerance.
  2. Storage Efficiency: What fraction of the raw storage capacity holds actual user data? (A worked example follows this list.)
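As a rough illustration of both metrics, the sketch below compares 3-way replication with a Reed-Solomon RS(6,3) layout (6 data cells, 3 parity cells). The function names and the RS(6,3) choice are illustrative, not anything defined by HDFS itself.

```python
# Back-of-the-envelope durability and storage efficiency,
# comparing 3-way replication with Reed-Solomon RS(6,3).

def replication_metrics(replicas: int) -> tuple[int, float]:
    """Replication tolerates (replicas - 1) lost copies;
    only 1 of every `replicas` blocks stored is useful data."""
    durability = replicas - 1
    efficiency = 1 / replicas
    return durability, efficiency

def rs_metrics(data_units: int, parity_units: int) -> tuple[int, float]:
    """RS(k, m) tolerates any m lost units out of the k + m stored;
    k of those k + m units carry useful data."""
    durability = parity_units
    efficiency = data_units / (data_units + parity_units)
    return durability, efficiency

print(replication_metrics(3))  # (2, 0.33...): tolerates 2 failures, ~33% efficiency
print(rs_metrics(6, 3))        # (3, 0.66...): tolerates 3 failures, ~67% efficiency
```

So RS(6,3) tolerates one more simultaneous failure than 3-way replication while roughly doubling the storage efficiency.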

Erasure Coding (EC)

  1. Contiguous Layout
  • Saving Storage: Blocks are initially triplicated; once a block is closed and no longer being modified by additional data, a background task encodes it into a codeword and deletes its replicas.
  • Two-way Recovery: HDFS block errors are discovered and repaired not only on the read path but also proactively by a background check.
  • Low Overhead: Storage overhead drops from 200% (3-way replication) to just 50% with the Reed-Solomon (RS) encoding algorithm.
  • Erasure coding puts additional demands on the cluster in terms of CPU and network.
  • Erasure coding adds overhead when reconstructing data, because reconstruction requires remote reads of the surviving blocks.
  • Erasure coding is mostly used for warm storage (rarely modified data); if the data changes frequently, the parity has to be recalculated on every change. (A sketch of applying an EC policy follows this list.)
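One way to put this into practice is to enable an EC policy and apply it to a directory that holds rarely modified data. The sketch below drives the Hadoop 3 `hdfs ec` admin command from Python; RS-6-3-1024k is one of the built-in policies, while /data/warm is just a placeholder for your own warm-data directory.

```python
# Apply an erasure coding policy to a warm-data directory by shelling out
# to the `hdfs ec` admin command (Hadoop 3.x). Run this on a node with the
# HDFS client configured; the path and policy below are illustrative.
import subprocess

POLICY = "RS-6-3-1024k"   # built-in Reed-Solomon policy: 6 data + 3 parity cells
WARM_DIR = "/data/warm"   # placeholder: directory holding rarely modified data

def run(args: list[str]) -> None:
    """Echo and execute one hdfs command, failing loudly on errors."""
    print("$", " ".join(args))
    subprocess.run(args, check=True)

# Enable the policy cluster-wide, then set it on the target directory.
run(["hdfs", "ec", "-enablePolicy", "-policy", POLICY])
run(["hdfs", "ec", "-setPolicy", "-path", WARM_DIR, "-policy", POLICY])

# Verify which policy the directory now uses.
run(["hdfs", "ec", "-getPolicy", "-path", WARM_DIR])
```

Note that the policy only affects files written after it is set; files that already exist under the directory keep their replicated layout until they are rewritten (for example, copied back in with distcp).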
