HDFS Erasure Coding (EC)

  1. Durability: How many simultaneous failures can be tolerated? This is also known as fault tolerance.
  2. Storage Efficiency: What fraction of the raw storage holds useful data?
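
The two metrics above can be worked out with simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes 3-way replication and a Reed-Solomon scheme with 6 data units and 3 parity units as the points of comparison.

```python
from fractions import Fraction


def replication_metrics(replicas):
    """Metrics for N-way replication (hypothetical helper, for illustration)."""
    return {
        # Every copy but one can be lost.
        "fault_tolerance": replicas - 1,
        # Only one copy out of N is useful data.
        "storage_efficiency": Fraction(1, replicas),
        # Extra storage relative to the useful data.
        "overhead": Fraction(replicas - 1, 1),
    }


def rs_metrics(data_units, parity_units):
    """Metrics for a Reed-Solomon style code with k data and m parity units."""
    return {
        # Any m of the k + m units can be lost.
        "fault_tolerance": parity_units,
        # k useful units out of k + m stored units.
        "storage_efficiency": Fraction(data_units, data_units + parity_units),
        # m extra units for every k useful units.
        "overhead": Fraction(parity_units, data_units),
    }


three_way = replication_metrics(3)
rs_6_3 = rs_metrics(6, 3)
```

Under these assumptions, 3-way replication tolerates 2 failures with 200% overhead, while the 6+3 scheme tolerates 3 failures with 50% overhead.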

Erasure Coding (EC)

  1. Contiguous Layout
  • Saving Storage: Blocks are initially replicated three times. Once a block is no longer being changed by additional data, a background task encodes it into a codeword and deletes its replicas.
  • Two-way Recovery: HDFS block errors are detected and recovered not only on the read path but also actively by checks running in the background.
  • Low overhead: With the Reed-Solomon (RS) encoding algorithm, storage overhead drops from 200% under triplication to just 50%, for example with 6 data units and 3 parity units.
  • Erasure coding puts additional demands on the cluster in terms of CPU and network.
  • Erasure coding adds overhead when reconstructing data, because the surviving units must be fetched through remote reads.
  • Erasure coding is used mostly for warm storage (fixed data). If the data changes frequently, the parity must be recalculated each time.
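
The reconstruction cost and the parity recalculation mentioned above can be seen in a minimal sketch. It uses a single XOR parity cell, not Reed-Solomon and not the HDFS implementation, so it tolerates only one lost cell; the function names and sample data are hypothetical.

```python
from functools import reduce


def xor_cells(cells):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*cells))


def encode(data_cells):
    """Compute one parity cell over all data cells."""
    return xor_cells(data_cells)


def reconstruct(surviving_cells, parity):
    """Rebuild the single missing data cell.

    Every surviving cell plus the parity must be read, which in a
    cluster means remote reads from the nodes that hold them.
    """
    return xor_cells(surviving_cells + [parity])


data = [b"hdfs", b"warm", b"data"]
parity = encode(data)

# Simulate losing the second cell and rebuilding it from the rest.
recovered = reconstruct([data[0], data[2]], parity)

# Changing any data cell invalidates the parity, so it must be recomputed.
changed_parity = encode([b"hdfs", b"warm", b"datA"])
```

Rebuilding one cell here touches every other cell in the group, which is the source of the extra network and CPU load. The changed parity shows why frequently modified data is a poor fit.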




Data Engineer

Karthik Sharma


