Thursday 21 December 2017

Hadoop 3 works with HDFS Erasure Coding

Hadoop 3 adds HDFS Erasure Coding

Hadoop 3 is now available, with improvements including support for HDFS erasure coding, a preview of v2 of the YARN Timeline Service, and improvements to YARN/HDFS federation.

Hadoop is a framework for processing large data sets across clusters of computers using simple programming models.

The addition of HDFS erasure coding should make data more durable and reduce the amount of storage HDFS needs.

The default 3x replication scheme in HDFS carries a 200 per cent overhead in storage space and in other resources such as network bandwidth.

For many datasets with relatively low I/O activity, the additional block replicas are rarely accessed during normal operations but still consume the same amount of resources as the first replica. If erasure coding is used in place of replication, the storage overhead is no more than 50 per cent. HDFS erasure coding borrows from RAID, where erasure coding is implemented by striping: the data is logically divided into blocks, and the blocks are stored on different disks. For each stripe of blocks, parity is calculated and stored alongside the data. This is the encoding step; if a block is lost or corrupted, it can be recovered by recalculating it from the surviving blocks and the parity.
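
The striping-and-parity idea can be illustrated with a minimal sketch. It uses a single XOR parity cell per stripe, which is a simplification chosen for brevity: it can rebuild only one lost cell, whereas production erasure codes such as Reed-Solomon tolerate several. The function names (stripe, xor_cells, recover) and the cell size are hypothetical and are not HDFS APIs.

```python
from functools import reduce
from typing import List, Optional


def stripe(data: bytes, num_data_cells: int, cell_size: int) -> List[bytes]:
    """Split one stripe of data into fixed-size cells, zero-padding the tail."""
    capacity = num_data_cells * cell_size
    if len(data) > capacity:
        raise ValueError("data does not fit in a single stripe")
    padded = data.ljust(capacity, b"\x00")
    return [padded[i * cell_size:(i + 1) * cell_size] for i in range(num_data_cells)]


def xor_cells(cells: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized cells; used for both encoding and recovery."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*cells))


def recover(cells: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one missing cell (marked None) from the survivors and parity."""
    missing = [i for i, cell in enumerate(cells) if cell is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can rebuild only one lost cell")
    restored = list(cells)
    if missing:
        survivors = [cell for cell in cells if cell is not None]
        restored[missing[0]] = xor_cells(survivors + [parity])
    return restored


# Encode one stripe, lose a cell, and rebuild it from the parity.
cells = stripe(b"hadoop3!", num_data_cells=4, cell_size=2)
parity = xor_cells(cells)
damaged = [cells[0], None, cells[2], cells[3]]
assert b"".join(recover(damaged, parity)) == b"hadoop3!"
```

Each cell would live on a different disk or node, so losing one device costs one cell per stripe, and that cell is the XOR of everything that survived.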

The new release also includes a preview of YARN Timeline Service v2, which improves the scalability, reliability, and usability of the Timeline Service. The service is responsible for persisting application-specific information and generic information about completed applications.

Understanding HDFS Erasure Coding

Erasure Coding helps Capacity Utilization & Performance for Data Storage Systems

HDFS by default replicates each block three times. Replication provides a simple and robust form of redundancy to shield against most failure scenarios. It also eases scheduling compute tasks on locally stored data blocks by providing multiple replicas of each block to choose from.

However, replication is expensive: the default 3x replication scheme incurs a 200% overhead in storage space and other resources (e.g., network bandwidth when writing the data). For datasets with relatively low I/O activity, the additional block replicas are rarely accessed during normal operations, but still consume the same amount of storage space.

A natural improvement, therefore, is to replace replication with erasure coding (EC), which uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication.
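
The ~50% figure can be checked with simple arithmetic. The sketch below assumes a Reed-Solomon layout with 6 data blocks and 3 parity blocks; the text above does not name a specific layout, so treat those numbers as an assumption. storage_overhead is a hypothetical helper, not part of any Hadoop API.

```python
def storage_overhead(data_units: int, redundant_units: int) -> float:
    """Extra storage as a fraction of the original data size."""
    return redundant_units / data_units


# 3x replication: every block is stored with 2 extra copies.
replication_overhead = storage_overhead(data_units=1, redundant_units=2)  # 2.0, i.e. 200%

# Assumed erasure coding layout: 6 data blocks plus 3 parity blocks.
ec_overhead = storage_overhead(data_units=6, redundant_units=3)  # 0.5, i.e. 50%

# Raw bytes stored per logical byte, and the relative saving of EC over replication.
replication_raw = 1 + replication_overhead  # 3.0
ec_raw = 1 + ec_overhead  # 1.5
saving = 1 - ec_raw / replication_raw  # 0.5, i.e. ~50% less raw storage
```

Under this assumed layout a stripe survives the loss of any 3 of its 9 blocks, while 3x replication survives the loss of 2 of its 3 copies.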


Advantages of Erasure Coding in Hadoop

  • Saving substantial space – Blocks are initially triplicated; once they are sealed (no longer modified), a background task encodes them and deletes the extra replicas.
  • Flexible policy – Users and admins can flag files as hot or cold. Hot files stay replicated even after they are sealed.
  • Fast recovery – HDFS block errors are discovered and recovered both actively (in the background) and passively (on the read path).
  • Low overhead – Parity data adds a storage overhead of at most 50%.
  • Transparency/compatibility – HDFS users should be able to use all basic and advanced features on erasure-coded data, including snapshots, encryption, appending, caching and so forth.

YARN Improvements

YARN is Hadoop's framework for job scheduling and cluster resource management.

YARN federation scales a single YARN cluster to tens of thousands of nodes by federating multiple YARN sub-clusters.

Support for YARN resource types has also been added, making it possible to schedule additional resources such as disks and GPUs for better integration with machine learning and container workloads.
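
To show what scheduling on additional resource types means in practice, here is a minimal first-fit sketch. It is not YARN's scheduler: the functions fits and place, the node names, and the resource names (memory_mb, vcores, gpu) are all hypothetical, chosen only to illustrate that a request must be satisfied on every resource dimension it names.

```python
from typing import Dict, Optional


def fits(request: Dict[str, int], free: Dict[str, int]) -> bool:
    """True if the node has enough of every resource type the request names."""
    return all(free.get(name, 0) >= amount for name, amount in request.items())


def place(request: Dict[str, int], nodes: Dict[str, Dict[str, int]]) -> Optional[str]:
    """First-fit placement: reserve the request on the first node that can hold it."""
    for node_name, free in nodes.items():
        if fits(request, free):
            for name, amount in request.items():
                free[name] -= amount
            return node_name
    return None


# Hypothetical cluster: only node-b advertises a "gpu" resource.
nodes = {
    "node-a": {"memory_mb": 8192, "vcores": 4},
    "node-b": {"memory_mb": 16384, "vcores": 8, "gpu": 2},
}
chosen = place({"memory_mb": 4096, "vcores": 2, "gpu": 1}, nodes)
assert chosen == "node-b"
assert nodes["node-b"]["gpu"] == 1
```

Treating resources as an open-ended map rather than a fixed memory/CPU pair is what lets new types such as GPUs be added without changing the matching logic.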


Other improvements include the ability to federate YARN and HDFS sub-clusters transparently, and opportunistic container execution, which improves resource utilization and increases task throughput for short-lived containers. Support for cloud storage systems such as Amazon S3 and Azure Data Lake has also been improved.
