Thursday 21 December 2017

Hadoop 3 Adds HDFS Erasure Coding

Hadoop 3 is now available, with improvements including support for HDFS erasure coding, a preview of v2 of the YARN Timeline Service, and improvements to YARN and HDFS federation.

Hadoop is a framework used to process large data sets across clusters of computers using simple programming models.

The addition of HDFS erasure coding should make data more durable and reduce the amount of storage needed for HDFS.

The default three-times replication scheme in HDFS has a 200 per cent overhead in storage space and other resources, such as network bandwidth.

For many datasets with relatively low I/O activity, additional block replicas are rarely accessed during normal operations, but still consume the same amount of resources as the first replica. If erasure coding is used in place of replication, the storage overhead is no more than 50 per cent. HDFS erasure coding takes a RAID-like approach in which the coding is implemented by striping: data is logically stored as a sequence of blocks, and the blocks are spread across different disks. For each group of blocks, parity is calculated and stored; this is the encoding step, and any error can be recovered by back-calculating from the parity. For example, with the Reed-Solomon RS(6,3) scheme, every six data blocks are encoded into three parity blocks, so nine blocks of storage hold six blocks of data, an overhead of 50 per cent.

The new release also includes a preview of YARN Timeline Service v.2, which offers better scalability, reliability, and usability of the Timeline Service. The service is responsible for persisting application-specific information, and for persisting generic information about completed applications.

Understanding HDFS Erasure Coding

Erasure coding improves capacity utilization and performance for data storage systems

HDFS by default replicates each block three times. Replication provides a simple and robust form of redundancy to shield against most failure scenarios. It also eases scheduling compute tasks on locally stored data blocks by providing multiple replicas of each block to choose from.

However, replication is expensive: the default 3x replication scheme incurs a 200% overhead in storage space and other resources (e.g., network bandwidth when writing the data). For datasets with relatively low I/O activity, the additional block replicas are rarely accessed during normal operations, but still consume the same amount of storage space.

Therefore, a natural improvement is to use erasure coding (EC) in place of replication, which uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication.
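Erasure coding in HDFS is applied per directory through a named policy. As a rough illustration, the sketch below uses the hdfs ec admin command that ships with Hadoop 3; the /data/cold directory is just a placeholder:

    # List the erasure coding policies known to the cluster
    hdfs ec -listPolicies

    # Enable a Reed-Solomon policy (6 data blocks + 3 parity blocks, 1 MB cells)
    hdfs ec -enablePolicy -policy RS-6-3-1024k

    # Apply the policy to a directory; new files written there are striped and encoded
    hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k

    # Confirm which policy is in effect
    hdfs ec -getPolicy -path /data/cold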


Advantages of Erasure Coding in Hadoop

  • Saving substantial space – Blocks are initially triplicated; once a file is sealed (no longer modified), a background task encodes it and deletes its replicas.
  • Flexible policy – Users and admins can flag files as hot or cold; hot files remain replicated even after they are sealed.
  • Fast recovery – HDFS block errors are discovered and recovered both actively (in the background) and passively (on the read path).
  • Low overhead – The parity overhead is at most 50%.
  • Transparency/compatibility – HDFS users should be able to use all basic and advanced features on erasure-coded data, including snapshots, encryption, appending, caching and so forth.

YARN Improvements

YARN is Hadoop's framework for job scheduling and cluster resource management; the release also improves high availability for the HDFS file system.

YARN federation is used to scale single YARN clusters to tens of thousands of nodes, by federating multiple YARN sub-clusters.
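Federation is switched on per sub-cluster in yarn-site.xml. A minimal sketch, assuming the yarn.federation.enabled property described in the YARN federation documentation (the state-store and routing configuration a real deployment needs is omitted):

    <property>
      <name>yarn.federation.enabled</name>
      <value>true</value>
    </property>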

Support for YARN resource types has also been added, making it possible to schedule additional resources such as disks and GPUs for better integration with machine learning and container workloads.
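Custom resource types are declared to the scheduler in resource-types.xml. A hedged sketch based on the YARN resource model documentation; the resource name below is purely illustrative:

    <configuration>
      <property>
        <!-- Comma-separated list of extra resource types the scheduler should track -->
        <name>yarn.resource-types</name>
        <value>resource1</value>
      </property>
      <property>
        <!-- Default unit for the custom resource -->
        <name>yarn.resource-types.resource1.units</name>
        <value>G</value>
      </property>
    </configuration>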


Other improvements include the ability to federate YARN and HDFS subclusters transparently; and opportunistic container execution to improve resource utilization and increase task throughput for short-lived containers. Support for cloud storage systems such as Amazon S3  and Azure Data Lake has also been improved.

Monday 18 December 2017

Apache Software Foundation announces Apache Hadoop 3.0.0 GA!

Wow! Apache Hadoop reaches 3.0

Hadoop was born around 2007, and by 2017 it's part of life!


The Apache Software Foundation has announced version three of the open source software framework for distributed computing. 



It incorporates over 6,000 changes committed since development on the release began over a year ago.
Apache Hadoop 3.0 is the first major release since Hadoop 2 was released in 2013.

“Hadoop 3 is a major milestone for the project, and our biggest release ever,” said Andrew Wang, Apache Hadoop 3 release manager. “It represents the combined efforts of hundreds of contributors over the five years since Hadoop 2. I’m looking forward to how our users will benefit from new features in the release that improve the efficiency, scalability, and reliability of the platform.”

Apache Hadoop has become known for its ability to run and manage data applications on large hardware clusters in the Big Data ecosystem. The latest release features HDFS erasure coding, a preview of YARN Timeline Service version 2, YARN resource types, and improved capabilities and performance enhancements around cloud storage systems. It includes Hadoop Common for supporting other Hadoop modules, the Hadoop Distributed File System, Hadoop YARN and Hadoop MapReduce.

“This latest release unlocks several years of development from the Apache community,” said Chris Douglas, vice president of Apache Hadoop. “The platform continues to evolve with hardware trends and to accommodate new workloads beyond batch analytics, particularly real-time queries and long-running services. At the same time, our Open Source contributors have adapted Apache Hadoop to a wide range of deployment environments, including the Cloud.”

Apache Hadoop is widely deployed in enterprises and companies like Adobe, AWS, Apple, Cloudera, eBay, Facebook, Google, Hortonworks, IBM, Intel, LinkedIn, Microsoft, Netflix and Teradata. 

In addition, it has inspired other Hadoop related projects such as: Apache Cassandra, HBase, Hive, Spark and ZooKeeper.


All You Want to Know About Apache Hadoop 3.0.0


Apache Hadoop 3.0: Major Changes

HDFS erasure coding

Erasure coding halves the storage cost of HDFS while also improving data durability.

Minimum required Java version increased to Java 8

All Hadoop JARs are now compiled targeting a runtime version of Java 8, which means that those of you who are still using Java 7 or below should upgrade to Java 8.
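A quick way to confirm the runtime on each node before upgrading, using nothing Hadoop-specific (the JAVA_HOME path shown is illustrative):

    # Verify that the JVM is Java 8 or later
    java -version

    # Hadoop picks up its JVM from JAVA_HOME in hadoop-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64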


Early preview of YARN Timeline Service major revision


Hadoop 3.0 also brings an early preview (alpha 2) of a major revision of YARN Timeline Service: v.2, which addresses two major challenges:
  • improving scalability and reliability of Timeline Service
  • enhancing usability by introducing flows and aggregation


Shell script rewrite

The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. However, keep in mind that some changes could break existing installations. You'll find the incompatible changes in the release notes, with related discussion on HADOOP-9902.
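One visible change is that daemon management has moved into the hdfs and yarn entry points themselves. A brief sketch of the new style next to the old, per the rewritten shell scripts:

    # Hadoop 2 style (now deprecated)
    hadoop-daemon.sh start namenode

    # Hadoop 3 style
    hdfs --daemon start namenode
    yarn --daemon start resourcemanager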


MapReduce task-level native optimization


MapReduce has added support for a native implementation of the map output collector. For shuffle-intensive jobs, this can lead to a performance improvement of 30% or more.
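The native collector is opt-in per job. A hedged sketch, assuming the mapreduce.job.map.output.collector.class property from MAPREDUCE-2841 and a job that parses generic options; the jar, class, and paths are placeholders:

    # Run a job with the native map output collector enabled
    hadoop jar my-job.jar MyJob \
      -D mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator \
      /input /output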


Shaded client jars


The hadoop-client Maven artifact available in 2.x releases pulls Hadoop’s transitive dependencies onto a Hadoop application’s classpath. This can be problematic if the versions of these transitive dependencies conflict with the versions used by the application.
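Hadoop 3 addresses this with shaded client artifacts that keep those transitive dependencies off the application classpath. A minimal Maven sketch using the new hadoop-client-api and hadoop-client-runtime artifacts:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-api</artifactId>
      <version>3.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-runtime</artifactId>
      <version>3.0.0</version>
      <scope>runtime</scope>
    </dependency>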


Over the past decade, Apache Hadoop has become ubiquitous within the greater Big Data ecosystem by enabling firms to run and manage data applications on large hardware clusters in a distributed computing environment.

Default ports of multiple services have been changed.

Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000). This meant that at startup, services would sometimes fail to bind to the port due to a conflict with another application. These defaults have been moved out of the ephemeral range; for example, the NameNode web UI default port moved from 50070 to 9870.

Apache Hadoop 3.0.0 highlights:

  • YARN Timeline Service v.2 —improves the scalability, reliability, and usability of the Timeline Service;
  • YARN resource types —enables scheduling of additional resources, such as disks and GPUs, for better integration with machine learning and container workloads;
  • Federation of YARN and HDFS subclusters transparently scales Hadoop to tens of thousands of machines;
  • Opportunistic container execution improves resource utilization and increases task throughput for short-lived containers. In addition to its traditional, central scheduler, YARN also supports distributed scheduling of opportunistic containers; and 
  • Improved capabilities and performance improvements for cloud storage systems such as Amazon S3 (S3Guard), Microsoft Azure Data Lake, and Aliyun Object Storage System.

Hadoop 3.0.0 Release Details:


After four alpha releases and one beta release, 3.0.0 is generally available. 3.0.0 consists of 302 bug fixes, improvements, and other enhancements since 3.0.0-beta1. Altogether, 6,242 issues were fixed as part of the 3.0.0 release series since 2.7.0.

Here is the series of alpha and beta releases leading up to the Hadoop 3.0.0 GA:

3.0.0-alpha1: 2016-09-03
3.0.0-alpha2: 2017-01-25
3.0.0-alpha3: 2017-05-26
3.0.0-alpha4: 2017-07-07
3.0.0-beta1: 2017-10-03
3.0.0 GA: 2017-12-13


"Hadoop Is Here To Stay" —Forrester

Sunday 3 December 2017

Basic Postgres Command to list databases and list Schema

How to List Databases and Tables in PostgreSQL Using psql


First, log in to PostgreSQL using the psql command and run the following commands from the PostgreSQL command prompt.

 psql -U postgres
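If you need a specific database on a remote host, psql takes the usual connection flags; the host and database names below are placeholders:

    psql -U postgres -h db.example.com -p 5432 -d sampledb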

PostgreSQL command line cheatsheet:
Some interesting flags (to see all, use -h):
  • -E: will describe the underlying queries of the \ commands (cool for learning!)
  • -l: psql will list all databases and then exit (useful if the user you connect with doesn't have a default database, like at AWS RDS)
Most \d commands support an additional parameter of __schema__.__name__ and accept wildcards like *.*
  • \q: Quit/Exit
  • \c __database__: Connect to a database
  • \d __table__: Show table definition including triggers
  • \dt *.*: List tables from all schemas (if *.* is omitted will only show SEARCH_PATH ones)
  • \l: List databases
  • \dn: List schemas
  • \df: List functions
  • \dv: List views
  • \df+ __function__ : Show function SQL code.
  • \x: Pretty-format query results instead of the not-so-useful ASCII tables

Listing Databases

A single Postgres server process can manage multiple databases at the same time. Each database is stored as a separate set of files in its own directory within the server’s data directory. To view all of the defined databases on the server you can use the \list meta-command or its shortcut \l.
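For example, from the psql prompt, either form prints each database with its name, owner, and encoding:

    postgres=# \list
    postgres=# \l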

Listing Tables

Once you’ve connected to a database, you will want to inspect which tables have been created there. This can be done with the \dt meta-command. However, if there are no tables you will get no output.
postgres=# \c sampledb
sampledb=# \dt
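If the database is empty, create a table first and then list it; the table below is just an illustration:

    sampledb=# CREATE TABLE employees (id serial PRIMARY KEY, name text);
    sampledb=# \dt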


More Examples Coming Soon...

Tuesday 3 January 2017

Fencing Method for ZK based HA in Hadoop

While enabling ZooKeeper quorum-based HA via the ZKFC service, you may observe that the default fencing method (i.e. sshfence) alone is not sufficient. The relevant scenarios and fencing methods are discussed below:

dfs.ha.fencing.methods - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover

The fencing methods used during a failover are configured as a carriage-return-separated list, which will be attempted in order until one indicates that fencing has succeeded. There are two methods which ship with Hadoop: shell and sshfence. For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.

sshfence - SSH to the Active NameNode and kill the process

The sshfence option SSHes to the target node and uses fuser to kill the process listening on the service’s TCP port. In order for this fencing option to work, it must be able to SSH to the target node without providing a passphrase. Thus, one must also configure the dfs.ha.fencing.ssh.private-key-files option, which is a comma-separated list of SSH private key files. For example:
    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>sshfence</value>
    </property>

    <property>
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/home/exampleuser/.ssh/id_rsa</value>
    </property>


shell - run an arbitrary shell command to fence the Active NameNode

The shell fencing method runs an arbitrary shell command. It may be configured like so:
    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
    </property>
The string between ‘(’ and ‘)’ is passed directly to a bash shell and may not include any closing parentheses.

If the shell command returns an exit code of 0, the fencing is determined to be successful. If it returns any other exit code, the fencing was not successful and the next fencing method in the list will be attempted.
Note: This fencing method does not implement any timeout. If timeouts are necessary, they should be implemented in the shell script itself (e.g. by forking a subshell to kill its parent in some number of seconds).
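As a rough sketch of that idea (the 30-second deadline and the fencing action are placeholders), the script can fork its own watchdog before doing the real work:

    #!/bin/bash
    # Fork a watchdog subshell: if this script is still running after 30s, kill it.
    # $$ inside the subshell still refers to the parent script's PID.
    ( sleep 30 && kill -9 $$ ) &
    WATCHDOG=$!

    # ... actual fencing work goes here (e.g. power-cycling the old active host) ...

    # Fencing finished in time: stop the watchdog and report success to ZKFC
    kill $WATCHDOG 2>/dev/null
    exit 0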

Scenario #1
If the NameNode process is killed or the active NameNode host is rebooted:
  The passive node, via the ZKFC process, is able to transition the passive NameNode's state to 'active' (using sshfence).
  SSH works fine, which allows the active NameNode to be fenced.

Scenario #2
If the NameNode is shut down, a network cable is unplugged, or there is a major hardware/power failure:
    SSH will not succeed, so HA will never take place.
    SSH fails, so fencing of the active NameNode fails.
  
Best Approach
To avoid Scenario #2, configure both fencing methods; ZKFC will try them in order, and if the first one fails, the second one will be tried.
If SSH fencing fails, then shell(/bin/true) always returns success, so the active NameNode is considered fenced and failover can proceed.

Example:

<value>sshfence
       shell(/bin/true)
</value>

Property tags:

<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
         shell(/bin/true)
  </value>
</property>

This approach was tested and worked fine, especially when the primary NameNode host went down due to a major hardware or power failure, as mentioned above.