Hadoop

 

Introduction

Apache Hadoop is an open-source software framework that lets you store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. This lets Hadoop handle very large volumes of data with minimal disruption from individual failures.

History

Initially, Hadoop was conceived to fix a scalability issue in Apache Nutch, an open source crawler and search engine. At that time Google had published papers about the Google File System (GFS) and MapReduce, a computational framework for parallel processing. Development started in the Apache Nutch project with a successful implementation of these papers. In 2006 the work was moved into the new Hadoop subproject. Doug Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.


Hadoop Architecture

Hadoop has a distributed master-slave architecture that consists of the following primary components:

  • Hadoop Distributed File System (HDFS), the storage unit, consisting of a master NameNode and slave DataNodes.
  • Yet Another Resource Negotiator (YARN), the processing framework: a general-purpose scheduler and resource manager, consisting of a ResourceManager and NodeManagers.
  • MapReduce, a batch-based computational engine, implemented as a YARN application.

HDFS

HDFS is the storage component of Hadoop. It’s a distributed filesystem that’s modeled after the Google File System (GFS) paper. HDFS is optimized for high throughput and works best when reading and writing large files (gigabytes and larger). To support this throughput, HDFS uses unusually large (for a filesystem) block sizes and data locality optimizations to reduce network input/output (I/O).

Scalability and availability are also key traits of HDFS, achieved in part due to data replication and fault tolerance. Hadoop 2 introduced two significant new features for HDFS—Federation and High Availability (HA):

  • NameNode: the master node in the distributed environment; it maintains the metadata for the blocks of data stored in HDFS, such as block locations and replication factors.
  • DataNode: DataNodes are the slave nodes responsible for storing data in HDFS. The NameNode manages all of the DataNodes.
  • Federation allows HDFS metadata to be shared across multiple NameNode hosts, which aids HDFS scalability and also provides data isolation, allowing different applications or teams to run their own NameNodes without fear of impacting other NameNodes on the same cluster.
  • High Availability in HDFS removes the single point of failure that existed in Hadoop 1, where a NameNode disaster would result in a cluster outage. HDFS HA also allows failover (the process by which a standby NameNode takes over work from a failed primary NameNode) to be automated; a minimal configuration sketch follows this list.
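
As a rough sketch of what HA looks like in practice, the core of an HA setup is a logical nameservice with two NameNodes declared in hdfs-site.xml. The nameservice ID (mycluster), hostnames, and ports below are illustrative placeholders, and a real deployment needs additional settings (shared edits storage, fencing, a failover proxy provider):

<!-- hdfs-site.xml: minimal HA sketch; "mycluster" and hostnames are placeholders -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>  <!-- one active NameNode, one standby -->
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>  <!-- automate failover to the standby NameNode -->
</property>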

HDFS Commands

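A quick reference of commonly used HDFS shell commands is shown below; the paths and file names are illustrative placeholders:

hdfs dfs -mkdir -p /user/hduser/input               # create a directory in HDFS
hdfs dfs -put localfile.txt /user/hduser/input      # copy a local file into HDFS
hdfs dfs -ls /user/hduser/input                     # list a directory
hdfs dfs -cat /user/hduser/input/localfile.txt      # print a file's contents
hdfs dfs -get /user/hduser/input/localfile.txt .    # copy a file back to local disk
hdfs dfs -rm -r /user/hduser/input                  # delete a directory recursively
hdfs dfsadmin -report                               # capacity and DataNode health report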

YARN

Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.

YARN’s architecture is simple because its primary role is to schedule and manage resources in a Hadoop cluster. The core components in YARN are the ResourceManager and the NodeManager. YARN separates resource management from processing components.

Cluster resource management means managing the resources (memory, CPU, and so on) of the Hadoop cluster. YARN took over this task from MapReduce, and MapReduce is now streamlined to do what it does best: data processing.

YARN has a central ResourceManager component that manages resources and allocates them to applications. Multiple applications can run on Hadoop via YARN, all sharing a common resource management layer.

  • ResourceManager: receives processing requests and passes the relevant parts of each request to the corresponding NodeManagers, where the actual processing takes place. It allocates resources to applications based on their needs.
  • NodeManager: installed on every DataNode; it is responsible for executing tasks on that node. (A few yarn commands for inspecting both components are sketched after this list.)
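
As a brief illustration of how these components are inspected in practice, the yarn command-line client queries the ResourceManager; <application-id> below is a placeholder for a real ID such as those printed by yarn application -list:

yarn node -list                             # NodeManagers registered with the ResourceManager
yarn application -list                      # applications currently submitted or running
yarn application -status <application-id>   # resource usage and state of one application
yarn logs -applicationId <application-id>   # aggregated logs for a finished application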

MapReduce

MapReduce is a batch-based, distributed computing framework modeled after Google’s paper on MapReduce. It allows you to parallelize work over a large amount of raw data. The MapReduce model simplifies parallel processing by abstracting away the complexities involved in working with distributed systems, such as computational parallelization, work distribution, and dealing with unreliable hardware and software. With this abstraction, MapReduce allows the programmer to focus on addressing business needs rather than getting tangled up in distributed system complications.

  • MapReduce consists of two distinct tasks – Map and Reduce.
  • As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
  • So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
  • The output of a Mapper or map job (key-value pairs) is input to the Reducer.
  • The reducer receives the key-value pairs from multiple map jobs.
  • Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which forms the final output; the word-count sketch below shows both phases.
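
The canonical example of this model is word counting. The sketch below uses the Hadoop MapReduce Java API; the class names (WordCount, TokenizerMapper, IntSumReducer) and the input/output paths are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: read a line of input, emit (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);  // intermediate key-value pair
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);  // final key-value pair
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job would be packaged into a jar and submitted with, for example, hadoop jar wordcount.jar WordCount /input /output; the reducer’s output files then appear under /output in HDFS.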

Hadoop distributions

Hadoop is an Apache open source project, and regular releases of the software are available for download directly from the Apache project’s website (http://hadoop.apache.org/releases.html#Download). You can either download and install Hadoop from the website or use a commercial distribution of Hadoop, which adds the benefits of enterprise administration software and a support team to consult.

Apache

Apache is the organization that maintains the core Hadoop code and distribution. The challenge with the Apache distribution has been that support is limited to the goodwill of the open source community, and there’s no guarantee that your issue will be investigated and fixed. That said, the Hadoop community is a very supportive one, and responses to problems are usually rapid.

Cloudera

CDH (Cloudera Distribution Including Apache Hadoop) is the most tenured Hadoop distribution, and it employs a large number of Hadoop (and Hadoop ecosystem) committers. Doug Cutting, who along with Mike Cafarella originally created Hadoop, is the chief architect at Cloudera. In aggregate, this means that bug fixes and feature requests have a better chance of being addressed in Cloudera than in Hadoop distributions with fewer committers.

Hortonworks

Hortonworks Data Platform (HDP) is also made up of a large number of Hadoop committers, and it offers the same advantages as Cloudera in terms of the ability to quickly address problems and feature requests in core Hadoop and its ecosystem projects. Hortonworks is also the main driver behind the next-generation YARN platform, which is a key strategic piece keeping Hadoop relevant.

Cloudera Hortonworks Merger 
On January 3, 2019, Cloudera, the enterprise data cloud company, announced the completion of its merger with Hortonworks. For two long-time rivals to combine, there must have been a strong driver that brought Cloudera and Hortonworks together.
