Hi Friends,
Here I am sharing an overview of the "Hadoop" framework and the related projects that help improve performance and maintain scalability for Big Data. In a future post, I'll add more detail on how these pieces are useful for mobile end-clients and mobile computing. I have also described here how an Android smartphone can work with Hadoop concepts in practice.
Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
Hadoop is a top-level Apache project being built and used by a global community of contributors, written in the Java programming language. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively across its businesses.
"Hadoop is a framework for running
applications on large clusters of commodity hardware. The Hadoop
framework transparently provides applications both reliability and data
motion. Hadoop implements a computational paradigm named map/reduce,
where the application is divided into many small fragments of work, each
of which may be executed or re-executed on any node in the cluster. In
addition, it provides a distributed file system that stores data on the
compute nodes, providing very high aggregate bandwidth across the
cluster. Both map/reduce and the distributed file system are designed so
that node failures are automatically handled by the framework."
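To make the map/reduce idea in that description concrete, here is a minimal word-count job sketched against the 0.20-era "mapred" Java API. The class names and input/output paths are illustrative only: the map step emits a (word, 1) pair for every token, and the reduce step sums those counts per word.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map step: split each input line into words and emit (word, 1).
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts emitted for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory in HDFS
    JobClient.runJob(conf);  // submits the job to the JobTracker and waits for completion
  }
}
```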
Architecture
Hadoop consists of the Hadoop Common package, which provides access to the filesystems supported by Hadoop. The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section that includes projects from the Hadoop community.
For effective scheduling of work, every Hadoop-compatible filesystem should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to run work on the node where the data is, and, failing that, on the same rack/switch, thereby reducing backbone traffic. The Hadoop Distributed File System (HDFS) uses this when replicating data, to try to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure, so that even if these events occur, the data may still be readable.
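As a rough illustration of that location awareness, the sketch below shows the kind of pluggable rack-mapping class the 0.20-era Hadoop can consult when it resolves node names to racks. The interface is real, but the host-name prefixes and rack labels are invented for the example, and such a class would still have to be wired in through the cluster configuration.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.net.DNSToSwitchMapping;

// Illustrative-only rack mapping: Hadoop resolves node names to rack paths
// through a pluggable DNSToSwitchMapping implementation (often a script or a
// table-backed class). The prefixes and rack labels below are invented.
public class PrefixRackMapping implements DNSToSwitchMapping {

  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String name : names) {
      if (name.startsWith("worker-a")) {
        racks.add("/rack1");
      } else if (name.startsWith("worker-b")) {
        racks.add("/rack2");
      } else {
        racks.add("/default-rack");  // unknown hosts fall back to a default rack
      }
    }
    return racks;
  }
}
```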
- A small Hadoop cluster will include a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes, and compute-only worker nodes; these are normally only used in non-standard applications. Hadoop requires JRE 1.6 or higher.
- In a larger cluster, the HDFS is managed through a dedicated NameNode server that hosts the filesystem index, and a secondary NameNode that can generate snapshots of the NameNode's memory structures, thus preventing filesystem corruption and reducing loss of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate filesystem, the NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the filesystem-specific equivalent. A minimal client-side configuration for such a cluster is sketched below.
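The sketch referenced above is a minimal, illustrative client-side configuration for such a cluster; the host names and ports are placeholders, and the property names are the 0.20-era ones.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Minimal client-side setup: a client only needs the addresses of the two
// master daemons. Host names and ports below are placeholders.
public class ClusterClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode-host:9000");  // NameNode address
    conf.set("mapred.job.tracker", "jobtracker-host:9001");    // JobTracker address

    FileSystem fs = FileSystem.get(conf);  // connects to HDFS through the NameNode
    System.out.println("Connected; working directory is " + fs.getWorkingDirectory());
  }
}
```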
Meet Hadoop:
We live in the data age. It's not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006 and forecast a tenfold growth by 2011, to 1.8 zettabytes. A zettabyte is 10²¹ bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That's roughly the same order of magnitude as one disk drive for every person in the world.
This flood of data is coming from many sources. Consider the following:
- The New York Stock Exchange generates about one terabyte of new trade data per day.
- Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
- Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
- The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
- The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.
So there’s a lot of data out there. But you are probably wondering how it affects you.
Most of the data is locked up in the largest web properties (like search engines). The good news is that Big Data is here. The bad news is that we are struggling to store
and analyze it.
Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the other subprojects provide complementary services, or build on the core to add higher-level abstractions. The subprojects, and where they sit in the technology stack, are described briefly here:
- Core- A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
- Avro- A data serialization system for efficient, cross-language RPC, and persistent data storage. (At the time of this writing, Avro had been created only as a new subproject, and no other Hadoop subprojects were using it yet.)
- MapReduce- A distributed data processing model and execution environment that runs on large clusters of commodity machines.
- HDFS- A distributed filesystem that runs on large clusters of commodity machines (a short client read/write sketch follows this list).
- Pig- A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
- HBase- A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
- ZooKeeper- A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
- Hive- A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (and which is translated by the runtime engine to MapReduce jobs) for querying the data.
- Chukwa- A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports. (At the time of this writing, Chukwa had only recently graduated from a “contrib” module in Core to its own subproject.)
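As promised in the HDFS entry above, here is a small, illustrative client sketch that writes a file into HDFS and reads it back. The file path is a placeholder, and the cluster addresses are assumed to come from the standard configuration files.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write a small file into HDFS and read it back.
public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt");
    FSDataOutputStream out = fs.create(file, true);  // true = overwrite if present
    out.writeBytes("hello from HDFS\n");
    out.close();

    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    System.out.println(in.readLine());               // prints "hello from HDFS"
    in.close();
  }
}
```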
Some additional important Projects:
- Nutch - an effort to build an open source search engine based on Lucene and Hadoop. Also created by Doug Cutting.
- Datameer Analytics Solution (DAS) – data source integration, storage, analytics engine and visualization
- Hypertable - HBase alternative
- Apache Mahout - Machine Learning algorithms implemented on Hadoop
- Apache Cassandra - A column-oriented database that supports access from Hadoop
- HPCC - LexisNexis Risk Solutions High Performance Computing Cluster
- Sector/Sphere - Open source distributed storage and processing
JobTracker and TaskTracker: the MapReduce engine
Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled.
The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.
If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process: the JobTracker records what it is up to in the filesystem, and when it starts up it looks for any such data, so that it can restart work from where it left off.
Known limitations of this approach are:
- The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence of its actual availability.
- If one TaskTracker is very slow, it can delay the entire MapReduce job, especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes; a brief configuration sketch follows this list.
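Here is the configuration sketch mentioned in the last point: a job of this era can toggle speculative execution per task type through its JobConf. Defaults vary by version, so treat this as an illustration of where the knob lives rather than a recommendation.

```java
import org.apache.hadoop.mapred.JobConf;

// Per-job speculative-execution switches in the 0.20-era API.
public class SpeculationExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setMapSpeculativeExecution(true);      // allow slow map tasks to be duplicated
    conf.setReduceSpeculativeExecution(false);  // keep reduce tasks single-attempt
    // ... configure mapper, reducer, and paths as usual before submitting.
  }
}
```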
Scheduling
By default Hadoop uses FIFO scheduling, with five optional scheduling priorities, to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, and the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler) was added.
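For illustration, with the default FIFO scheduler a job can hint where it sits in the work queue through one of the five priority levels; the sketch below uses the 0.20-era JobConf API and an arbitrary priority.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

// The five FIFO priority levels: VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
public class PriorityExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setJobPriority(JobPriority.HIGH);  // arbitrary choice for illustration
    // ... configure mapper, reducer, and paths as usual before submitting.
  }
}
```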
Fair scheduler
The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and QoS for production jobs. The fair scheduler has three basic concepts.
- Jobs are grouped into Pools.
- Each pool is assigned a guaranteed minimum share.
- Excess capacity is split between jobs.
By default, jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.
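As a hypothetical illustration of pool assignment, some releases of the contrib fair scheduler read the pool name from a job property. The property key used below is an assumption and differs between versions, so check the scheduler documentation for your release.

```java
import org.apache.hadoop.mapred.JobConf;

// Hypothetical pool assignment via a job property; the key is assumed.
public class PoolExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.set("mapred.fairscheduler.pool", "production");  // assumed property key and pool name
  }
}
```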
Capacity scheduler
The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features which are similar to the fair scheduler.
- Jobs are submitted into queues.
- Queues are allocated a fraction of the total resource capacity.
- Free resources are allocated to queues beyond their total capacity.
- Within a queue a job with a high level of priority will have access to the queue's resources.
There is no preemption once a job is running.
Hyrax:
The goal of Hyrax is to develop a mobile cloud infrastructure that enables smartphone applications to share distributed data and computation. Hyrax allows applications to conveniently use data and execute computing jobs on networks of smartphones and on heterogeneous networks of phones and servers. Research has focused primarily on the implementation and evaluation of a mobile cloud computing infrastructure based on MapReduce. As we read earlier, there are four types of processes in a Hadoop instance: NameNode, JobTracker, DataNode, and TaskTracker. There is one NameNode and one JobTracker in a Hadoop cluster. The NameNode manages the filesystem namespace, while the JobTracker schedules jobs and coordinates sub-tasks among TaskTrackers. A DataNode instance and a TaskTracker instance both run on each worker machine. The DataNode stores and provides access to data blocks, and the TaskTracker executes the tasks assigned to it by the JobTracker. Clients access files by first requesting block locations from the NameNode and then requesting blocks directly from those locations.
Hyrax ports Hadoop to the Android platform. In Hyrax, the cluster has a single NameNode and a single JobTracker (run on a central server, as described below), while a DataNode instance and a TaskTracker instance run on each phone, in separate Android service processes within the same application. Android applications may consist of multiple processes, some of which run as background services. Since the DataNode and TaskTracker are run as Android services, they can keep running in the background while other applications are in use.
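The sketch below is a hypothetical illustration of that pattern: a Hadoop worker daemon wrapped in an Android background Service so it keeps running while other applications are in the foreground. It is not Hyrax's actual code; startWorkerDaemon() is a placeholder for the real DataNode/TaskTracker startup logic.

```java
import android.app.Service;
import android.content.Intent;
import android.os.IBinder;

// Hypothetical worker wrapper. (Devices as old as Android 1.5 would use the
// older onStart() callback instead of onStartCommand().)
public class WorkerNodeService extends Service {

  private Thread worker;

  @Override
  public int onStartCommand(Intent intent, int flags, int startId) {
    worker = new Thread(new Runnable() {
      public void run() {
        startWorkerDaemon();  // placeholder: launch the DataNode or TaskTracker here
      }
    });
    worker.start();
    return START_STICKY;  // ask Android to restart the service if it is killed
  }

  @Override
  public IBinder onBind(Intent intent) {
    return null;  // started (not bound) service
  }

  private void startWorkerDaemon() {
    // Placeholder body for illustration only.
  }
}
```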
A thread is also spawned to record information about the system load (including power level and CPU, memory, network, and disk I/O statistics) into the local file system. Within the application, a server is run to allow external scripts to control data uploading, kill the program, and check the program status. The figure below illustrates the data interactions between all the software components on each phone.
Hyrax was implemented and evaluated on a real testbed consisting of a cluster of 10 Android G1 (HTC Dream) phones and 5 HTC Magic phones, each running Android 1.5 Cupcake. Since Android did not yet support peer-to-peer networking, the phones communicate with each other on an isolated 802.11g network via a Linksys WRT54G wireless router with no firmware modifications. The NameNode and JobTracker processes run on a desktop machine connected behind this router via Ethernet. The phones are connected via USB to a controller that executes experiment scripts. These scripts are used to install Hyrax, initialize the cluster, run benchmarks, and collect and post-process data.
To determine the advantages and drawbacks of Hyrax, an application was developed on it. The Hyrax multimedia search and sharing application, HyraxTube, allows users to browse videos and images stored on a network of phones and search by time, location, and quality. Quality ratings based on sensor data are generated by periodically executing a MapReduce job. Requests are serviced by reading results generated by the MapReduce job from HDFS. The client interface is implemented as a web application so that it can be used on mobile devices and desktop machines.