Tuesday, May 22, 2012

Hadoop Overview - Big Data Control through Mobile

 Hi Friends,

 Here I am sharing an overview of the "Hadoop" framework and related projects, which are needed to improve performance and maintain scalability for Big Data. In future posts, I'll cover in detail how these pieces serve mobile end-clients and mobile computing. I have also shown here how an Android smartphone can put Hadoop concepts into practice.

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
Hadoop is a top-level Apache project being built and used by a global community of contributors, written in the Java programming language. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively across its businesses. 
 "Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework."

Architecture
Hadoop consists of the Hadoop Common package, which provides access to the filesystems supported by Hadoop. The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section that includes projects from the Hadoop community.
For effective scheduling of work, every Hadoop-compatible filesystem should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is located. Hadoop applications can use this information to run work on the node where the data is and, failing that, on the same rack/switch, thereby reducing backbone traffic. The Hadoop Distributed File System (HDFS) uses this when replicating data, trying to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure, so that even if these events occur, the data may still be readable.
  • A small Hadoop cluster will include a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes, and compute-only worker nodes; these are normally only used in non-standard applications. Hadoop requires JRE 1.6 or higher. 
  • In a larger cluster, the HDFS is managed through a dedicated NameNode server to host the filesystem index, and a secondary NameNode that can generate snapshots of the namenode's memory structures, thus preventing filesystem corruption and reducing loss of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate filesystem, the NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the filesystem-specific equivalent.
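Because HDFS presents an ordinary filesystem API, client code never deals with racks or block placement directly. As a minimal sketch of reading a file through the Java FileSystem API (the NameNode URI "hdfs://namenode:9000" and the file path are placeholders for your own cluster):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.default.name points the client at the NameNode (pre-2.x property name).
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Behind this call, the client asks the NameNode for block locations,
        // then streams the blocks directly from the DataNodes holding them.
        Path file = new Path("/user/demo/input.txt");
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(fs.open(file)));
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line);
        }
        reader.close();
        fs.close();
      }
    }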

Meet Hadoop:
We live in the data age. It’s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006 and forecast a tenfold growth by 2011 to 1.8 zettabytes. A zettabyte is 10²¹ bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That’s roughly the same order of magnitude as one disk drive for every person in the world.
This flood of data is coming from many sources. Consider the following:
  • The New York Stock Exchange generates about one terabyte of new trade data per day.
  • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
  • Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
  • The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
  • The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.


So there’s a lot of data out there. But you are probably wondering how it affects you. Most of the data is locked up in the largest web properties (like search engines). The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it. 
 
Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the other subprojects provide complementary services, or build on the core to add higher-level abstractions. The subprojects, and where they sit in the technology stack, are described briefly here:
  • Core - A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
  • Avro - A data serialization system for efficient, cross-language RPC and persistent data storage. (At the time of this writing, Avro had only just been created as a new subproject, and no other Hadoop subprojects were using it yet.)
  • MapReduce - A distributed data processing model and execution environment that runs on large clusters of commodity machines.
  • HDFS - A distributed filesystem that runs on large clusters of commodity machines.
  • Pig - A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
  • HBase - A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads); see the sketch after this list.
  • ZooKeeper - A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
  • Hive - A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
  • Chukwa - A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports. (At the time of this writing, Chukwa had only recently graduated from a “contrib” module in Core to its own subproject.)
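To make the "point queries (random reads)" remark under HBase concrete, here is a hedged sketch against the HBase Java client of that era; the table "photos" and column family "meta" are assumed to exist already:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePointQuery {
      public static void main(String[] args) throws Exception {
        // Picks up the ZooKeeper quorum from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "photos");

        // Write one cell: row key -> family:qualifier -> value.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("meta"), Bytes.toBytes("owner"), Bytes.toBytes("alice"));
        table.put(put);

        // Point query: fetch the row straight back by key (a random read,
        // served from HDFS-backed storage without running a MapReduce job).
        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        byte[] owner = result.getValue(Bytes.toBytes("meta"), Bytes.toBytes("owner"));
        System.out.println("owner = " + Bytes.toString(owner));

        table.close();
      }
    }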


 
Some additional important projects:
  • Nutch - an effort to build an open source search engine based on Lucene and Hadoop. Also created by Doug Cutting. 
  • Datameer Analytics Solution (DAS) – data source integration, storage, analytics engine and visualization
  • Hypertable - HBase alternative
  • Apache Mahout - Machine Learning algorithms implemented on Hadoop
  • Apache Cassandra - A column-oriented database that supports access from Hadoop 
  • HPCC - LexisNexis Risk Solutions High Performance Computing Cluster 
  • Sector/Sphere - Open source distributed storage and processing
JobTracker and TaskTracker: the MapReduce engine
Above the file systems sits the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled.
The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.
If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process: the JobTracker records what it is up to in the filesystem, and when it starts up, it looks for any such data so that it can restart work from where it left off.
Known limitations of this approach are:
  • The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence of its actual availability.
  • If one TaskTracker is very slow, it can delay the entire MapReduce job - especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative-execution enabled, however, a single task can be executed on multiple slave nodes.
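Speculative execution is switched on or off per job. A minimal sketch using the 0.20-era JobConf setters (shown only to illustrate the knob; the rest of the job setup is elided):

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculationConfig {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Allow straggling tasks to be re-launched on other nodes; whichever
        // attempt finishes first wins, and the other attempts are killed.
        conf.setMapSpeculativeExecution(true);
        conf.setReduceSpeculativeExecution(true);
        // ... set mapper/reducer classes and paths, then submit via JobClient.
      }
    }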
Scheduling
By default Hadoop uses FIFO scheduling, with five optional scheduling priorities, to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, which added the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler).
Fair scheduler
The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and QoS for production jobs. The fair scheduler has three basic concepts.
  1. Jobs are grouped into Pools.
  2. Each pool is assigned a guaranteed minimum share.
  3. Excess capacity is split between jobs.
By default, jobs that are uncategorized go into a default pool. Pools can specify a minimum number of map slots and reduce slots, as well as a limit on the number of running jobs.
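As a sketch of how a job lands in a particular pool, a client can tag the job before submitting it. This assumes the Fair scheduler is enabled in the cluster configuration and that a pool named "production" is defined in the allocations file; both are assumptions for illustration:

    import org.apache.hadoop.mapred.JobConf;

    public class PoolAssignment {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Route this job to the "production" pool (a placeholder name);
        // jobs with no pool set fall into the default pool.
        conf.set("mapred.fairscheduler.pool", "production");
        // ... configure the job and submit via JobClient.runJob(conf).
      }
    }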
Capacity scheduler
The capacity scheduler was developed by Yahoo!. The capacity scheduler supports several features that are similar to those of the fair scheduler.
  • Jobs are submitted into queues.
  • Queues are allocated a fraction of the total resource capacity.
  • Free resources are allocated to queues beyond their total capacity.
  • Within a queue a job with a high level of priority will have access to the queue's resources.
There is no preemption once a job is running.

Hyrax:
The goal of Hyrax is to develop a mobile cloud infrastructure that enables smartphone applications with distributed data and computation. Hyrax allows applications to conveniently use data and execute computing jobs on networks of smartphones and on heterogeneous networks of phones and servers. Research has focused primarily on the implementation and evaluation of mobile cloud computing infrastructure based on MapReduce. As we read earlier, there are four types of processes in a Hadoop instance: NameNode, JobTracker, DataNode, and TaskTracker. There is one NameNode and one JobTracker in a Hadoop cluster. The NameNode manages the filesystem namespace and coordinates the DataNodes, while the JobTracker schedules jobs and coordinates sub-tasks among the TaskTrackers. A DataNode instance and a TaskTracker instance both run on each worker machine. The DataNode stores and provides access to data blocks, and the TaskTracker executes tasks assigned to it by the JobTracker. Clients access files by first requesting block locations from the NameNode and then requesting blocks directly from those locations.


Hyrax ports Hadoop to the Android platform. In Hyrax, the NameNode and JobTracker run as single instances on a central server, while a DataNode and a TaskTracker run on each phone, in separate Android service processes within the same application. Android applications may consist of multiple processes, some of which run as background services. Since the DataNode and TaskTracker are run as Android services, they can keep running in the background while other applications are in use.
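Hyrax's source code is not reproduced here, so the sketch below only illustrates the pattern just described: hosting a long-running Hadoop daemon inside an Android background service. The class name and the daemon start-up call are hypothetical.

    import android.app.Service;
    import android.content.Intent;
    import android.os.IBinder;

    public class DataNodeService extends Service {
      private Thread daemonThread;

      @Override
      public void onCreate() {
        super.onCreate();
        // Run the (ported) DataNode main loop on a background thread so the
        // phone keeps serving blocks while other apps are in the foreground.
        daemonThread = new Thread(new Runnable() {
          public void run() {
            // startDataNode();  // hypothetical: launch the ported DataNode here
          }
        });
        daemonThread.start();
      }

      @Override
      public IBinder onBind(Intent intent) {
        return null; // pure background service; no client binding needed
      }
    }

A matching service would host the TaskTracker in its own process, as described above.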
A thread is also spawned to record information about the system load (including power level and CPU, memory, network, and disk I/O statistics) into the local file system. Within the application, a server is run to allow external scripts to control data uploading, kill the program, and check the program status. The figure below illustrates the data interactions between all the software components on each phone.

Hyrax has been implemented on a real testbed consisting of a cluster of 10 Android G1 (HTC Dream) phones and 5 HTC Magic phones, each running Android 1.5 Cupcake. Since Android did not yet support peer-to-peer networking, the phones communicate with each other on an isolated 802.11g network via a Linksys WRT54G wireless router with no firmware modifications. The NameNode and JobTracker processes run on a desktop machine connected behind this router via Ethernet. The phones are connected via USB to a controller that executes experiment scripts; these scripts are used to install Hyrax, initialize the cluster, run benchmarks, and collect and post-process data.

To determine the advantages and drawbacks of Hyrax, an application was developed on top of it. The Hyrax multimedia search and sharing application, HyraxTube, allows users to browse videos and images stored on a network of phones and to search by time, location, and quality. Quality ratings based on sensor data are generated by periodically executing a MapReduce job. Requests are serviced by reading the results that the MapReduce job writes to HDFS. The client interface is implemented as a web application so that it can be used on both mobile devices and desktop machines.

 
