Tuesday, March 15, 2011

Hadoop Compute Cluster: Summary

What is Hadoop?
Hadoop is inspired by Google’s Architecture – Map Reduce and Google File System. It is a top level Apache project. It is completely written in Java.

Advantages of using Hadoop
Hadoop compute clusters are built on cheap commodity hardware. Hadoop automatically handles node failures and data replication. Hadoop is a good framework for building batch data processing system. Hadoop provides API and framework implementation for working with Map Reduce. The Map Reduce implementation is provided on top of Hadoop Job Infrastructure.
Hadoop Job infrastructure can manage and handle HUGE amounts of data in the range of peta bytes.

When is Hadoop a good choice?
Hadoop is a good choice for building batch processing systems to process huge amounts of unstructured data. Also, to use Hadoop effectively, the system should process data in parallel. Also, a definite advantage of Hadoop is that when you want to use cheap hardware and scale the cluster horizontally.

When is Hadoop NOT a good choice?
Hadoop is not a good choice for building systems that carry out intense calculations with little or no data. Also, for systems where requirements restrict that the processing cannot be easily made parallel. Also, since Hadoop is a batch processing system, there is lot of latency between request and response and so is not suitable for interactive system.

Hadoop Eco-system
Several projects are supporting Hadoop Eco-system:
·         Hadoop Common – The common utilities that support the other Hadoop sub – projects
·         HDFS – A distributed file system that provides high throughput access to application data
·         MapReduce – A software framework for distributed processing of large data sets on compute clusters.
·         Avro – A data serialization system.
·         Chukwa – A data collection system for managing large distributed systems
·         Hbase – A scalable, distributed database that supports structured data storage for large tables
·         Hive – A data warehouse  infrastructure that provides data summarization and ad hoc querying
·         Mahout – A scalable machine learning and data mining library
·         Pig – A high – level data-flow language and execution framework for parallel computation
·         ZooKeeper – A high – performance coordination service for distributed applications

Hadoop Distributed File System (HDFS):
HDFS is a file system internally used by Hadoop framework for buffering, transferring and copying data across nodes within the Hadoop cluster.

Hadoop copies each file data to multiple nodes. This allows for node failures without data loss.

Hadoop Architecture Overview

The above diagram depicts Hadoop Architecture. There is one Name node per cluster. This imposes high risk by pausing as a Single Point of Failure for hardware. This can be prevented by mounting the Name node on multiple file systems to provide data redundancy. Name node manages the file system name space and meta data.
There are lots of data nodes within a cluster. It manages data blockes which represent fragments of files on HDFS. Each block is replicated wihin data nodes (at least 3 copies), so this prevents failure.
There is exactly one Job Tracker per cluster. Clients submit job requests to job tracker. Job tracker schedules and monitors Map Reduce jobs on task trackers.
There are typically many task trackers. These are responsible for executing map reduce operations. Task trackers are also responsible for reading and writing input and output data to Map Reduce jobs.

Hadoop modes of operation
Hadoop operates in three modes of operation:
·         Stand – alone
·         Pseudo – distributed
·         Fully-distributed (Cluster)

The stand-alone and Pseudo-distributed modes of operations are development modes of operations used by development teams to carry out development of Map Reduce jobs.
The stand-alone mode of development does not use HDFS but uses local operating system files system. Hadoop and all application code runs inside a single Java process. This is on a Single machine.
The pseudo – distributed mode of Hadoop runs all the processes like Name node, Data node, Job Tracker and Task Tracker as separate processes. HDFS file system is used in this mode of operation. This is again on a single machine.
Both the above approaches are used by developers to carry out development activities.
The third mode of Hadoop operation is Fully – Distributed Cluster. HDFS file system is used. This is over a cluster of machines. Read / write processes are balanced over a cluster of nodes.  There are several daemon threads of Name node, Data node, Task Tracker and Job Tracker.

Hadoop is a leading open source framework where distributed parallel processing is required. There are many companies supporting Hadoop development including Apache, Cloudera, etc.


No comments:

Post a Comment