Hadoop Tutorial: Components, Architecture, Data Processing

Introduction to Hadoop

Hadoop is an open-source framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Key Components of Hadoop

  1. Hadoop Distributed File System (HDFS)
    • A distributed file system designed to run on commodity hardware.
    • Highly fault-tolerant and designed to be deployed on low-cost hardware.
    • Provides high throughput access to application data and is suitable for applications that have large datasets.
  2. MapReduce
    • A programming model for processing large datasets with a parallel, distributed algorithm on a cluster.
    • Composed of two main functions:
      • Map: Takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
      • Reduce: Takes the output from a map as input and combines those data tuples into a smaller set of tuples.
  3. Hadoop Common
    • The common utilities that support the other Hadoop modules.
    • Includes libraries and utilities needed by other Hadoop modules.
  4. YARN (Yet Another Resource Negotiator)
    • A resource-management platform responsible for managing computing resources in clusters and using them for scheduling users’ applications.
    • Decouples resource management and job scheduling/monitoring functions into separate daemons.

Hadoop Ecosystem Components

  1. Hive
    • A data warehouse infrastructure built on top of Hadoop.
    • Provides data summarization, query, and analysis.
    • Uses HiveQL, a SQL-like language.
  2. Pig
    • A high-level platform for creating MapReduce programs used with Hadoop.
    • Uses a language called Pig Latin, which abstracts the programming from the Java MapReduce idiom.
  3. HBase
    • A distributed, scalable, big data store.
    • Runs on top of HDFS.
    • Provides random, real-time read/write access to big data.
  4. Sqoop
    • A tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
  5. Flume
    • A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
  6. Oozie
    • A workflow scheduler system to manage Hadoop jobs.
  7. Zookeeper
    • A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
  8. Spark
    • A fast and general-purpose cluster-computing system.
    • Provides in-memory processing to speed up big data analysis.

Hadoop Architecture

Hadoop follows a Master-Slave architecture for both data storage and data processing.

  1. HDFS Architecture:
    • NameNode (Master):
      • Manages the file system namespace and regulates access to files by clients.
      • Maintains the file system tree and metadata for all the files and directories in the tree.
    • DataNode (Slave):
      • Responsible for storing the actual data in HDFS.
      • Serves read and write requests from the file system’s clients.
  2. MapReduce Architecture (Hadoop 1.x / MRv1):
    • JobTracker (Master):
      • Manages resources and job scheduling.
    • TaskTracker (Slave):
      • Executes the tasks as directed by the JobTracker.
    • In Hadoop 2.x and later, these daemons are replaced by YARN (next item), with a per-job ApplicationMaster scheduling MapReduce tasks.
  3. YARN Architecture:
    • ResourceManager (Master):
      • Manages resources and scheduling.
    • NodeManager (Slave):
      • Manages resources on a single node.
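
On a running cluster, this mapping of daemons to master and slave roles can be checked directly. A small sketch, assuming the standard Hadoop scripts are installed and on the PATH:

# On the master node, jps typically shows the master daemons:
jps        # NameNode, SecondaryNameNode, ResourceManager (plus JobHistoryServer if started)

# On a worker node, jps typically shows the slave daemons:
jps        # DataNode, NodeManager

# Cluster-wide views:
hdfs dfsadmin -report   # DataNodes known to the NameNode, capacity, and replication status
yarn node -list         # NodeManagers registered with the ResourceManager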

Hadoop Workflow

  1. Data Ingestion:
    • Data can be ingested into HDFS using tools like Flume, Sqoop, or by directly placing data into HDFS.
  2. Data Storage:
    • Data is split into blocks and distributed across the cluster in HDFS.
    • Replication ensures data reliability and fault tolerance.
  3. Data Processing:
    • Using MapReduce or other processing engines like Spark, data is processed in parallel across the cluster.
    • Intermediate data is stored in local disks and final output is stored back in HDFS.
  4. Data Analysis:
    • Tools like Hive, Pig, or custom MapReduce jobs are used to query and analyze data.
    • Results can be stored back in HDFS or exported to other systems.

Getting Started with Hadoop

  1. Installation:
    • Hadoop can be installed in different modes:
      • Standalone Mode: Default mode, mainly for debugging and development.
      • Pseudo-Distributed Mode: All Hadoop services run on a single node.
      • Fully Distributed Mode: Hadoop runs on multiple nodes, suitable for production.
  2. Configuration:
    • Key configuration files include core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
    • These files must be configured correctly for Hadoop to run; a minimal example is sketched just after this list.
  3. Running Hadoop Jobs:
    • Hadoop jobs can be written in Java, Python (with Hadoop streaming), or other languages.
    • Jobs are submitted using the Hadoop command line or APIs.
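
As an illustration of item 2 above, a minimal pseudo-distributed configuration might look like the snippets below; the hostname, port, and replication factor are assumptions to adapt to your own cluster:

<!-- core-site.xml: where clients find the HDFS NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node can store only one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: run MapReduce jobs on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml: enable the shuffle service that MapReduce needs -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>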

Example of Running a Hadoop MapReduce Job

WordCount Example:

Mapper (WordCountMapper.java):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split each input line on whitespace and emit (word, 1) for every token.
        String[] words = value.toString().split("\\s+");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}

Reducer (WordCountReducer.java):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum all counts emitted for this word and write the total.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Driver (WordCount.java):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        // The reducer is reused as a combiner because summing counts is associative.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running the Job: hadoop jar wordcount.jar WordCount /input_dir /output_dir
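
After the job finishes, the output can be inspected directly in HDFS. A quick sketch, assuming the paths used in the command above:

hdfs dfs -ls /output_dir                  # shows a _SUCCESS marker plus one part file per reducer
hdfs dfs -cat /output_dir/part-r-00000    # word<TAB>count lines produced by the reducer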

Hadoop is a powerful framework for processing large datasets in a distributed environment. Understanding its core components, architecture, and how to run jobs is essential for leveraging its full potential in big data applications.

Big Data Processing on Hadoop: Internal Mechanics and Data Pipeline Architecture

Hadoop processes large datasets through a series of well-orchestrated steps. Here’s a detailed step-by-step explanation of how Hadoop works internally to handle big data:

1. Data Ingestion

Methods of Data Ingestion

  • Flume: Used for ingesting large amounts of streaming data into HDFS.
  • Sqoop: Used for importing data from relational databases into HDFS.
  • Direct File Upload: Data can be uploaded directly to HDFS using Hadoop commands or a web interface.
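
A few illustrative commands for these ingestion paths; the JDBC URL, credentials, table name, and HDFS paths are placeholders, not real systems:

# Direct file upload with the HDFS shell
hdfs dfs -mkdir -p /data/raw
hdfs dfs -put access.log /data/raw/

# Sqoop import of a relational table into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl --password-file /user/etl/.dbpass \
  --table orders \
  --target-dir /data/orders

Flume, by contrast, is driven by an agent configuration file that wires sources, channels, and sinks together rather than by a single command.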

2. Data Storage (HDFS)

HDFS is designed to store very large datasets reliably and to stream those datasets at high bandwidth to user applications.

HDFS Components

  • NameNode:
    • Manages the filesystem namespace and metadata.
    • Keeps track of files, directories, and blocks in the HDFS.
  • DataNodes:
    • Store the actual data blocks.
    • Regularly report to the NameNode with block information.

Data Storage Mechanism

  1. Data Splitting: When a file is ingested into HDFS, it is split into blocks (128 MB per block by default in Hadoop 2.x and later).
  2. Block Replication: Each block is replicated across multiple DataNodes (default replication factor is 3) for fault tolerance.
  3. Metadata Storage: The NameNode keeps metadata about the blocks and their locations; the commands below show how to inspect them.
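
The splitting, replication, and metadata described above can be observed with standard HDFS tools; for example (paths are placeholders):

hdfs fsck /data/raw/access.log -files -blocks -locations   # blocks, replica count, and the DataNodes holding each replica
hdfs dfs -setrep -w 2 /data/raw/access.log                 # change the replication factor of a single file
hdfs getconf -confKey dfs.blocksize                        # effective block size in bytes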

3. Data Processing (MapReduce)

MapReduce is a programming model and an associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It was introduced by Google and is widely used in big data processing frameworks such as Apache Hadoop.

Key Concepts

Map: The Map function takes a set of input key/value pairs and produces a set of intermediate key/value pairs.

Reduce: The Reduce function takes the intermediate key/value pairs produced by the Map function and merges them to produce the final result.
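
Conceptually, the two functions can be summarized by the informal type signatures commonly used in the Hadoop documentation:

map:    (k1, v1)        -> list(k2, v2)
reduce: (k2, list(v2))  -> list(k3, v3)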

MapReduce Components (Hadoop 1.x / MRv1)

  • JobTracker (Master Node):
    • Manages resources and job scheduling.
    • Coordinates the execution of MapReduce jobs.
  • TaskTracker (Slave Node):
    • Executes individual tasks as directed by the JobTracker.
  • In Hadoop 2.x and later, these responsibilities move to YARN (ResourceManager, NodeManager, and a per-job ApplicationMaster).

MapReduce Processing Steps

  1. Job Submission:
    • A client submits a job (a set of MapReduce tasks) to the JobTracker.
  2. Input Splits:
    • The input data is split into logical input splits, which are assigned to mappers.
  3. Mapping Phase:
    • Map Task: The map function processes the input split and outputs key-value pairs.
    • Intermediate data is stored on local disks.
  4. Shuffling and Sorting Phase:
    • The system sorts the intermediate data and transfers it from the map outputs to the reducers.
  5. Reducing Phase:
    • Reduce Task: The reduce function processes the sorted key-value pairs and outputs the final result.
    • The output is stored in HDFS.

Scenario: Word Count

One of the classic examples to explain MapReduce is the “Word Count” problem. Let’s say you have a large number of documents, and you want to count the number of occurrences of each word across all documents.

Steps in MapReduce

Step 1: Splitting

The input data is split into fixed-size pieces called “splits” or “chunks”. Each chunk will be processed independently.

Step 2: Mapping

The Map function processes each chunk of data and produces intermediate key/value pairs. For the Word Count example, the Map function will output each word as a key and the number 1 as the value.

Step 3: Shuffling and Sorting

The framework sorts and transfers the intermediate data to the reducers. All values associated with the same key are sent to the same reducer.

Step 4: Reducing

The Reduce function processes each key and its associated values, combining them to produce the final result. For the Word Count example, the Reduce function will sum the values for each word.

Step 5: Output

The final output is written to the output files.

Detailed Example: Word Count

Input Data

Assume we have three documents:

  1. Document 1: “Hello world”
  2. Document 2: “Hello Hadoop”
  3. Document 3: “Hello world Hello”

Step-by-Step Execution

1. Map Phase

Each document is split into words, and each word is emitted as a key with a value of 1.

Input Split (Document) | Map Output (Intermediate Key/Value Pairs)
Document 1             | (“Hello”, 1), (“world”, 1)
Document 2             | (“Hello”, 1), (“Hadoop”, 1)
Document 3             | (“Hello”, 1), (“world”, 1), (“Hello”, 1)

2. Shuffle and Sort Phase

The framework sorts the intermediate data and groups all values associated with the same key.

Intermediate Key | Grouped Values
“Hello”          | [1, 1, 1, 1]
“world”          | [1, 1]
“Hadoop”         | [1]

3. Reduce Phase

The Reduce function sums the values for each key to get the final word count.

Intermediate Key | Final Output (Key/Value)
“Hello”          | (“Hello”, 4)
“world”          | (“world”, 2)
“Hadoop”         | (“Hadoop”, 1)

MapReduce Code Example (Hadoop)

The Mapper, Reducer, and Driver classes for this scenario are identical to the WordCount listings shown earlier (WordCountMapper.java, WordCountReducer.java, and WordCount.java).

Execution

  1. Compile the code (the output directory must exist before javac -d is used): mkdir -p wordcount_classes && javac -classpath $(hadoop classpath) -d wordcount_classes WordCountMapper.java WordCountReducer.java WordCount.java
  2. Create a JAR file: jar -cvf wordcount.jar -C wordcount_classes/ .
  3. Run the job: hadoop jar wordcount.jar WordCount /input/path /output/path

MapReduce is a powerful model for processing large datasets in parallel across a cluster. The Word Count example demonstrates how the Map and Reduce phases work together to transform and aggregate data. By understanding this fundamental example, you can apply MapReduce concepts to solve more complex big data problems.

4. Resource Management (YARN)

YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop.

YARN Components

  • ResourceManager:
    • Allocates cluster resources to various applications.
    • Maintains the list of available resources and applications.
  • NodeManager:
    • Monitors resource usage (CPU, memory, disk) of individual nodes.
    • Reports to the ResourceManager.
  • ApplicationMaster:
    • Manages the lifecycle of a specific application.
    • Negotiates resources from the ResourceManager and works with the NodeManager to execute tasks.
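
These components can be observed from the command line while a job runs; for instance (the application ID is a placeholder, and log retrieval assumes log aggregation is enabled):

yarn application -list                                     # applications the ResourceManager is currently tracking
yarn application -status application_1700000000000_0001    # state, progress, and the host running the ApplicationMaster
yarn logs -applicationId application_1700000000000_0001    # aggregated container logs after the application finishes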

5. Data Analysis (Using Hive, Pig, Spark)

Hive

  • Data Warehousing: Allows querying and managing large datasets residing in distributed storage.
  • HiveQL: SQL-like language for querying data.

Pig

  • Data Flow Language: Uses Pig Latin for expressing data analysis programs.
  • Execution Engine: Converts Pig Latin scripts into MapReduce jobs.

Spark

  • In-Memory Processing: Offers in-memory computation for faster data processing.
  • Rich API: Provides APIs in Java, Scala, Python, and R for different operations.
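
As a point of comparison with the MapReduce WordCount above, here is a minimal sketch of the same job using Spark's Java API (assuming Spark 2.x or later; input and output paths are placeholders passed on the command line):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark word count");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile(args[0]);               // e.g. an HDFS input directory
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);                          // same aggregation the WordCount reducer performs
        counts.saveAsTextFile(args[1]);                             // e.g. an HDFS output directory
        sc.stop();
    }
}

Submitted with spark-submit, the intermediate data stays in memory between stages rather than being written to local disk, which is where most of the speedup over disk-based MapReduce comes from.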

6. Data Export (Sqoop, Flume)

  • Sqoop: Transfers data from HDFS to relational databases (export), as well as in the other direction (import).
  • Flume: Primarily an ingestion tool; it can deliver collected data to sinks such as HBase, but it is not normally used to export data out of HDFS.

Detailed Data Pipeline Workflow

Step-by-Step Data Pipeline

  1. Data Ingestion
    • Use Flume to collect log data and store it in HDFS.
    • Use Sqoop to import relational database tables into HDFS.
  2. Data Storage
    • The ingested data is stored as HDFS blocks across multiple DataNodes.
    • The NameNode maintains metadata about the stored data.
  3. Data Processing
    • Submit a MapReduce job to process the stored data.
    • The job is divided into map and reduce tasks (by the JobTracker in Hadoop 1.x, or by the MapReduce ApplicationMaster under YARN) and assigned to worker nodes.
    • The mapping phase processes the input splits, generating intermediate key-value pairs.
    • Intermediate data is shuffled and sorted before being passed to the reducers.
    • The reducing phase processes the intermediate data and stores the final output in HDFS.
  4. Resource Management
    • YARN manages the resources required for executing the MapReduce job.
    • The ResourceManager allocates resources, while the NodeManager monitors resource usage.
  5. Data Analysis
    • Use Hive to run SQL-like queries on the processed data.
    • Use Pig to write data flow scripts for further analysis.
    • Use Spark for in-memory data processing and advanced analytics.
  6. Data Export
    • Use Sqoop to export the processed data from HDFS to a relational database for reporting (a sketch follows this list).
    • Route processed log data onward to other storage systems as needed (Flume itself is an ingestion tool; bulk copies out of HDFS are usually done with Sqoop or DistCp).
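
As referenced in step 6, a Sqoop export of tab-separated MapReduce output might look like the following sketch; the connection string, credentials, table, and paths are placeholders:

sqoop export \
  --connect jdbc:mysql://dbhost/reporting \
  --username etl --password-file /user/etl/.dbpass \
  --table word_counts \
  --export-dir /user/hadoop/output \
  --input-fields-terminated-by '\t'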

Hadoop’s architecture enables the processing of large datasets through distributed storage and parallel processing. By using HDFS for storage, MapReduce for processing, and YARN for resource management, Hadoop provides a robust and scalable framework for big data applications. The integration with tools like Hive, Pig, and Spark further enhances its capability for data analysis and processing.

Summary:

  1. HDFS (Hadoop Distributed File System): A distributed file system for storing data.
  2. MapReduce: A programming model for processing data in parallel.
  3. YARN (Yet Another Resource Negotiator): A resource management layer for managing resources and scheduling jobs.

Architecture:

  1. NameNode: The master node that manages HDFS metadata.
  2. DataNode: Slave nodes that store data in HDFS.
  3. ResourceManager: The master node that manages resources and schedules jobs in YARN.
  4. NodeManager: Slave nodes that manage resources and execute jobs in YARN.
  5. ApplicationMaster: A component that manages the execution of a job.

Running Jobs:

  1. Submit a job: Use the hadoop jar command to submit a job to YARN.
  2. ResourceManager schedules the job: The ResourceManager schedules the job and allocates resources.
  3. ApplicationMaster manages the job: The ApplicationMaster manages the execution of the job.
  4. MapReduce executes the job: MapReduce executes the job in parallel across the cluster.
  5. Output is written to HDFS: The output of the job is written to HDFS.

Steps to run a job:

  1. Prepare the input data and store it in HDFS.
  2. Write a MapReduce program to process the data.
  3. Package the program into a JAR file.
  4. Submit the job to YARN using the hadoop jar command.
  5. Monitor the job’s progress using the YARN web interface.
  6. Retrieve the output from HDFS.

Example command to run a job:

hadoop jar myjob.jar input output

This command submits a job to YARN, which executes the myjob.jar program on the input data and writes the output to HDFS.
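
Putting the steps together, an end-to-end run of the WordCount example built earlier might look like this (all paths and the JAR name are placeholders; here the main class is named explicitly on the command line):

hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put documents.txt /user/hadoop/input/
hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output
yarn application -list                               # monitor progress (also visible in the ResourceManager web UI)
hdfs dfs -cat /user/hadoop/output/part-r-00000       # retrieve the results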