Big Data Processing on Hadoop: Internal Mechanics and Data Pipeline Architecture
Hadoop processes large datasets through a series of well-orchestrated steps. Here’s a detailed step-by-step explanation of how Hadoop works internally to handle big data:
1. Data Ingestion
Methods of Data Ingestion
- Flume: Used for ingesting large amounts of streaming data into HDFS.
- Sqoop: Used for importing data from relational databases into HDFS.
- Direct File Upload: Data can be uploaded directly to HDFS using Hadoop shell commands, the web interface, or the HDFS Java API (see the sketch below).
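For the direct-upload path, a file can also be written to HDFS programmatically. The following is a minimal sketch using the standard HDFS Java FileSystem API; the local and HDFS paths are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        // Connects to the default filesystem (HDFS in a typical cluster setup).
        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into HDFS; both paths are example placeholders.
        fs.copyFromLocalFile(new Path("/tmp/events.log"), new Path("/data/raw/events.log"));
        fs.close();
    }
}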
2. Data Storage (HDFS)
HDFS is designed to store very large datasets reliably and to stream those datasets at high bandwidth to user applications.
HDFS Components
- NameNode:
  - Manages the filesystem namespace and metadata.
  - Keeps track of files, directories, and blocks in HDFS.
- DataNodes:
  - Store the actual data blocks.
  - Regularly report to the NameNode with block information.
Data Storage Mechanism
- Data Splitting: When a file is ingested into HDFS, it is split into blocks (default 128 MB per block).
- Block Replication: Each block is replicated across multiple DataNodes (default replication factor is 3) for fault tolerance.
- Metadata Storage: The NameNode keeps metadata about the blocks and their locations; the sketch after this list shows how that information can be queried.
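As an illustration of how splitting, replication, and metadata fit together, the sketch below (hypothetical file path, standard HDFS Java API) asks the NameNode for a file's block size, replication factor, and the DataNodes holding each block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/raw/events.log");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());    // e.g. 134217728 (128 MB)
        System.out.println("Replication: " + status.getReplication());  // e.g. 3
        // Each BlockLocation lists the DataNodes that hold replicas of one block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}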
3. Data Processing (MapReduce)
MapReduce is a programming model and an associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It was introduced by Google and is the core processing engine of Apache Hadoop.
Key Concepts
- Map: The Map function takes a set of input key/value pairs and produces a set of intermediate key/value pairs: map(k1, v1) → list(k2, v2).
- Reduce: The Reduce function takes an intermediate key and the list of values associated with it and merges them to produce the final result: reduce(k2, list(v2)) → list(v3).
MapReduce Components
- JobTracker (Master Node):
  - Manages resources and job scheduling.
  - Coordinates the execution of MapReduce jobs.
- TaskTracker (Slave Node):
  - Executes individual tasks as directed by the JobTracker.
Note: JobTracker and TaskTracker are the Hadoop 1.x (MRv1) daemons; in Hadoop 2 and later their roles are taken over by YARN's ResourceManager, ApplicationMaster, and NodeManager (see section 4), while the MapReduce programming model itself is unchanged.
MapReduce Processing Steps
- Job Submission:
  - A client submits a job (a set of MapReduce tasks) to the JobTracker.
- Input Splits:
  - The input data is split into logical input splits, which are assigned to mappers.
- Mapping Phase:
  - Map Task: The map function processes the input split and outputs key-value pairs.
  - Intermediate data is stored on local disks.
- Shuffling and Sorting Phase:
  - The system sorts the intermediate data and transfers it from the map outputs to the reducers.
- Reducing Phase:
  - Reduce Task: The reduce function processes the sorted key-value pairs and outputs the final result.
  - The output is stored in HDFS.
Scenario: Word Count
One of the classic examples to explain MapReduce is the “Word Count” problem. Let’s say you have a large number of documents, and you want to count the number of occurrences of each word across all documents.
Steps in MapReduce
Step 1: Splitting
The input data is split into fixed-size pieces called “splits” or “chunks”. Each chunk will be processed independently.
Step 2: Mapping
The Map function processes each chunk of data and produces intermediate key/value pairs. For the Word Count example, the Map function will output each word as a key and the number 1 as the value.
Step 3: Shuffling and Sorting
The framework sorts and transfers the intermediate data to the reducers. All values associated with the same key are sent to the same reducer.
Step 4: Reducing
The Reduce function processes each key and its associated values, combining them to produce the final result. For the Word Count example, the Reduce function will sum the values for each word.
Step 5: Output
The final output is written to the output files.
Detailed Example: Word Count
Input Data
Assume we have three documents:
- Document 1: “Hello world”
- Document 2: “Hello Hadoop”
- Document 3: “Hello world Hello”
Step-by-Step Execution
1. Map Phase
Each document is split into words, and each word is emitted as a key with a value of 1.
| Input Split (Document) | Map Output (Intermediate Key/Value Pairs) |
|---|---|
| Document 1 | (“Hello”, 1), (“world”, 1) |
| Document 2 | (“Hello”, 1), (“Hadoop”, 1) |
| Document 3 | (“Hello”, 1), (“world”, 1), (“Hello”, 1) |
2. Shuffle and Sort Phase
The framework sorts the intermediate data and groups all values associated with the same key.
| Intermediate Key | Grouped Values |
|---|---|
| “Hello” | [1, 1, 1, 1] |
| “world” | [1, 1] |
| “Hadoop” | [1] |
3. Reduce Phase
The Reduce function sums the values for each key to get the final word count.
| Intermediate Key | Final Output (Key/Value) |
|---|---|
| “Hello” | (“Hello”, 4) |
| “world” | (“world”, 2) |
| “Hadoop” | (“Hadoop”, 1) |
MapReduce Code Example (Hadoop)
Mapper Code (Java)
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the input line on whitespace and emit (word, 1) for every token.
        String[] words = value.toString().split("\\s+");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}
Reducer Code (Java)
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum all of the 1s emitted for this word and write the total.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Driver Code (Java)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // combiner pre-aggregates map output locally
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Execution
- Compile the code:
javac -classpath $(hadoop classpath) -d wordcount_classes WordCountMapper.java WordCountReducer.java WordCount.java
- Create a JAR file:
jar -cvf wordcount.jar -C wordcount_classes/ .
- Run the job:
hadoop jar wordcount.jar WordCount /input/path /output/path
MapReduce is a powerful model for processing large datasets in parallel across a cluster. The Word Count example demonstrates how the Map and Reduce phases work together to transform and aggregate data. By understanding this fundamental example, you can apply MapReduce concepts to solve more complex big data problems.
4. Resource Management (YARN)
YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop.
YARN Components
- ResourceManager:
  - Allocates cluster resources to various applications.
  - Maintains the list of available resources and applications.
- NodeManager:
  - Monitors resource usage (CPU, memory, disk) of individual nodes.
  - Reports to the ResourceManager.
- ApplicationMaster:
  - Manages the lifecycle of a specific application.
  - Negotiates resources from the ResourceManager and works with the NodeManager to execute tasks.
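As a small illustration of the ResourceManager's view of the cluster, the sketch below uses the YarnClient API to list the applications YARN currently knows about. It assumes a reachable ResourceManager configured through yarn-site.xml on the classpath.

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml for the ResourceManager address
        yarnClient.start();
        // Ask the ResourceManager for all known applications and print their state.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}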
5. Data Analysis (Using Hive, Pig, Spark)
Hive
- Data Warehousing: Allows querying and managing large datasets residing in distributed storage.
- HiveQL: SQL-like language for querying data.
Pig
- Data Flow Language: Uses Pig Latin for expressing data analysis programs.
- Execution Engine: Converts Pig Latin scripts into MapReduce jobs.
Spark
- In-Memory Processing: Offers in-memory computation for faster data processing.
- Rich API: Provides APIs in Java, Scala, Python, and R for different operations.
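To make the contrast with MapReduce concrete, here is a minimal sketch of the same word count written against Spark's Java RDD API (Spark 2.x style). The input and output paths are passed as arguments and are assumed to be HDFS directories.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark word count");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile(args[0]);                         // e.g. an HDFS input path
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // split lines into words
                .mapToPair(word -> new Tuple2<>(word, 1))                       // emit (word, 1) pairs
                .reduceByKey(Integer::sum);                                     // sum counts per word
        counts.saveAsTextFile(args[1]);                                       // e.g. an HDFS output path
        sc.stop();
    }
}

Because intermediate datasets can be kept in memory, multi-stage or iterative jobs avoid the repeated HDFS writes that a chain of MapReduce jobs would incur.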
6. Data Export (Sqoop, Flume)
- Sqoop: Exports data from HDFS back to relational databases.
- Flume: Primarily an ingestion tool; its sinks can also deliver collected log data to stores other than HDFS (for example, HBase), so it is sometimes used to move data onward in a pipeline.
Detailed Data Pipeline Workflow
Step-by-Step Data Pipeline
- Data Ingestion
  - Use Flume to collect log data and store it in HDFS.
  - Use Sqoop to import relational database tables into HDFS.
- Data Storage
  - The ingested data is stored as HDFS blocks across multiple DataNodes.
  - The NameNode maintains metadata about the stored data.
- Data Processing
  - Submit a MapReduce job to process the stored data.
  - The JobTracker divides the job into tasks and assigns them to TaskTrackers.
  - The mapping phase processes the input splits, generating intermediate key-value pairs.
  - Intermediate data is shuffled and sorted before being passed to the reducers.
  - The reducing phase processes the intermediate data and stores the final output in HDFS.
- Resource Management
  - YARN manages the resources required for executing the MapReduce job.
  - The ResourceManager allocates resources, while the NodeManager monitors resource usage.
- Data Analysis
  - Use Hive to run SQL-like queries on the processed data.
  - Use Pig to write data flow scripts for further analysis.
  - Use Spark for in-memory data processing and advanced analytics.
- Data Export
  - Use Sqoop to export the processed data from HDFS to a relational database for reporting.
  - Use Flume to deliver collected log data to other downstream stores (for example, HBase) where needed.
Hadoop’s architecture enables the processing of large datasets through distributed storage and parallel processing. By using HDFS for storage, MapReduce for processing, and YARN for resource management, Hadoop provides a robust and scalable framework for big data applications. The integration with tools like Hive, Pig, and Spark further enhances its capability for data analysis and processing.
Summary:
- HDFS (Hadoop Distributed File System): A distributed file system for storing data.
- MapReduce: A programming model for processing data in parallel.
- YARN (Yet Another Resource Negotiator): A resource management layer for managing resources and scheduling jobs.
Architecture:
- NameNode: The master node that manages HDFS metadata.
- DataNode: Slave nodes that store data in HDFS.
- ResourceManager: The master node that manages resources and schedules jobs in YARN.
- NodeManager: Slave nodes that manage resources and execute jobs in YARN.
- ApplicationMaster: A component that manages the execution of a job.
Running Jobs:
- Submit a job: Use the hadoop jar command to submit a job to YARN.
- ResourceManager schedules the job: The ResourceManager schedules the job and allocates resources.
- ApplicationMaster manages the job: The ApplicationMaster manages the execution of the job.
- MapReduce executes the job: MapReduce executes the job in parallel across the cluster.
- Output is written to HDFS: The output of the job is written to HDFS.
Steps to run a job:
- Prepare the input data and store it in HDFS.
- Write a MapReduce program to process the data.
- Package the program into a JAR file.
- Submit the job to YARN using the hadoop jar command.
- Monitor the job’s progress using the YARN web interface.
- Retrieve the output from HDFS.
Example command to run a job:
hadoop jar myjob.jar input output
This command submits a job to YARN, which executes the myjob.jar program on the input data and writes the output to HDFS.