What is Hadoop?
Hadoop is an open-source, distributed computing framework that allows for the processing and storage of large datasets across a cluster of computers. It was created by Doug Cutting and Mike Cafarella and is now maintained by the Apache Software Foundation.
History of Hadoop
Hadoop was inspired by the Google File System (GFS) and MapReduce papers published by Google in 2003 and 2004, respectively. The first version of Hadoop, version 0.1.0, was released in April 2006. Since then, Hadoop has become one of the most popular big data processing frameworks in the world.
Benefits of Hadoop
- Scalability: Hadoop can handle large amounts of data by distributing the processing across a cluster of nodes.
- Flexibility: Hadoop can process a wide variety of data formats, including structured, semi-structured, and unstructured data.
- Cost-effective: Hadoop is open-source and can run on commodity hardware, making it a cost-effective solution for big data processing.
- Fault-tolerant: Hadoop can detect and recover from node failures, ensuring that data processing continues uninterrupted.
How Hadoop Works
Hadoop consists of two main components:
- Hadoop Distributed File System (HDFS): HDFS is a distributed storage system that stores data across a cluster of nodes. It is designed to handle large files and provides high throughput access to data.
- MapReduce: MapReduce is a programming model and software framework that allows developers to write applications that process large datasets in parallel across a cluster of nodes.
Here’s a high-level overview of how Hadoop works:
- Data Ingestion: Data is ingested into HDFS from various sources, such as log files, social media, or sensors.
- Data Processing: MapReduce programs are written to process the data in HDFS. These programs consist of two main components: mappers and reducers.
- Mapper: The mapper takes input data, processes it, and produces output data in the form of key-value pairs.
- Reducer: The reducer takes the output from the mapper, aggregates it, and produces the final output.
- Output: The final output is stored in HDFS or other storage systems.
In summary, Hadoop is a powerful big data processing framework that provides a scalable, flexible, and cost-effective solution for processing large datasets. Its ecosystem of tools and technologies adds functionality on top of the core framework, making it a popular choice for big data processing and analytics.
Key Components of Hadoop
- Hadoop Distributed File System (HDFS)
- A distributed file system designed to run on commodity hardware.
- Highly fault-tolerant and designed to be deployed on low-cost hardware.
- Provides high-throughput access to application data and is suitable for applications with large datasets (a minimal client sketch follows this list).
- MapReduce
- A programming model for processing large datasets with a parallel, distributed algorithm on a cluster.
- Composed of two main functions:
- Map: Takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
- Reduce: Takes the output from a map as input and combines those data tuples into a smaller set of tuples.
- Hadoop Common
- Provides the shared Java libraries and utilities needed by the other Hadoop modules.
- YARN (Yet Another Resource Negotiator)
- A resource-management platform responsible for managing computing resources in clusters and using them for scheduling users’ applications.
- Decouples resource management and job scheduling/monitoring functions into separate daemons.
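To make the HDFS component above concrete, the following is a minimal sketch of writing and reading a file through Hadoop's Java FileSystem API. The file path is a placeholder, and the snippet assumes core-site.xml and hdfs-site.xml are on the classpath so that FileSystem.get() points at the cluster.
HdfsExample.java:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Write a small file to HDFS (the path is a placeholder).
        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }
        // Read the same file back and print its first line.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
    }
}
Without a configured cluster, FileSystem.get() falls back to the local file system, which makes the same code convenient for local testing.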
Hadoop Ecosystem Tools and Technologies
- Hive
- A data warehouse infrastructure built on top of Hadoop.
- Provides data summarization, query, and analysis.
- Uses HiveQL, a SQL-like language.
- Pig
- A high-level platform for creating MapReduce programs used with Hadoop.
- Uses a language called Pig Latin, which abstracts the programming from the Java MapReduce idiom.
- HBase
- A distributed, scalable, big data store.
- Runs on top of HDFS.
- Provides random, real-time read/write access to big data (see the client sketch after this list).
- Sqoop
- A tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
- Flume
- A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
- Oozie
- A workflow scheduler system to manage Hadoop jobs.
- Zookeeper
- A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- Spark
- A fast and general-purpose cluster-computing system.
- Provides in-memory processing to speed up big data analysis.
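As a taste of how HBase's random read/write access looks from application code, here is a minimal sketch using the standard HBase Java client. The table name, column family, and values are hypothetical, and the table is assumed to exist already.
HBaseExample.java:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate the cluster.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row "user1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}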
Hadoop Architecture
Hadoop follows a Master-Slave architecture for both data storage and data processing.
- HDFS Architecture:
- NameNode (Master):
- Manages the file system namespace and regulates access to files by clients.
- Maintains the file system tree and metadata for all the files and directories in the tree.
- DataNode (Slave):
- Responsible for storing the actual data in HDFS.
- Serves read and write requests from the file system’s clients.
- MapReduce Architecture (Hadoop 1.x):
- JobTracker (Master):
- Manages resources and job scheduling.
- TaskTracker (Slave):
- Executes the tasks as directed by the JobTracker.
- In Hadoop 2 and later, these daemons are replaced by YARN (described next).
- YARN Architecture:
- ResourceManager (Master):
- Manages resources and scheduling.
- NodeManager (Slave):
- Manages resources on a single node.
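To illustrate the ResourceManager/NodeManager split, here is a small sketch that uses YARN's Java client API to ask the ResourceManager for a report of the running NodeManagers. It assumes yarn-site.xml on the classpath points at a running cluster.
ListYarnNodes.java:
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
public class ListYarnNodes {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();
        // Each NodeReport describes one NodeManager and the resources it offers.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId()
                    + " containers=" + node.getNumContainers()
                    + " capability=" + node.getCapability());
        }
        yarnClient.stop();
    }
}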
Hadoop Workflow
- Data Ingestion:
- Data can be ingested into HDFS using tools like Flume, Sqoop, or by directly placing data into HDFS.
- Data Storage:
- Data is split into blocks and distributed across the cluster in HDFS.
- Replication ensures data reliability and fault tolerance.
- Data Processing:
- Using MapReduce or other processing engines like Spark, data is processed in parallel across the cluster (a short Spark sketch follows this list).
- Intermediate data is stored on local disks, and the final output is written back to HDFS.
- Data Analysis:
- Tools like Hive, Pig, or custom MapReduce jobs are used to query and analyze data.
- Results can be stored back in HDFS or exported to other systems.
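For comparison with the MapReduce WordCount shown later in this article, here is a rough sketch of the same computation on Spark's Java API (assuming the Spark 2.x API); the HDFS paths are placeholders.
SparkWordCount.java:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Read lines from HDFS, split into words, count each word, write results back to HDFS.
        JavaRDD<String> lines = sc.textFile("hdfs:///input_dir");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        counts.saveAsTextFile("hdfs:///output_dir");
        sc.stop();
    }
}
A job like this is typically packaged into a jar and launched with spark-submit, reading its input from HDFS exactly as the MapReduce job does.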
Getting Started with Hadoop
- Installation:
- Hadoop can be installed in different modes:
- Standalone Mode: Default mode, mainly for debugging and development.
- Pseudo-Distributed Mode: All Hadoop services run on a single node.
- Fully Distributed Mode: Hadoop runs on multiple nodes, suitable for production.
- Configuration:
- Key configuration files include core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
- These files need to be correctly configured for Hadoop to run efficiently (a small sketch of reading these settings from Java follows this list).
- Running Hadoop Jobs:
- Hadoop jobs can be written in Java, Python (with Hadoop streaming), or other languages.
- Jobs are submitted using the Hadoop command line or APIs.
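As a quick sanity check that the configuration files mentioned above are being picked up, here is a tiny sketch that reads two standard properties through Hadoop's Configuration class; the fallback values in the second argument are only used when no cluster configuration is found.
ShowConfig.java:
import org.apache.hadoop.conf.Configuration;
public class ShowConfig {
    public static void main(String[] args) {
        // new Configuration() loads core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();
        // Add hdfs-site.xml explicitly so HDFS-specific keys like dfs.replication are visible here.
        conf.addResource("hdfs-site.xml");
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
    }
}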
Example of Running a Hadoop MapReduce Job
WordCount Example:
Mapper (WordCountMapper.java):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the input line on whitespace ("\\s+") and emit a (word, 1) pair for each token.
        String[] words = value.toString().split("\\s+");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}
Reducer (WordCountReducer.java):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the counts emitted for this word by all mappers (and the combiner).
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Driver (WordCount.java):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        // The reducer also serves as a combiner because summing counts is associative.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the HDFS input directory; args[1] is the output directory, which must not exist yet.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Running the Job:
hadoop jar wordcount.jar WordCount /input_dir /output_dir
Hadoop is a powerful framework for processing large datasets in a distributed environment. Understanding its core components, architecture, and how to run jobs is essential for leveraging its full potential in big data applications.