Yes. We will discuss memory management in Hadoop traditional MapReduce vs PySpark, explained with the example of a complex data pipeline implemented in both.

Let’s delve into a detailed comparison of memory management between Hadoop Traditional MapReduce and PySpark, using a real-world example of a complex data pipeline for both frameworks.

Hadoop Traditional MapReduce

Real-World Example: Complex ETL Pipeline

Scenario: A data pipeline that processes web server logs to compute user session statistics, filter erroneous data, and generate aggregated reports.

  1. Data Ingestion:
    • Read raw logs from HDFS.
  2. Data Cleaning:
    • Filter out records with missing or malformed fields.
  3. Sessionization:
    • Group records by user and time interval to form sessions.
  4. Aggregation:
    • Compute session statistics such as total time spent and number of pages visited.

Memory Management in Hadoop MapReduce

  • Fixed Memory Allocation:
    • Each task (map or reduce) is allocated a fixed amount of memory, configured via parameters like mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.
  • Intermediate Data Spilling:
    • Intermediate results are spilled to disk when in-memory buffers reach a certain threshold (typically 80%).
  • Disk I/O:
    • Heavy reliance on disk for intermediate data during the shuffle and sort phase.

Configuration:

<configuration>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>100</value>
  </property>
</configuration>
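
The 80% spill threshold mentioned above is itself tunable. As an illustrative addition (not part of the pipeline configuration shown here), the standard Hadoop property controlling it can be set alongside the other values:

  <property>
    <name>mapreduce.map.sort.spill.percent</name>
    <value>0.80</value>
  </property>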

Execution and Memory Usage:

  1. Map Phase:
    • Input splits processed by mappers, each allocated 2 GB memory.
    • Intermediate key-value pairs stored in a 1 GB in-memory buffer.
    • Buffer spills to disk when 80% full, resulting in frequent disk I/O.
  2. Shuffle and Sort Phase:
    • Intermediate data merged and sorted on disk.
    • Significant disk I/O overhead due to lack of in-memory processing.
  3. Reduce Phase:
    • Each reducer allocated 4 GB memory.
    • Fetches and processes intermediate data from mappers.
    • Final results written back to HDFS.
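
To make the map and reduce phases concrete, here is a minimal Hadoop Streaming sketch in Python of the cleaning and per-user counting steps. It is illustrative only: the whitespace-separated log format (user, timestamp, page), the file names mapper.py and reducer.py, and the simple counting logic are assumptions, and a production pipeline would more likely use the Java MapReduce API with full sessionization logic.

# mapper.py -- hypothetical Streaming mapper: clean records and emit user-keyed pairs.
# Every pair printed here lands in the map task's sort buffer
# (mapreduce.task.io.sort.mb) and is spilled to local disk when the buffer fills.
import sys

for line in sys.stdin:
    fields = line.strip().split()
    if len(fields) < 3:                      # data cleaning: drop malformed records
        continue
    user, timestamp, page = fields[:3]
    print(f"{user}\t{timestamp}\t{page}")

# reducer.py -- hypothetical Streaming reducer: pages visited per user.
# Input arrives grouped and sorted by user after the disk-based shuffle.
import sys

current_user, page_count = None, 0
for line in sys.stdin:
    user, _timestamp, _page = line.rstrip("\n").split("\t")
    if user != current_user and current_user is not None:
        print(f"{current_user}\t{page_count}")
        page_count = 0
    current_user = user
    page_count += 1
if current_user is not None:
    print(f"{current_user}\t{page_count}")

The job would then be launched with the Hadoop Streaming jar, along the lines of hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /path/to/input -output /path/to/output (jar location and paths are illustrative).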

PySpark

Real-World Example: Complex ETL Pipeline

Scenario: The same data pipeline as above.

Memory Management in PySpark

  • In-Memory Computation:
    • Data stored in-memory using Resilient Distributed Datasets (RDDs) or DataFrames.
    • Intermediate results cached in memory, reducing disk I/O.
  • Configurable Memory Management:
    • Executor memory and cache persistence levels configurable.
    • Dynamic memory management to balance between storage and execution memory.

Configuration:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Complex ETL Pipeline") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.memory.fraction", "0.8") \
    .getOrCreate()
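
As a rough back-of-the-envelope check of what these settings mean under Spark's unified memory model (assuming the default spark.memory.storageFraction of 0.5; exact numbers vary slightly by Spark version):

# Approximate split of a 4 GB executor heap under unified memory management.
executor_heap_mb = 4096                              # spark.executor.memory = 4g
reserved_mb = 300                                    # fixed reservation Spark keeps for internal objects
unified_mb = (executor_heap_mb - reserved_mb) * 0.8  # spark.memory.fraction = 0.8
storage_mb = unified_mb * 0.5                        # spark.memory.storageFraction = 0.5 (default)
print(round(unified_mb), round(storage_mb))          # ~3037 MB shared pool, ~1518 MB protected for cached data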

Execution and Memory Usage:

  1. Data Ingestion:
    • Read raw logs from HDFS into a DataFrame.
    df = spark.read.csv("hdfs:///path/to/input/*.csv", header=True, inferSchema=True)
  2. Data Cleaning:
    • Filter out erroneous records in-memory.
    df_cleaned = df.filter(df["user"].isNotNull() & df["timestamp"].isNotNull() & df["page"].isNotNull())
  3. Sessionization:
    • Group records by user and time interval, leveraging in-memory processing.
    from pyspark.sql.functions import window, count, max as max_, min as min_
    df_sessions = df_cleaned.groupBy("user", window("timestamp", "30 minutes")).agg(
        count("page").alias("page_count"),
        (max_("timestamp").cast("long") - min_("timestamp").cast("long")).alias("session_duration"))  # seconds
  4. Aggregation:
    • Compute session statistics with cached intermediate results.
    df_sessions.cache()
    df_aggregated = df_sessions.groupBy("user").agg({"session_duration": "sum", "page_count": "sum"})
  5. Write Results:
    • Output the results back to HDFS.
    df_aggregated.write.parquet("hdfs:///path/to/output/")
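
The cache() call in step 4 uses Spark's default persistence level for DataFrames. Because persistence levels are configurable, the level can also be chosen explicitly when memory is tight; a minimal sketch for the sessionized data (MEMORY_AND_DISK is what cache() typically maps to for DataFrames):

from pyspark import StorageLevel

# Explicitly keep partitions in memory and spill to disk only when they do not fit.
df_sessions.persist(StorageLevel.MEMORY_AND_DISK)

# Release the cached data once the aggregated results have been written out.
df_sessions.unpersist()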

Memory Usage Details:

  • Executor Memory:
    • Each executor allocated 4 GB memory.
    • Spark dynamically manages memory between storage (cached data) and execution (task processing).
  • In-Memory Processing:
    • Intermediate results (e.g., cleaned data, sessionized data) stored in-memory.
    • Caching reduces recomputation and minimizes disk I/O.
  • Memory Efficiency:
    • Spark’s memory management allows efficient handling of large datasets with minimal spilling to disk.
    • Executors can be dynamically allocated based on workload, improving resource utilization.
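
Dynamic allocation is opt-in. A hedged sketch of enabling it on the same session builder (executor counts are illustrative, and an external shuffle service or shuffle tracking must be available on the cluster):

# Executors scale between 2 and 20 based on pending tasks (Spark 3.x-style settings).
spark = SparkSession.builder \
    .appName("Complex ETL Pipeline") \
    .config("spark.executor.memory", "4g") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "2") \
    .config("spark.dynamicAllocation.maxExecutors", "20") \
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true") \
    .getOrCreate()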

Comparison Summary:

| Feature | Hadoop MapReduce | PySpark |
|---|---|---|
| Memory Allocation | Fixed per task (e.g., 2 GB for mappers) | Configurable executor memory (e.g., 4 GB) |
| Intermediate Data Handling | Spilled to disk when buffers are full | Cached in memory, reduced disk I/O |
| Shuffle and Sort | Disk-based, I/O intensive | In-memory, optimized memory management |
| Data Caching | Not supported | Supported, reducing recomputation |
| Dynamic Resource Allocation | Not supported | Supported, efficient resource utilization |
| Execution Speed | Slower due to disk I/O | Faster due to in-memory computation |

Hadoop Traditional MapReduce relies heavily on disk I/O for intermediate data management, leading to potential performance bottlenecks. Memory management is fixed and can result in frequent spills to disk. In contrast, PySpark utilizes in-memory computation, configurable memory management, and dynamic resource allocation, enabling faster data processing and more efficient memory usage. This makes PySpark more suitable for complex data pipelines, especially those requiring iterative operations and real-time data analysis.

