Which parts of your PySpark ETL script are executed on the driver, master (YARN), or executors

Understanding how PySpark scripts execute across the different nodes in a cluster is crucial for optimization and debugging. Here’s a breakdown of what runs on the driver, master/YARN, executors, and NameNode, followed by the steps you can use to verify it:

Driver:

  1. Script initialization: Creating the SparkSession, applying configuration, and setting up the Spark context.
  2. Data ingestion planning: Defining reads from sources (e.g., files, databases); the actual rows are loaded later by executor tasks.
  3. Data transformation definitions: Defining transformations (e.g., maps, filters, joins); these are lazy and only build the query plan.
  4. Action execution: Calling actions (e.g., count(), show(), write()), which makes the driver submit jobs to the cluster; results of count() or collect() come back to the driver (see the sketch below).
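
To make the driver’s role concrete, here is a minimal sketch (the input path and column name are placeholders, not from any specific dataset) showing which lines only build a plan on the driver and which line actually triggers distributed execution:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Runs on the driver: create the session and set up the application.
spark = SparkSession.builder.appName("DriverSideDemo").getOrCreate()

# Runs on the driver: only the read is planned here (schema and file listing);
# no rows are loaded yet.
df = spark.read.parquet("input_data")

# Runs on the driver: transformations are lazy and merely extend the logical plan.
adults = df.filter(F.col("age") > 18)

# Still called on the driver, but this action triggers distributed execution:
# the driver turns the plan into stages and tasks, ships them to the executors,
# and only the final count comes back to the driver.
print(adults.count())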

Executor:

  1. Task execution: Executing tasks assigned by the driver (e.g., mapping, filtering, joining).
  2. Data processing: Processing data in parallel across executor nodes.
  3. Shuffle operations: Exchanging data between executors for wide transformations (e.g., joins, groupBy, repartition), as shown in the sketch below.
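
As an illustrative sketch (column names are placeholders), the snippet below contrasts a narrow transformation, which each executor applies to its own partitions, with a wide transformation that forces executors to exchange data in a shuffle:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ExecutorSideDemo").getOrCreate()
df = spark.read.parquet("input_data")   # the file reads themselves run as executor tasks

# Narrow transformation: each executor filters its own partitions; no data movement.
adults = df.filter(F.col("age") > 18)

# Wide transformation: all rows with the same key must end up on the same executor,
# so a shuffle (executor-to-executor exchange) is planned here.
counts_by_city = adults.groupBy("city").count()

# Action: the driver submits the job; the filtering, shuffling, and counting
# are executed in parallel on the executors.
counts_by_city.show()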

Master/YARN:

  1. Resource management: Managing resources (e.g., memory, CPU) for the Spark application.
  2. Job scheduling: Scheduling jobs and tasks on executor nodes.
  3. Monitoring: Tracking application progress and performance.
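
The resource side of this is usually expressed through configuration. Here is a minimal sketch (the numbers are placeholders, and in practice these values are often passed on the spark-submit command line instead) of how an application tells YARN how many executor containers to grant and how large they should be:

from pyspark.sql import SparkSession

# When the application is submitted to YARN, these settings tell the ResourceManager
# how many executor containers to allocate and how much memory/CPU each one gets.
spark = (SparkSession.builder
         .appName("YarnResourceDemo")
         .master("yarn")
         .config("spark.executor.instances", "4")
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "2")
         # Note: driver memory normally has to be set at submit time,
         # before the driver JVM starts (e.g., via spark-submit --driver-memory).
         .config("spark.driver.memory", "2g")
         .getOrCreate())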

NameNode (HDFS):

  1. Data storage: Storing data in HDFS.
  2. Metadata management: Managing file system metadata.
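
In a PySpark script this shows up only indirectly: when the driver plans a read against an HDFS path, it asks the NameNode for file metadata and block locations, and the executors then fetch the actual blocks from the DataNodes. A minimal sketch (host, port, and path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HdfsReadDemo").getOrCreate()

# Planning this read contacts the NameNode for metadata and block locations;
# the blocks themselves are read from the DataNodes by executor tasks.
df = spark.read.parquet("hdfs://namenode-host:8020/data/input_data")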

To identify which parts of your script run on each component, follow these steps:

Step 1: Enable Spark UI

Add the following configuration to your SparkSession:

from pyspark.sql import SparkSession

# The Spark UI is enabled on port 4040 by default; these settings just make that
# explicit (or let you pick another port if 4040 is already taken).
spark = SparkSession.builder \
    .appName("Your App Name") \
    .config("spark.ui.enabled", "true") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

Step 2: Analyze Spark UI

Access the Spark UI at http://driver-node-ip:4040 while the application is running (on YARN it is also reachable through the ResourceManager UI’s ApplicationMaster link):

  1. Jobs: View job execution details, including task execution times.
  2. Stages: Analyze stage execution, including shuffle operations.
  3. Tasks: Examine task execution on individual executor nodes.
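
The live UI at port 4040 disappears once the application finishes. If you want to revisit the same Jobs/Stages/Tasks views afterwards, you can write event logs and open them in the Spark History Server; a minimal sketch (the log directory is a placeholder):

from pyspark.sql import SparkSession

# Event logs let the Spark History Server replay the Jobs/Stages/Tasks views
# after the application has ended and the live UI on port 4040 is gone.
spark = (SparkSession.builder
         .appName("Your App Name")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "hdfs:///spark-logs")   # placeholder path
         .getOrCreate())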

Step 3: Use Debugging Tools

  1. Spark logs: Inspect logs for driver, executor, and master nodes.
  2. Print statements: Add print statements to track execution; output from top-level script code appears in the driver console, while output from functions shipped to executors lands in the executor logs (see the sketch below).
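
A quick way to see the split in practice is to compare where print output ends up; a minimal sketch (function and app names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PrintPlacementDemo").getOrCreate()
df = spark.range(0, 100)

# Top-level script code runs on the driver, so this line appears in the driver's console.
print("running on the driver")

def log_partition(rows):
    # This function is serialized and shipped to the executors; its output appears
    # in the executor stdout (YARN container logs / the Spark UI "Executors" tab),
    # not in the driver console.
    print("running on an executor")
    for _ in rows:
        pass

df.foreachPartition(log_partition)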

Real-Life Use Case:

Suppose you have a PySpark script like the following (paths and column names are placeholders):

from pyspark.sql import SparkSession

# Driver: initialize the SparkSession and plan the reads
# (the files themselves are loaded later by executor tasks)
spark = SparkSession.builder.appName("ETL").getOrCreate()
data = spark.read.parquet("input_data")
data2 = spark.read.parquet("input_data2")   # placeholder second source for the join

# Driver: define transformations (lazy -- only the query plan is built here)
transformed_data = data.filter(data["age"] > 18).join(data2, "id")

# Driver: define a repartition; the shuffle it implies runs on the executors
# once an action is triggered
transformed_data = transformed_data.repartition(10)

# Driver: trigger the write action; the executors run the tasks and write the
# output files in parallel
transformed_data.write.parquet("output_data")

In this example:

  • The driver initializes the SparkSession, plans the reads, defines the (lazy) transformations, and triggers the write action.
  • The executors run the tasks the driver submits: reading the input files, filtering, joining, shuffling for the repartition, and writing the output files.
  • The master/YARN manages resources, schedules jobs, and monitors application progress.

By analyzing the Spark UI, logs, and debugging tools, you can gain insights into which parts of your script run on each component.
