Which parts of your PySpark ETL script are executed on the driver, master (YARN), or executors

Understanding how PySpark scripts execute across the different nodes in a cluster is crucial for optimization and debugging. Here’s a breakdown of what runs on the driver, master/YARN, executors, and NameNode, followed by the steps you can use to verify it:

Driver:

  1. Script initialization: Creating the SparkSession, applying configuration, and setting up the Spark context.
  2. Data ingestion planning: Defining reads from sources (e.g., files, databases); the actual rows are loaded later by executor tasks.
  3. Data transformation definitions: Defining transformations (e.g., maps, filters, joins); these are lazy and only build the query plan.
  4. Action execution: Calling actions (e.g., count(), show(), write()), which makes the driver submit jobs to the cluster; results of count() or collect() come back to the driver (see the sketch below).
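
To make the driver’s role concrete, here is a minimal sketch (the input path and column name are placeholders, not from any specific dataset) showing which lines only build a plan on the driver and which line actually triggers distributed execution:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Runs on the driver: create the session and set up the application.
spark = SparkSession.builder.appName("DriverSideDemo").getOrCreate()

# Runs on the driver: only the read is planned here (schema and file listing);
# no rows are loaded yet.
df = spark.read.parquet("input_data")

# Runs on the driver: transformations are lazy and merely extend the logical plan.
adults = df.filter(F.col("age") > 18)

# Still called on the driver, but this action triggers distributed execution:
# the driver turns the plan into stages and tasks, ships them to the executors,
# and only the final count comes back to the driver.
print(adults.count())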

Executor:

  1. Task execution: Executing tasks assigned by the driver (e.g., mapping, filtering, joining).
  2. Data processing: Processing data in parallel across executor nodes.
  3. Shuffle operations: Exchanging data between executors for wide transformations (e.g., joins, groupBy, repartition), as shown in the sketch below.
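
As an illustrative sketch (column names are placeholders), the snippet below contrasts a narrow transformation, which each executor applies to its own partitions, with a wide transformation that forces executors to exchange data in a shuffle:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ExecutorSideDemo").getOrCreate()
df = spark.read.parquet("input_data")   # the file reads themselves run as executor tasks

# Narrow transformation: each executor filters its own partitions; no data movement.
adults = df.filter(F.col("age") > 18)

# Wide transformation: all rows with the same key must end up on the same executor,
# so a shuffle (executor-to-executor exchange) is planned here.
counts_by_city = adults.groupBy("city").count()

# Action: the driver submits the job; the filtering, shuffling, and counting
# are executed in parallel on the executors.
counts_by_city.show()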

Master/YARN:

  1. Resource management: Managing resources (e.g., memory, CPU) for the Spark application.
  2. Job scheduling: Scheduling jobs and tasks on executor nodes.
  3. Monitoring: Tracking application progress and performance.
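
The resource side of this is usually expressed through configuration. Here is a minimal sketch (the numbers are placeholders, and in practice these values are often passed on the spark-submit command line instead) of how an application tells YARN how many executor containers to grant and how large they should be:

from pyspark.sql import SparkSession

# When the application is submitted to YARN, these settings tell the ResourceManager
# how many executor containers to allocate and how much memory/CPU each one gets.
spark = (SparkSession.builder
         .appName("YarnResourceDemo")
         .master("yarn")
         .config("spark.executor.instances", "4")
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "2")
         # Note: driver memory normally has to be set at submit time,
         # before the driver JVM starts (e.g., via spark-submit --driver-memory).
         .config("spark.driver.memory", "2g")
         .getOrCreate())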

NameNode (HDFS):

  1. Data storage: Storing data in HDFS.
  2. Metadata management: Managing file system metadata.
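
In a PySpark script this shows up only indirectly: when the driver plans a read against an HDFS path, it asks the NameNode for file metadata and block locations, and the executors then fetch the actual blocks from the DataNodes. A minimal sketch (host, port, and path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HdfsReadDemo").getOrCreate()

# Planning this read contacts the NameNode for metadata and block locations;
# the blocks themselves are read from the DataNodes by executor tasks.
df = spark.read.parquet("hdfs://namenode-host:8020/data/input_data")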

To identify which parts of your script run on each component, follow these steps:

Step 1: Enable Spark UI

Add the following configuration to your SparkSession:

from pyspark.sql import SparkSession

# The Spark UI is enabled on port 4040 by default; these settings just make that
# explicit (or let you pick another port if 4040 is already taken).
spark = SparkSession.builder \
    .appName("Your App Name") \
    .config("spark.ui.enabled", "true") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

Step 2: Analyze Spark UI

Access the Spark UI at http://driver-node-ip:4040 while the application is running (on YARN it is also reachable through the ResourceManager UI’s ApplicationMaster link):

  1. Jobs: View job execution details, including task execution times.
  2. Stages: Analyze stage execution, including shuffle operations.
  3. Tasks: Examine task execution on individual executor nodes.
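
The live UI at port 4040 disappears once the application finishes. If you want to revisit the same Jobs/Stages/Tasks views afterwards, you can write event logs and open them in the Spark History Server; a minimal sketch (the log directory is a placeholder):

from pyspark.sql import SparkSession

# Event logs let the Spark History Server replay the Jobs/Stages/Tasks views
# after the application has ended and the live UI on port 4040 is gone.
spark = (SparkSession.builder
         .appName("Your App Name")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "hdfs:///spark-logs")   # placeholder path
         .getOrCreate())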

Step 3: Use Debugging Tools

  1. Spark logs: Inspect logs for driver, executor, and master nodes.
  2. Print statements: Add print statements to track execution; output from top-level script code appears in the driver console, while output from functions shipped to executors lands in the executor logs (see the sketch below).
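
A quick way to see the split in practice is to compare where print output ends up; a minimal sketch (function and app names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PrintPlacementDemo").getOrCreate()
df = spark.range(0, 100)

# Top-level script code runs on the driver, so this line appears in the driver's console.
print("running on the driver")

def log_partition(rows):
    # This function is serialized and shipped to the executors; its output appears
    # in the executor stdout (YARN container logs / the Spark UI "Executors" tab),
    # not in the driver console.
    print("running on an executor")
    for _ in rows:
        pass

df.foreachPartition(log_partition)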

Real-Life Use Case:

Suppose you have a PySpark script like the following (paths and column names are placeholders):

from pyspark.sql import SparkSession

# Driver: initialize the SparkSession and plan the reads
# (the files themselves are loaded later by executor tasks)
spark = SparkSession.builder.appName("ETL").getOrCreate()
data = spark.read.parquet("input_data")
data2 = spark.read.parquet("input_data2")   # placeholder second source for the join

# Driver: define transformations (lazy -- only the query plan is built here)
transformed_data = data.filter(data["age"] > 18).join(data2, "id")

# Driver: define a repartition; the shuffle it implies runs on the executors
# once an action is triggered
transformed_data = transformed_data.repartition(10)

# Driver: trigger the write action; the executors run the tasks and write the
# output files in parallel
transformed_data.write.parquet("output_data")

In this example:

  • The driver initializes the SparkSession, plans the reads, defines the (lazy) transformations, and triggers the write action.
  • The executors run the tasks the driver submits: reading the input files, filtering, joining, shuffling for the repartition, and writing the output files.
  • The master/YARN manages resources, schedules jobs, and monitors application progress.

By analyzing the Spark UI, logs, and debugging tools, you can gain insights into which parts of your script run on each component.
