How PySpark executes your script in a distributed environment.
Suppose I have a script with configuration files as Python dictionaries, Python variables as parameters, and a Python loop to process a metadata table row by row.
When a PySpark script runs in a distributed environment, there is a clear distinction between operations handled on the driver and those performed by the executors. Let’s break this down using the context provided: a script with configuration files as Python dictionaries, Python variables as parameters, and a Python loop that processes a metadata table row by row.
1. Driver and Initial Code Execution
The driver is the central node where your PySpark application is initiated. It is responsible for:
- Interpreting the PySpark script.
- Initializing the Spark session and creating the SparkContext.
- Reading Python objects (like dictionaries for configuration files) and resolving Python variables.
What Happens at the Driver:
- Configuration dictionaries and variables: Python dictionaries and script-level variables are resolved and initialized at the driver. These variables and configurations are available globally to the application.
- Metadata table reading: If the metadata table is read row by row using Python loops (e.g., `for` loops), the loop executes on the driver because it is plain Python logic (a minimal sketch follows below).
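To make this concrete, here is a minimal driver-only sketch (the `local[*]` master and the example values are illustrative assumptions): everything below is plain Python that runs on the driver, and no Spark job is triggered because no DataFrame action is called.

```python
from pyspark.sql import SparkSession

# Runs on the driver: builds the session and creates the SparkContext.
spark = SparkSession.builder.master("local[*]").appName("DriverOnlyDemo").getOrCreate()

# Plain Python objects: resolved and held in driver memory.
config = {"threshold": 50, "filter_column": "age"}
metadata_table = [{"table_name": "sales"}, {"table_name": "customers"}]

# Plain Python loop: executes entirely on the driver.
for row in metadata_table:
    # This print appears in the driver's stdout/log, not on any executor.
    print(f"Driver is preparing work for table: {row['table_name']}")

# No distributed work has happened yet: no action has been called.
```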
2. Parallelization of Work
When operations involve transformations or actions on distributed datasets (e.g., RDDs or DataFrames), the driver orchestrates the parallelization by:
- Creating logical execution plans for transformations (e.g., `filter`, `map`, `join`).
- Splitting these operations into stages, which are collections of tasks (see the sketch after this list).
- Sending serialized tasks to the cluster’s worker nodes (executors) for parallel execution.
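A minimal sketch of that planning step, assuming a SparkSession named `spark` already exists and using toy data: the transformations only build a logical plan on the driver, `explain()` prints that plan, and the `count()` action is what makes the driver break the job into stages and tasks for executors.

```python
# Assumes `spark` is an existing SparkSession (see the example further below).
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cara", 29)],
    ["name", "age"],
)

# Transformations are lazy: the driver only records them in a logical plan.
filtered = df.filter(df["age"] > 30).select("name")

# Prints the plan the driver has built; still no tasks have been sent out.
filtered.explain()

# The action triggers a job: the driver splits it into stages and sends
# serialized tasks to executors, which process partitions in parallel.
print(filtered.count())
```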
3. Code Execution Perspective
Script Components:
- Python Dictionaries and Variables:
  - Resolved and stored on the driver.
  - If needed on executors (e.g., in transformations), they are serialized and sent to the executors (see the sketch after this list).
- Python Loops:
  - Loops that operate on plain Python objects (e.g., `for row in metadata_table`) execute only on the driver.
  - For distributed operations (e.g., processing a distributed DataFrame row by row), operations inside the loop can generate tasks for executors.
- Row-Wise Transformations:
  - When the Python script applies transformations (e.g., `map`, `filter`, `reduce`), the driver translates these into tasks and distributes them to executors.
  - Executors perform the transformations in parallel on data partitions.
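The sketch below illustrates the serialization point from the list above, assuming an existing SparkSession `spark`: the `config` dictionary is captured in the UDF's closure and shipped to executors automatically, and `broadcast()` is shown as the explicit alternative for larger lookup objects.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

config = {"threshold": 50}  # lives in driver memory

# The lambda closes over `config`; Spark serializes that closure and ships it
# to every executor that runs this UDF.
above_threshold = F.udf(lambda age: age > config["threshold"], BooleanType())

df = spark.createDataFrame([("Alice", 34), ("Bob", 61)], ["name", "age"])
df.filter(above_threshold(F.col("age"))).show()  # evaluated on executors

# For larger objects, broadcast once instead of re-shipping with every task.
bc_config = spark.sparkContext.broadcast(config)
is_above = F.udf(lambda age: age > bc_config.value["threshold"], BooleanType())
df.filter(is_above(F.col("age"))).show()
```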
Execution Flow Example:
Example Code:
```python
from pyspark.sql import SparkSession

# Step 1: Initialize SparkSession (Driver)
spark = SparkSession.builder.master("yarn").appName("DistributedExecution").getOrCreate()

# Step 2: Configurations (Resolved on Driver)
config = {"threshold": 50, "filter_column": "age"}

# Step 3: Read Metadata Table (Driver)
metadata = [{"table_name": "sales", "filter_value": 30}, {"table_name": "customers", "filter_value": 40}]

for row in metadata:  # Python loop on Driver
    table_name = row["table_name"]
    filter_value = row["filter_value"]

    # Step 4: Distributed Processing (Executors)
    df = spark.read.table(table_name)
    result = (
        df.filter(df[config["filter_column"]] > filter_value)  # Task created for Executors
        .select("name", "age")
        .count()  # Action triggers execution
    )
    print(f"Table: {table_name}, Count: {result}")
```
Execution Breakdown:
- Driver Execution:
  - Initializes the Spark session.
  - Resolves the `config` dictionary and the `metadata` Python variables.
  - Executes the `for` loop to process the `metadata` list.
- Task Distribution to Executors:
  - For each table in `metadata`, the driver:
    - Reads the table (logical plan creation).
    - Defines transformations (e.g., `filter` and `select`) and creates a task.
    - Sends the task and relevant data to executors.
- Executor Execution:
  - Executors receive partitions of the table data.
  - Apply the `filter` and `select` transformations on their respective partitions.
  - Return results (e.g., counts) to the driver.
Key Takeaways:
- Parameter Resolution: Python variables and dictionaries are resolved on the driver and serialized to executors only when needed for distributed operations.
- Code Interpretation: The entire PySpark script is interpreted at the driver, but only distributed operations (e.g., DataFrame transformations) generate tasks for executors.
- Executor Role: Executors independently process partitions of data in parallel, performing transformations and actions as instructed by the driver.
- Driver-Executor Communication: The driver coordinates execution and aggregates results from executors.
This separation of responsibilities between the driver and executors ensures scalability and efficiency in a distributed environment.
Here’s a high-level overview of the execution process:
- Driver Initialization: When you run your PySpark script, the driver program initializes. The driver is responsible for coordinating the execution of your script across the cluster.
- Script Interpretation: The driver interprets your PySpark script, including any Python code, variables, and configurations.
- Task Creation: When the driver encounters a PySpark action (e.g., `df.show()`, `df.write()`), it creates tasks for that action and for the transformations leading up to it (e.g., `df.filter()`, `df.select()`); transformations on their own are lazy and only extend the plan.
- Task Distribution: The driver distributes these tasks across the executors in the cluster.
- Task Execution: Each executor executes the tasks assigned to it. The executors may read data from storage, perform computations, and write data to storage.
- Result Collection: The executors return the results of their tasks to the driver.
- Result Processing: The driver processes the results, which may involve combining data from multiple tasks, performing additional computations, or returning the results to the user.
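To make the result-collection steps concrete, here is a hedged sketch (assuming an existing SparkSession `spark`; the output path is illustrative): `count()` and `collect()` bring results back to the driver, while `write` lets executors write their partitions directly to storage, with only task status returning to the driver.

```python
# Assumes `spark` is an existing SparkSession; the path below is illustrative.
df = spark.range(0, 1000).withColumnRenamed("id", "value")

# Actions that return data: executors compute partitions and send results back.
total = df.count()              # a single number arrives at the driver
sample = df.limit(5).collect()  # a small list of Row objects on the driver
print(total, sample)

# Action that writes data: each executor writes its own partitions to storage;
# only task status comes back to the driver, not the data itself.
df.write.mode("overwrite").parquet("/tmp/demo_output")
```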
Now, let’s dive deeper into how your specific script is executed:
Suppose you have a PySpark script like this:
```python
from pyspark.sql.functions import lit

# Assumes `spark` (a SparkSession) and `metadata_table` (a Spark DataFrame
# listing the tables to process) already exist; see the setup sketch below.

# config dictionary
config = {"input_path": "/data/input", "output_path": "/data/output"}

# python variables
num_partitions = 10

# python loop to read metadata table row by row
for row in metadata_table.collect():
    # create a DataFrame from the input data
    df = spark.read.parquet(config["input_path"] + "/" + row["table_name"])

    # perform some transformations
    df_transformed = df.repartition(num_partitions).withColumn("new_column", lit(1))

    # write the transformed DataFrame to output
    df_transformed.write.parquet(config["output_path"] + "/" + row["table_name"])
```
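The snippet above assumes that `spark`, `metadata_table`, and `lit` already exist. A minimal, hypothetical setup (the table names are placeholders; in practice the metadata table would likely be read from a database or a control file) might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MetadataDrivenLoad").getOrCreate()

# Hypothetical metadata table listing the tables to process.
metadata_table = spark.createDataFrame(
    [("sales",), ("customers",)],
    ["table_name"],
)
```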
Here’s how this script is executed:
- Driver Initialization: The driver initializes and interprets the script.
- Config and Variable Resolution: The driver resolves the `config` dictionary and the `num_partitions` variable.
- Metadata Table Collection: The driver collects the metadata table into a Python list using the `collect()` method.
- Python Loop Execution: The driver executes the Python loop, iterating over each row in the metadata table.
- Task Creation: For each iteration of the loop, the driver creates tasks to read the input data, perform transformations, and write the output data.
- Task Distribution: The driver distributes these tasks across the executors in the cluster.
- Task Execution: Each executor executes the tasks assigned to it, reading input data, performing transformations, and writing output data.
- Result Collection: The executors return the results of their tasks to the driver.
- Result Processing: The driver processes the results, which may involve combining data from multiple tasks or returning the results to the user.
Note that the driver executes the Python loop and creates tasks for each iteration. The tasks are then distributed across the executors, which execute the tasks in parallel. This allows your script to take advantage of the distributed computing capabilities of Spark.
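As a small, optional addition to the loop above, each iteration's jobs can be labeled with `SparkContext.setJobDescription` so they show up separately in the Spark UI; apart from that label, the body mirrors the loop already shown.

```python
for row in metadata_table.collect():  # the loop itself still runs on the driver
    table_name = row["table_name"]

    # Label the jobs triggered in this iteration so they are easy to spot
    # in the Spark UI's job list.
    spark.sparkContext.setJobDescription(f"load table: {table_name}")

    df = spark.read.parquet(config["input_path"] + "/" + table_name)
    df_transformed = df.repartition(num_partitions).withColumn("new_column", lit(1))

    # Each write is an action: it triggers jobs whose tasks run on executors
    # while the driver waits for them to complete before the next iteration.
    df_transformed.write.parquet(config["output_path"] + "/" + table_name)
```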
Is there any place where we can practice Spark for free?
Try Google Colab.