Confused between driver, driver program, master node, YARN... Is the master node the one that initiates the driver code, or is the master node the resource manager?
Here’s a breakdown of the different components:
Driver Program
- The driver program is the actual Spark application code that you write.
- It’s the program that creates a SparkContext, loads data, applies transformations, and initiates actions.
- The driver program runs on the driver node.
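For example, a minimal driver program might look like this (a sketch using PySpark; the file name `data.csv` and the column `amount` are placeholders):

```python
from pyspark.sql import SparkSession

# The driver program: creates the SparkSession (which wraps the
# SparkContext), loads data, applies transformations, and runs an action.
spark = SparkSession.builder.appName("example-app").getOrCreate()

df = spark.read.csv("data.csv", header=True, inferSchema=True)  # load data
filtered = df.filter(df["amount"] > 100)                        # transformation (lazy)
print(filtered.count())                                         # action (triggers execution)

spark.stop()
```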
Driver Node
- The driver node is the node where the driver program runs.
- It’s responsible for coordinating the execution of the Spark application.
- The driver node is also responsible for maintaining the SparkContext and managing the Spark application’s lifecycle.
Master Node
- The master node is the node that runs the resource manager (e.g., YARN, Mesos, or Spark Standalone).
- The resource manager is responsible for managing the cluster’s resources and scheduling tasks.
- The master node receives requests from the driver node to launch executors and allocate resources.
YARN (Yet Another Resource Negotiator)
- YARN is a resource management framework that’s used in Hadoop clusters.
- YARN provides a way to manage resources and schedule tasks in a distributed environment.
- In a Spark application, YARN acts as the resource manager and manages the allocation of resources to the Spark application.
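As an illustration, a driver can be pointed at a YARN cluster when building its session (a sketch; the executor count and memory size are illustrative, and this assumes the Hadoop client configuration is available on the driver machine — in practice this is often set through `spark-submit --master yarn` instead):

```python
from pyspark.sql import SparkSession

# Ask YARN (rather than Spark Standalone) to allocate the executors.
spark = (
    SparkSession.builder
    .appName("yarn-example")
    .master("yarn")
    .config("spark.executor.instances", "4")  # how many executors YARN should launch
    .config("spark.executor.memory", "2g")    # memory per executor container
    .getOrCreate()
)
```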
Spark Standalone
- Spark Standalone is a simple cluster manager that comes bundled with Spark.
- It provides a way to manage a Spark cluster without relying on an external resource manager like YARN or Mesos.
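Connecting to a standalone master only changes the master URL (a sketch; `spark://master-host:7077` is a placeholder for your master's hostname and the default standalone port):

```python
from pyspark.sql import SparkSession

# The standalone master listens on port 7077 by default.
spark = (
    SparkSession.builder
    .appName("standalone-example")
    .master("spark://master-host:7077")  # placeholder host name
    .getOrCreate()
)
```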
Here’s a high-level overview of how the different components interact:
- The driver program creates a SparkContext and initiates the Spark application.
- The driver node sends a request to the master node to launch executors and allocate resources.
- The master node (running YARN or Spark Standalone) receives the request and manages the allocation of resources to the Spark application.
- The executors are launched on the worker nodes, and the Spark application is executed in parallel across the cluster.
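The end result of this handshake is that ordinary driver code runs in parallel on the executors. A small sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-sum").getOrCreate()
sc = spark.sparkContext

# The driver defines 8 partitions; the scheduler turns each partition
# into a task and ships the tasks to the executors.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
total = rdd.map(lambda x: x * x).sum()  # computed in parallel on the executors
print(total)
```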
So the driver node, master node, and workers (sets of executors) are different?
In a Spark cluster, the driver node, master node, and worker nodes (which run the executors) are typically separate entities.
Here’s a brief overview of each:
Driver Node
- Runs the Spark driver program (i.e., your Spark application code)
- Creates and manages the SparkContext
- Coordinates the execution of tasks on the executors
- Typically runs on a separate machine or node
Master Node
- Runs the resource manager (e.g., YARN, Mesos, or Spark Standalone)
- Manages the cluster’s resources (e.g., memory, CPU, etc.)
- Schedules tasks and allocates resources to the Spark application
- Typically runs on a separate machine or node
Worker Nodes (Executors)
- Run the Spark executors, which execute tasks assigned by the driver node
- Executors run on the worker nodes; a single worker node can host one or more executors
- Executors perform the actual computation and data processing
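One way to see that tasks really run on the workers is to have each task report its hostname (a sketch; on a real cluster this prints the worker machines' names, while in local mode it prints only your own):

```python
import socket

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("where-do-tasks-run").getOrCreate()
sc = spark.sparkContext

# Each partition becomes a task; each task reports the host it ran on.
hosts = (
    sc.parallelize(range(16), numSlices=16)
    .map(lambda _: socket.gethostname())
    .distinct()
    .collect()
)
print(hosts)
```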
Here’s a simple illustration:
Driver Node (Spark Application Code)
        |
        |  (SparkContext creation)
        v
Master Node (Resource Manager)
        |
        |  (Resource allocation)
        v
Worker Nodes (Executors)
   +------------+   +------------+   +------------+
   |  Executor  |   |  Executor  |   |  Executor  |  ...
   |  (tasks)   |   |  (tasks)   |   |  (tasks)   |
   +------------+   +------------+   +------------+
In this setup, the driver node and master node are typically separate, and the worker nodes (executors) are also separate. However, in some cases, the driver node and master node can be combined into a single node, or the executors can run on the same node as the driver or master node.
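The extreme case of "combined" is local mode, where the driver, the scheduling, and the executor threads all live in one process on one machine. A minimal sketch:

```python
from pyspark.sql import SparkSession

# local[*]: no master node and no separate workers; the driver process
# runs the tasks itself on as many threads as there are CPU cores.
spark = SparkSession.builder.master("local[*]").appName("local-mode").getOrCreate()
print(spark.range(10).count())  # still parallel, but all in one process
spark.stop()
```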
Is there any place where we can practice Spark for free?
Try Google Colab. It's free, and you can install PySpark with pip and run it in local mode right in the notebook.
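A minimal sketch for Colab (assuming a fresh notebook; `pyspark` installs from PyPI and runs in local mode, so no cluster is needed):

```python
# In a Colab cell, install PySpark first:
#   !pip install pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("practice").getOrCreate()
spark.range(5).show()  # quick smoke test
```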