PySpark Architecture Cheat Sheet
1. Core Components of PySpark
Component | Description | Key Features |
---|---|---|
Spark Core | The foundational Spark component for scheduling, memory management, and fault tolerance. | Task scheduling, data partitioning, RDD APIs. |
Spark SQL | Enables interaction with structured data via SQL, DataFrames, and Datasets. | Supports SQL queries, schema inference, integration with Hive. |
Spark Streaming | Allows real-time data processing through micro-batching. | DStreams, integration with Kafka, Flume, etc. |
Spark MLlib | Provides scalable machine learning algorithms. | Algorithms for classification, clustering, regression, etc. |
Spark GraphX | Supports graph processing and analysis for complex networks. | Graph algorithms, graph-parallel computation. |
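A minimal sketch of how these components surface in a single PySpark program (runs locally; the sample rows and column names are purely illustrative):

```python
from pyspark.sql import SparkSession

# Spark Core + Spark SQL: the SparkSession is the entry point to both
spark = SparkSession.builder.appName("ComponentsDemo").getOrCreate()

# Spark SQL / DataFrame API: structured data with named columns
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.createOrReplaceTempView("people")
spark.sql("SELECT count(*) AS n FROM people").show()

# Spark Core: the underlying RDD API is still reachable via the SparkContext
rdd = spark.sparkContext.parallelize(range(10))
print(rdd.sum())

spark.stop()
```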
2. PySpark Layered Architecture
Layer | Description | Key Functions |
---|---|---|
Application Layer | Contains user applications and custom PySpark code. | Custom data workflows and business logic. |
Spark API Layer | Provides PySpark APIs to interact with Spark Core and other components. | High-level abstractions for data manipulation, SQL, streaming. |
Spark Core Layer | Provides core functionalities including task scheduling and fault tolerance. | Data locality, memory management, RDD transformations. |
Execution Layer | Manages the execution of Spark tasks across the cluster. | Task scheduling, load balancing, error handling. |
Storage Layer | Manages data storage with distributed storage solutions. | Supports HDFS, S3, Cassandra, and other storage integrations. |
3. Spark Cluster Architecture
Component | Description | Key Role |
---|---|---|
Driver Node | Runs the Spark application, creates the SparkContext, and coordinates task execution. | Manages jobs, scheduling, and resource allocation. |
Executor Nodes | Run tasks assigned by the driver node and process data on each partition. | Execute tasks, store intermediate results, and return output. |
Cluster Manager | Manages the resources of the Spark cluster. | Allocates resources, manages executor lifecycle. |
Distributed File System | Stores data across the cluster, ensuring high availability. | HDFS, S3, or other compatible storage for data sharing. |
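In code, the driver is the process that builds the SparkSession, while the cluster manager and executor resources are chosen through configuration. A hedged sketch with placeholder values (the master URL and resource sizes are assumptions, not recommendations):

```python
from pyspark.sql import SparkSession

# Hypothetical YARN deployment; adjust the master and resources to your cluster
spark = (
    SparkSession.builder
    .appName("ClusterDemo")
    .master("yarn")                           # cluster manager: yarn, k8s://..., spark://..., or local[*]
    .config("spark.executor.instances", "4")  # number of executor processes
    .config("spark.executor.memory", "4g")    # memory per executor
    .config("spark.executor.cores", "2")      # cores per executor
    .config("spark.driver.memory", "2g")      # memory for the driver process
    .getOrCreate()
)
```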
4. Spark Execution Process
Step | Description | Key Points |
---|---|---|
1. Application Submission | User submits a Spark application via command line or a UI. | Initializes job creation in Spark. |
2. Job Creation | Spark creates jobs and splits them into stages and tasks based on transformations. | Directed Acyclic Graph (DAG) creation based on data flow. |
3. Task Assignment | Driver assigns tasks to executors based on data locality and resource availability. | Utilizes data partitioning for parallelism. |
4. Task Execution | Executors run assigned tasks on data partitions. | Processes data in parallel across the cluster. |
5. Result Collection | Driver collects results from all executors, aggregates, and returns the final output. | Outputs final results to the user or designated storage. |
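These steps can be observed from a script: transformations only extend the DAG, and the final action is what submits jobs, stages, and tasks. A small sketch (column expressions are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ExecutionDemo").getOrCreate()

df = spark.range(1_000_000)                   # no job yet
filtered = df.filter(F.col("id") % 2 == 0)    # transformation: DAG grows, still no job
aggregated = filtered.groupBy((F.col("id") % 10).alias("bucket")).count()

aggregated.explain()          # inspect the physical plan Spark derived from the DAG
result = aggregated.collect() # action: tasks are scheduled onto executors and results return to the driver
```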
5. Spark RDD Architecture
Component | Description | Key Characteristics |
---|---|---|
RDD (Resilient Distributed Dataset) | Immutable, distributed collection of objects across the cluster. | Fault-tolerant, lineage-based recovery, in-memory processing. |
Partition | Subset of data within an RDD stored on a single node. | Enables parallel processing of data. |
Task | Smallest unit of work that operates on a single partition. | Executes transformations or actions on data. |
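A small RDD sketch showing partitions, the tasks that run on them, and the lineage used for recovery:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), numSlices=4)   # 4 partitions -> 4 tasks per stage
print(rdd.getNumPartitions())                   # 4

# Each task applies the function to its own partition in parallel
squares = rdd.map(lambda x: x * x)
print(squares.take(5))

# The lineage (debug string) is what Spark replays to rebuild a lost partition
print(squares.toDebugString().decode())
```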
6. Spark DataFrame Architecture
Component | Description | Key Characteristics |
---|---|---|
DataFrame | Distributed collection of data organized into named columns. | Schema-based data handling, optimized storage, SQL compatibility. |
Dataset | Strongly-typed distributed collection of data (Java/Scala only; not exposed in PySpark). | Type safety, combines features of RDDs and DataFrames. |
Encoder | Converts data between JVM objects and Spark’s internal format for optimized serialization. | Efficient serialization/deserialization for faster processing. |
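A DataFrame sketch showing schema-based handling (Datasets and Encoders are JVM-side concepts with no direct PySpark equivalent; the schema and rows below are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])

df = spark.createDataFrame([(1, "alice"), (2, "bob")], schema=schema)
df.printSchema()                               # named, typed columns
df.select("name").where(df["id"] > 1).show()   # columnar operations with SQL semantics
```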
7. Spark SQL Architecture
Component | Description | Key Functions |
---|---|---|
Catalyst Optimizer | Optimizes Spark SQL queries for enhanced performance. | Logical plan, physical plan optimization. |
Query Planner | Plans the execution of SQL queries by selecting the best execution strategy. | Converts optimized logical plan into physical execution plan. |
Execution Engine | Executes SQL queries using Spark’s distributed computing framework. | Leverages cluster resources for parallel query execution. |
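You can watch the Catalyst optimizer and query planner at work by asking a query for the plans it produced; a small sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLDemo").getOrCreate()

df = spark.range(1000).withColumnRenamed("id", "user_id")
df.createOrReplaceTempView("users")

query = spark.sql("SELECT user_id FROM users WHERE user_id > 500")

# extended=True prints the parsed, analyzed, and optimized logical plans plus the physical plan
query.explain(extended=True)
```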
8. Spark Streaming Architecture
Component | Description | Key Features |
---|---|---|
DStream (Discretized Stream) | Continuous data stream split into micro-batches for processing. | Batch processing in near real-time. |
Receiver | Ingests data from external sources like Kafka, Flume, etc. | Acts as data source for streaming jobs. |
Processor | Processes data within DStream by applying transformations and actions. | Provides transformations similar to RDDs. |
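A minimal sketch using the classic DStream API (Structured Streaming is the newer alternative); the socket host and port are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingDemo")
ssc = StreamingContext(sc, batchDuration=5)   # micro-batches every 5 seconds

# Receiver: ingest lines from a socket source (hypothetical host/port)
lines = ssc.socketTextStream("localhost", 9999)

# Processor: RDD-like transformations applied to each micro-batch
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```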
9. Spark Master-Slave Architecture
Component | Description | Key Role |
---|---|---|
Master Node | Coordinates resource allocation, task scheduling, and overall job management. | Central controller for Spark cluster. |
Worker Nodes | Execute tasks on assigned data partitions as directed by the Master. | Run computations, store data, and handle intermediate results. |
Executor | Process-running unit on each worker node, responsible for executing tasks. | Runs task code, caches data, and sends results to the driver. |
Task | Smallest unit of work on data partitions assigned to executors by the driver. | Executes transformations or actions on data partitions. |
Driver Program | Initiates the Spark application and coordinates overall job execution. | Submits tasks to the master and receives results. |
Cluster Manager | Manages resources for the entire Spark cluster (YARN, Mesos, Kubernetes). | Manages the lifecycle of executors and resource distribution. |
10. Key Concepts in PySpark
Concept | Description | Benefits |
---|---|---|
Lazy Evaluation | Transformations are not executed until an action is called. | Optimizes query execution by grouping operations. |
Fault Tolerance | Spark recovers lost RDDs using lineage information when nodes fail. | Increases reliability in distributed environments. |
In-Memory Processing | Stores intermediate data in memory instead of writing to disk. | Enables faster data processing by avoiding I/O overhead. |
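All three concepts in one sketch: transformations stay lazy until an action, caching keeps the computed result in memory, and lineage is what Spark replays if a cached partition is lost:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ConceptsDemo").getOrCreate()

df = spark.range(10_000_000)
expensive = df.withColumn("value", F.sqrt(F.col("id")))   # lazy: nothing runs yet

expensive.cache()          # ask Spark to keep the result in memory once computed
print(expensive.count())   # action 1: computes the DataFrame and populates the cache
print(expensive.count())   # action 2: served from memory, no recomputation

# If an executor is lost, cached partitions are rebuilt from lineage automatically
```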
11. Common Use Cases for PySpark
- Batch Processing: Large-scale ETL (Extract, Transform, Load) jobs.
- Stream Processing: Real-time analytics, monitoring systems.
- Machine Learning: Training models at scale using Spark MLlib.
- Graph Processing: Social network analysis, recommendation systems.
- Data Warehousing: Leveraging Spark SQL for querying structured datasets.
Which parts of your PySpark ETL script are executed on the driver, master (YARN), or executors
Understanding how PySpark scripts execute across the different nodes in a cluster is crucial for optimization and debugging. Here’s a breakdown of which parts of your script run on the driver, the master/YARN, the executors, and the NameNode (HDFS):
Driver:
- Script initialization: SparkSession creation, configuration, and setting up the Spark context.
- Data ingestion: Defining reads from sources (e.g., files, databases); the actual I/O runs on executors once an action is triggered.
- Data transformation definitions: Defining transformations (e.g., maps, filters, joins).
- Action execution: Calling actions (e.g., count(), show(), write()).
Executor:
- Task execution: Executing tasks assigned by the driver (e.g., mapping, filtering, joining).
- Data processing: Processing data in parallel across executor nodes.
- Shuffle operations: Exchanging data between executors during shuffle operations.
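A quick way to see the split in practice: code in the main script body runs on the driver, while functions passed into transformations run inside executor tasks. A small sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WhereDoesItRun").getOrCreate()
print("This print runs on the driver")   # driver-side code

def double_partition(rows):
    # This function body runs inside an executor task, once per partition;
    # anything it prints appears in executor logs, not the driver console.
    for row in rows:
        yield (row.id, row.id * 2)

rdd = spark.range(100).rdd.mapPartitions(double_partition)
print(rdd.take(3))   # action: the driver triggers tasks, executors run double_partition
```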
Master/YARN:
- Resource management: Managing resources (e.g., memory, CPU) for the Spark application.
- Job scheduling: Scheduling jobs and tasks on executor nodes.
- Monitoring: Tracking application progress and performance.
NameNode (HDFS):
- Data storage: Storing data in HDFS.
- Metadata management: Managing file system metadata.
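When a path points at HDFS, the driver asks the NameNode only for metadata (block locations), and the executors read the blocks themselves when an action runs. The NameNode host and path below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HDFSDemo").getOrCreate()

# Hypothetical HDFS location; the NameNode serves only the metadata for this path
df = spark.read.parquet("hdfs://namenode-host:8020/data/events")

# The actual block reads happen inside executor tasks when an action runs
print(df.count())
```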
To identify which parts of your script run on each component, follow these steps:
Step 1: Enable Spark UI
Add the following configuration to your SparkSession:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Your App Name") \
    .config("spark.ui.enabled", True) \
    .config("spark.ui.port", 4040) \
    .getOrCreate()
```
Step 2: Analyze Spark UI
Access the Spark UI at http://driver-node-ip:4040:
- Jobs: View job execution details, including task execution times.
- Stages: Analyze stage execution, including shuffle operations.
- Tasks: Examine task execution on individual executor nodes.
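If you are unsure which host the driver is on, the UI address can be printed from the running application (reusing the SparkSession created in Step 1):

```python
# Reuses the SparkSession from Step 1
print(spark.sparkContext.uiWebUrl)   # e.g. http://<driver-host>:4040
```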
Step 3: Use Debugging Tools
- Spark logs: Inspect logs for driver, executor, and master nodes.
- Print statements: Add print statements to your script to track execution.
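Log verbosity can also be adjusted from inside the script (again reusing the Step 1 session); accepted levels include WARN, INFO, and DEBUG:

```python
# Reuses the SparkSession from Step 1
spark.sparkContext.setLogLevel("INFO")
```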
Real-Life Use Case:
Suppose you have a PySpark script that does the following:
```python
from pyspark.sql import SparkSession

# Driver: initialize the SparkSession and define the data reads (lazy)
spark = SparkSession.builder.appName("ETL").getOrCreate()
data = spark.read.parquet("input_data")
data2 = spark.read.parquet("input_data2")   # hypothetical second input for the join

# Driver: define data transformations (builds the DAG; nothing executes yet)
transformed_data = data.filter(data["age"] > 18).join(data2, "id")

# Executors: will run the filter, join, and the shuffle caused by repartition
transformed_data = transformed_data.repartition(10)

# Driver: triggers the action; executors execute the plan and write the output files
transformed_data.write.parquet("output_data")
```
In this example:
- The driver initializes the SparkSession, defines the reads and transformations, and triggers the action.
- The executors run the tasks assigned by the driver, including the filtering, joining, repartition shuffle, and the write.
- The master/YARN allocates resources, schedules the job's tasks onto executors, and monitors application progress.
By combining the Spark UI, the logs, and simple debugging aids, you can see exactly which parts of your script run on each component.