PySpark Architecture Cheat Sheet: How to Know Which Parts of Your PySpark ETL Script Run on the Driver, the Master (YARN), or the Executors
by Team AHT | Nov 16, 2024 | PySpark
PySpark Architecture Cheat Sheet
1. Core Components of PySpark
| Component | Description | Key Features |
|---|---|---|
| Spark Core | The foundational Spark component for scheduling, memory management, and fault tolerance. | Task scheduling, data partitioning, RDD APIs. |
| Spark SQL | Enables interaction with structured data via SQL, DataFrames, and Datasets. | Supports SQL queries, schema inference, integration with Hive. |
| Spark Streaming | Allows real-time data processing through micro-batching. | DStreams, integration with Kafka, Flume, etc. |
| Spark MLlib | Provides scalable machine learning algorithms. | Algorithms for classification, clustering, regression, etc. |
| Spark GraphX | Supports graph processing and analysis for complex networks. | Graph algorithms, graph-parallel computation. |
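To make the first two rows concrete, here is a minimal, hypothetical sketch (local master, made-up app name and data): it touches Spark SQL through a DataFrame and temp view, and Spark Core through the underlying RDD API.

```python
from pyspark.sql import SparkSession

# Local SparkSession; the builder is the standard entry point.
spark = SparkSession.builder.appName("components-demo").master("local[*]").getOrCreate()

# Spark SQL component: structured data as a DataFrame, queried with SQL.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("t")
spark.sql("SELECT count(*) AS n FROM t").show()

# Spark Core component: the lower-level RDD API via the SparkContext.
rdd = spark.sparkContext.parallelize(range(10))
print(rdd.map(lambda x: x * x).sum())

spark.stop()
```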
2. PySpark Layered Architecture
| Layer | Description | Key Functions |
|---|---|---|
| Application Layer | Contains user applications and custom PySpark code. | Custom data workflows and business logic. |
| Spark API Layer | Provides PySpark APIs to interact with Spark Core and other components. | High-level abstractions for data manipulation, SQL, streaming. |
| Spark Core Layer | Provides core functionalities including task scheduling and fault tolerance. | Data locality, memory management, RDD transformations. |
| Execution Layer | Manages the execution of Spark tasks across the cluster. | Task scheduling, load balancing, error handling. |
| Storage Layer | Manages data storage with distributed storage solutions. | Supports HDFS, S3, Cassandra, and other storage integrations. |
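A rough illustration of the layering (names and data are arbitrary, local master assumed): the application code talks to the API layer, explain() surfaces the plan the execution layer will run, and .rdd exposes the core layer underneath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("layers-demo").master("local[*]").getOrCreate()

# Application layer: business logic expressed against the API layer (DataFrames).
df = spark.range(1_000).withColumn("bucket", F.col("id") % 10)
agg = df.groupBy("bucket").count()

# Execution layer: inspect the physical plan Spark will actually run.
agg.explain()

# Core layer: the DataFrame is ultimately backed by partitioned RDDs.
print("partitions:", agg.rdd.getNumPartitions())

spark.stop()
```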
3. Spark Cluster Architecture
| Component | Description | Key Role |
|---|---|---|
| Driver Node | Runs the Spark application, creates the SparkContext, and coordinates task execution. | Manages jobs, scheduling, and resource allocation. |
| Executor Nodes | Run tasks assigned by the driver node and process data on each partition. | Execute tasks, store intermediate results, and return output. |
| Cluster Manager | Manages the resources of the Spark cluster. | Allocates resources, manages executor lifecycle. |
| Distributed File System | Stores data across the cluster, ensuring high availability. | HDFS, S3, or other compatible storage for data sharing. |
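This split is exactly what the title asks about. A hedged sketch (run with a local master here for convenience; on a real cluster the same rules apply): top-level statements and collected results live in the driver process, while the body of a UDF or lambda passed to a transformation is serialized and executed inside executor processes.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Driver side: this top-level code runs in the driver process.
spark = SparkSession.builder.appName("where-it-runs").master("local[*]").getOrCreate()
df = spark.range(100)            # the driver only builds the plan here

# Executor side: the body of a UDF is shipped to executors and run there,
# one task per partition (in local mode, in local Python worker processes).
@F.udf(returnType=IntegerType())
def double(x):
    return x * 2                 # executes on the executors

result = df.withColumn("doubled", double("id")).collect()  # action: executors compute,
print(len(result))                                         # driver receives the rows

spark.stop()
```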
4. Spark Execution Process
| Step | Description | Key Points |
|---|---|---|
| 1. Application Submission | User submits a Spark application via the command line or a UI. | Initializes job creation in Spark. |
| 2. Job Creation | Spark creates jobs and splits them into stages and tasks based on transformations. | Directed Acyclic Graph (DAG) creation based on data flow. |
| 3. Task Assignment | Driver assigns tasks to executors based on data locality and resource availability. | Utilizes data partitioning for parallelism. |
| 4. Task Execution | Executors run assigned tasks on data partitions. | Processes data in parallel across the cluster. |
| 5. Result Collection | Driver collects results from all executors, aggregates them, and returns the final output. | Outputs final results to the user or designated storage. |
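A minimal sketch of steps 2–5 (sample data and partition count are arbitrary): transformations only record lineage in the DAG, and the action at the end is what triggers task assignment, parallel execution on the executors, and result collection on the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("execution-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Step 2: transformations are only recorded in the DAG; nothing executes yet.
rdd = sc.parallelize(range(1, 1001), numSlices=8)
pipeline = rdd.map(lambda x: (x % 10, x)).reduceByKey(lambda a, b: a + b)

# Inspect the lineage (DAG) the driver has built so far.
lineage = pipeline.toDebugString()
print(lineage.decode() if isinstance(lineage, bytes) else lineage)

# Steps 3-5: the action triggers task assignment, parallel execution on
# executors, and collection of the result back on the driver.
print(pipeline.count())

spark.stop()
```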
5. Spark RDD Architecture
| Component | Description | Key Characteristics |
|---|---|---|
| RDD (Resilient Distributed Dataset) | Immutable, distributed collection of objects across the cluster. | Fault-tolerant, lineage-based recovery, in-memory processing. |
| Partition | Subset of data within an RDD stored on a single node. | Enables parallel processing of data. |
| Task | Smallest unit of work that operates on a single partition. | Executes transformations or actions on data. |
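A short, illustrative sketch (values and partition count are made up): each partition is processed by one task, glom() exposes the partition boundaries, and mapPartitions() runs the supplied function once per partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").master("local[4]").getOrCreate()
sc = spark.sparkContext

# An RDD split into 4 partitions; each partition is handled by one task.
rdd = sc.parallelize(range(12), numSlices=4)
print(rdd.getNumPartitions())          # 4
print(rdd.glom().collect())            # elements grouped per partition

# mapPartitions runs once per partition (i.e. once per task).
print(rdd.mapPartitions(lambda it: [sum(it)]).collect())

spark.stop()
```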
6. Spark DataFrame Architecture
| Component | Description | Key Characteristics |
|---|---|---|
| DataFrame | Distributed collection of data organized into named columns. | Schema-based data handling, optimized storage, SQL compatibility. |
| Dataset | Strongly-typed distributed collection of data (Java/Scala). | Type safety, combines features of RDDs and DataFrames. |
| Encoder | Converts data between JVM objects and Spark's internal format for optimized serialization. | Efficient serialization/deserialization for faster processing. |
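In PySpark, Datasets and Encoders stay on the JVM side; day-to-day work uses DataFrames. A minimal sketch with an explicit schema (names and values are invented):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("dataframe-demo").master("local[*]").getOrCreate()

# Declare the schema explicitly instead of relying on inference.
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
])

df = spark.createDataFrame([("Ada", 36), ("Linus", 29)], schema=schema)
df.printSchema()                          # schema-based data handling
df.select("name").where(df.age > 30).show()

spark.stop()
```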
7. Spark SQL Architecture
| Component | Description | Key Functions |
|---|---|---|
| Catalyst Optimizer | Optimizes Spark SQL queries for enhanced performance. | Logical plan and physical plan optimization. |
| Query Planner | Plans the execution of SQL queries by selecting the best execution strategy. | Converts the optimized logical plan into a physical execution plan. |
| Execution Engine | Executes SQL queries using Spark's distributed computing framework. | Leverages cluster resources for parallel query execution. |
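One way to see the Catalyst optimizer and query planner at work is explain(True), which prints the parsed, analyzed, and optimized logical plans along with the selected physical plan. A small sketch (table and column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()

df = spark.createDataFrame([(1, "US"), (2, "IN"), (3, "US")], ["order_id", "country"])
df.createOrReplaceTempView("orders")

query = spark.sql("SELECT country, COUNT(*) AS orders FROM orders GROUP BY country")

# Catalyst optimizer + query planner output: logical plans and the physical plan.
query.explain(True)

# Execution engine: the physical plan runs in parallel across executors.
query.show()

spark.stop()
```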
8. Spark Streaming Architecture
| Component | Description | Key Features |
|---|---|---|
| DStream (Discretized Stream) | Continuous data stream split into micro-batches for processing. | Batch processing in near real-time. |
| Receiver | Ingests data from external sources like Kafka, Flume, etc. | Acts as the data source for streaming jobs. |
| Processor | Processes data within a DStream by applying transformations and actions. | Provides transformations similar to RDDs. |
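A classic DStream word count, sketched with the legacy pyspark.streaming API (newer code generally uses Structured Streaming instead). It assumes a text source on localhost:9999, e.g. started with `nc -lk 9999`.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one for the receiver, one for processing.
sc = SparkContext("local[2]", "dstream-wordcount")
ssc = StreamingContext(sc, batchDuration=5)      # 5-second micro-batches

# Receiver: ingest lines from a socket source (assumed to be running).
lines = ssc.socketTextStream("localhost", 9999)

# Processing: RDD-like transformations applied to each micro-batch.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```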
9. Spark Master-Slave Architecture
| Component | Description | Key Role |
|---|---|---|
| Master Node | Coordinates resource allocation, task scheduling, and overall job management. | Central controller for the Spark cluster. |
| Worker Nodes | Execute tasks on assigned data partitions as directed by the Master. | Run computations, store data, and handle intermediate results. |
| Executor | Process running on each worker node, responsible for executing tasks. | Runs task code, caches data, and sends results to the driver. |
| Task | Smallest unit of work on data partitions, assigned to executors by the driver. | Executes transformations or actions on data partitions. |
| Driver Program | Initiates the Spark application and coordinates overall job execution. | Submits tasks to the master and receives results. |
| Cluster Manager | Manages resources for the entire Spark cluster (YARN, Mesos, Kubernetes). | Manages the lifecycle of executors and resource distribution. |
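A hedged sketch of how the driver program describes the resources it wants from the cluster manager (YARN here). The values are placeholders, a reachable YARN cluster is assumed, and in practice these settings are usually passed via spark-submit or a config file rather than hard-coded.

```python
from pyspark.sql import SparkSession

# Driver program: declares the resources the cluster manager should allocate.
spark = (SparkSession.builder
         .appName("cluster-demo")
         .master("yarn")                               # cluster manager (assumed available)
         .config("spark.executor.instances", "4")      # number of executors (placeholder)
         .config("spark.executor.cores", "2")          # parallel tasks per executor
         .config("spark.executor.memory", "4g")        # memory per executor
         .config("spark.driver.memory", "2g")          # memory for the driver
         .getOrCreate())

# The action below becomes jobs -> stages -> tasks executed on the worker nodes.
print(spark.range(1_000_000).selectExpr("sum(id)").first()[0])

spark.stop()
```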
10. Key Concepts in PySpark
| Concept | Description | Benefits |
|---|---|---|
| Lazy Evaluation | Transformations are not executed until an action is called. | Optimizes query execution by grouping operations. |
| Fault Tolerance | Spark recovers lost RDDs using lineage information when nodes fail. | Increases reliability in distributed environments. |
| In-Memory Processing | Stores intermediate data in memory instead of writing to disk. | Enables faster data processing by avoiding I/O overhead. |
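A small sketch of lazy evaluation and in-memory processing (data volume is arbitrary): nothing executes until the first action, and cache() lets later actions reuse the materialized result instead of recomputing it. The same recorded lineage is what fault tolerance uses to rebuild lost partitions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("concepts-demo").master("local[*]").getOrCreate()

# Lazy evaluation: these transformations only build a plan.
df = (spark.range(5_000_000)
      .withColumn("mod", F.col("id") % 7)
      .groupBy("mod").count())

# In-memory processing: mark the result for caching; the cache is filled
# by the first action and reused by subsequent ones.
df.cache()
df.count()      # first action: computes and materializes the cache
df.count()      # second action: served from memory

# Fault tolerance: if a cached partition is lost, Spark recomputes it
# from the lineage recorded in the plan above.

spark.stop()
```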
11. Common Use Cases for PySpark
Batch Processing: Large-scale ETL (Extract, Transform, Load) jobs (see the minimal sketch after this list).
Stream Processing: Real-time analytics, monitoring systems.
Machine Learning: Training models at scale using Spark MLlib.
Graph Processing: Social network analysis, recommendation systems.
Data Warehousing: Leveraging Spark SQL for querying structured datasets.
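As referenced in the batch-processing item above, a minimal ETL sketch: extract from a source, transform with the DataFrame API, and load to a target. The paths, column names, and formats are placeholders, not a real pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw data (path and format are placeholders).
raw = spark.read.option("header", "true").csv("s3a://my-bucket/raw/sales/")

# Transform: clean and aggregate with the DataFrame API.
daily = (raw.withColumn("amount", F.col("amount").cast("double"))
            .filter(F.col("amount") > 0)
            .groupBy("sale_date")
            .agg(F.sum("amount").alias("total_amount")))

# Load: write the result back out, partitioned by date (path is a placeholder).
daily.write.mode("overwrite").partitionBy("sale_date").parquet("s3a://my-bucket/curated/daily_sales/")

spark.stop()
```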