PySpark Architecture Cheat Sheet
1. Core Components of PySpark
Component | Description | Key Features |
---|---|---|
Spark Core | The foundational Spark component for scheduling, memory management, and fault tolerance. | Task scheduling, data partitioning, RDD APIs. |
Spark SQL | Enables interaction with structured data via SQL, DataFrames, and Datasets. | Supports SQL queries, schema inference, integration with Hive. |
Spark Streaming | Allows real-time data processing through micro-batching. | DStreams, integration with Kafka, Flume, etc. |
Spark MLlib | Provides scalable machine learning algorithms. | Algorithms for classification, clustering, regression, etc. |
Spark GraphX | Supports graph processing and analysis for complex networks. | Graph algorithms, graph-parallel computation. |
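A minimal sketch of how these components surface in a single PySpark program (runs locally; the sample rows and column names are purely illustrative):

```python
from pyspark.sql import SparkSession

# Spark Core + Spark SQL: the SparkSession is the entry point to both
spark = SparkSession.builder.appName("ComponentsDemo").getOrCreate()

# Spark SQL / DataFrame API: structured data with named columns
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.createOrReplaceTempView("people")
spark.sql("SELECT count(*) AS n FROM people").show()

# Spark Core: the underlying RDD API is still reachable via the SparkContext
rdd = spark.sparkContext.parallelize(range(10))
print(rdd.sum())

spark.stop()
```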
2. PySpark Layered Architecture
Layer | Description | Key Functions |
---|---|---|
Application Layer | Contains user applications and custom PySpark code. | Custom data workflows and business logic. |
Spark API Layer | Provides PySpark APIs to interact with Spark Core and other components. | High-level abstractions for data manipulation, SQL, streaming. |
Spark Core Layer | Provides core functionalities including task scheduling and fault tolerance. | Data locality, memory management, RDD transformations. |
Execution Layer | Manages the execution of Spark tasks across the cluster. | Task scheduling, load balancing, error handling. |
Storage Layer | Manages data storage with distributed storage solutions. | Supports HDFS, S3, Cassandra, and other storage integrations. |
3. Spark Cluster Architecture
Component | Description | Key Role |
---|---|---|
Driver Node | Runs the Spark application, creates the SparkContext, and coordinates task execution. | Manages jobs, scheduling, and resource allocation. |
Executor Nodes | Run tasks assigned by the driver node and process data on each partition. | Execute tasks, store intermediate results, and return output. |
Cluster Manager | Manages the resources of the Spark cluster. | Allocates resources, manages executor lifecycle. |
Distributed File System | Stores data across the cluster, ensuring high availability. | HDFS, S3, or other compatible storage for data sharing. |
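In code, the driver is the process that builds the SparkSession, while the cluster manager and executor resources are chosen through configuration. A hedged sketch with placeholder values (the master URL and resource sizes are assumptions, not recommendations):

```python
from pyspark.sql import SparkSession

# Hypothetical YARN deployment; adjust the master and resources to your cluster
spark = (
    SparkSession.builder
    .appName("ClusterDemo")
    .master("yarn")                           # cluster manager: yarn, k8s://..., spark://..., or local[*]
    .config("spark.executor.instances", "4")  # number of executor processes
    .config("spark.executor.memory", "4g")    # memory per executor
    .config("spark.executor.cores", "2")      # cores per executor
    .config("spark.driver.memory", "2g")      # memory for the driver process
    .getOrCreate()
)
```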
4. Spark Execution Process
Step | Description | Key Points |
---|---|---|
1. Application Submission | User submits a Spark application via command line or a UI. | Initializes job creation in Spark. |
2. Job Creation | Spark creates jobs and splits them into stages and tasks based on transformations. | Directed Acyclic Graph (DAG) creation based on data flow. |
3. Task Assignment | Driver assigns tasks to executors based on data locality and resource availability. | Utilizes data partitioning for parallelism. |
4. Task Execution | Executors run assigned tasks on data partitions. | Processes data in parallel across the cluster. |
5. Result Collection | Driver collects results from all executors, aggregates, and returns the final output. | Outputs final results to the user or designated storage. |
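These steps can be observed from a script: transformations only extend the DAG, and the final action is what submits jobs, stages, and tasks. A small sketch (column expressions are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ExecutionDemo").getOrCreate()

df = spark.range(1_000_000)                   # no job yet
filtered = df.filter(F.col("id") % 2 == 0)    # transformation: DAG grows, still no job
aggregated = filtered.groupBy((F.col("id") % 10).alias("bucket")).count()

aggregated.explain()          # inspect the physical plan Spark derived from the DAG
result = aggregated.collect() # action: tasks are scheduled onto executors and results return to the driver
```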
5. Spark RDD Architecture
Component | Description | Key Characteristics |
---|---|---|
RDD (Resilient Distributed Dataset) | Immutable, distributed collection of objects across the cluster. | Fault-tolerant, lineage-based recovery, in-memory processing. |
Partition | Subset of data within an RDD stored on a single node. | Enables parallel processing of data. |
Task | Smallest unit of work that operates on a single partition. | Executes transformations or actions on data. |
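A small RDD sketch showing partitions, the tasks that run on them, and the lineage used for recovery:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), numSlices=4)   # 4 partitions -> 4 tasks per stage
print(rdd.getNumPartitions())                   # 4

# Each task applies the function to its own partition in parallel
squares = rdd.map(lambda x: x * x)
print(squares.take(5))

# The lineage (debug string) is what Spark replays to rebuild a lost partition
print(squares.toDebugString().decode())
```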
6. Spark DataFrame Architecture
Component | Description | Key Characteristics |
---|---|---|
DataFrame | Distributed collection of data organized into named columns. | Schema-based data handling, optimized storage, SQL compatibility. |
Dataset | Strongly-typed distributed collection of data (Java/Scala only; not exposed in PySpark). | Type safety, combines features of RDDs and DataFrames. |
Encoder | Converts data between JVM objects and Spark’s internal format for optimized serialization. | Efficient serialization/deserialization for faster processing. |
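A DataFrame sketch showing schema-based handling (Datasets and Encoders are JVM-side concepts with no direct PySpark equivalent; the schema and rows below are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])

df = spark.createDataFrame([(1, "alice"), (2, "bob")], schema=schema)
df.printSchema()                               # named, typed columns
df.select("name").where(df["id"] > 1).show()   # columnar operations with SQL semantics
```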
7. Spark SQL Architecture
Component | Description | Key Functions |
---|---|---|
Catalyst Optimizer | Optimizes Spark SQL queries for enhanced performance. | Logical plan, physical plan optimization. |
Query Planner | Plans the execution of SQL queries by selecting the best execution strategy. | Converts optimized logical plan into physical execution plan. |
Execution Engine | Executes SQL queries using Spark’s distributed computing framework. | Leverages cluster resources for parallel query execution. |
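You can watch the Catalyst optimizer and query planner at work by asking a query for the plans it produced; a small sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLDemo").getOrCreate()

df = spark.range(1000).withColumnRenamed("id", "user_id")
df.createOrReplaceTempView("users")

query = spark.sql("SELECT user_id FROM users WHERE user_id > 500")

# extended=True prints the parsed, analyzed, and optimized logical plans plus the physical plan
query.explain(extended=True)
```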
8. Spark Streaming Architecture
Component | Description | Key Features |
---|---|---|
DStream (Discretized Stream) | Continuous data stream split into micro-batches for processing. | Batch processing in near real-time. |
Receiver | Ingests data from external sources like Kafka, Flume, etc. | Acts as data source for streaming jobs. |
Processor | Processes data within DStream by applying transformations and actions. | Provides transformations similar to RDDs. |
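A minimal sketch using the classic DStream API (Structured Streaming is the newer alternative); the socket host and port are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingDemo")
ssc = StreamingContext(sc, batchDuration=5)   # micro-batches every 5 seconds

# Receiver: ingest lines from a socket source (hypothetical host/port)
lines = ssc.socketTextStream("localhost", 9999)

# Processor: RDD-like transformations applied to each micro-batch
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```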
9. Spark Master-Slave Architecture
Component | Description | Key Role |
---|---|---|
Master Node | Coordinates resource allocation, task scheduling, and overall job management. | Central controller for Spark cluster. |
Worker Nodes | Execute tasks on assigned data partitions as directed by the Master. | Run computations, store data, and handle intermediate results. |
Executor | Process-running unit on each worker node, responsible for executing tasks. | Runs task code, caches data, and sends results to the driver. |
Task | Smallest unit of work on data partitions assigned to executors by the driver. | Executes transformations or actions on data partitions. |
Driver Program | Initiates the Spark application and coordinates overall job execution. | Submits tasks to the master and receives results. |
Cluster Manager | Manages resources for the entire Spark cluster (YARN, Mesos, Kubernetes). | Manages the lifecycle of executors and resource distribution. |
10. Key Concepts in PySpark
Concept | Description | Benefits |
---|---|---|
Lazy Evaluation | Transformations are not executed until an action is called. | Optimizes query execution by grouping operations. |
Fault Tolerance | Spark recovers lost RDDs using lineage information when nodes fail. | Increases reliability in distributed environments. |
In-Memory Processing | Stores intermediate data in memory instead of writing to disk. | Enables faster data processing by avoiding I/O overhead. |
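All three concepts in one sketch: transformations stay lazy until an action, caching keeps the computed result in memory, and lineage is what Spark replays if a cached partition is lost:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ConceptsDemo").getOrCreate()

df = spark.range(10_000_000)
expensive = df.withColumn("value", F.sqrt(F.col("id")))   # lazy: nothing runs yet

expensive.cache()          # ask Spark to keep the result in memory once computed
print(expensive.count())   # action 1: computes the DataFrame and populates the cache
print(expensive.count())   # action 2: served from memory, no recomputation

# If an executor is lost, cached partitions are rebuilt from lineage automatically
```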
11. Common Use Cases for PySpark
- Batch Processing: Large-scale ETL (Extract, Transform, Load) jobs.
- Stream Processing: Real-time analytics, monitoring systems.
- Machine Learning: Training models at scale using Spark MLlib.
- Graph Processing: Social network analysis, recommendation systems.
- Data Warehousing: Leveraging Spark SQL for querying structured datasets.
Which parts of your PySpark ETL script are executed on the driver, master (YARN), or executors
Understanding how PySpark scripts execute across the different nodes in a cluster is crucial for optimization and debugging. Here’s a breakdown of which parts of your script run on the driver, the master/YARN, the executors, and the NameNode (HDFS):
Driver:
- Script initialization: SparkSession creation, configuration, and setting up the Spark context.
- Data ingestion: Defining reads from sources (e.g., files, databases); the actual I/O runs on executors once an action is triggered.
- Data transformation definitions: Defining transformations (e.g., maps, filters, joins).
- Action execution: Calling actions (e.g., count(), show(), write()).
Executor:
- Task execution: Executing tasks assigned by the driver (e.g., mapping, filtering, joining).
- Data processing: Processing data in parallel across executor nodes.
- Shuffle operations: Exchanging data between executors during shuffle operations.
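A quick way to see the split in practice: code in the main script body runs on the driver, while functions passed into transformations run inside executor tasks. A small sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WhereDoesItRun").getOrCreate()
print("This print runs on the driver")   # driver-side code

def double_partition(rows):
    # This function body runs inside an executor task, once per partition;
    # anything it prints appears in executor logs, not the driver console.
    for row in rows:
        yield (row.id, row.id * 2)

rdd = spark.range(100).rdd.mapPartitions(double_partition)
print(rdd.take(3))   # action: the driver triggers tasks, executors run double_partition
```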
Master/YARN:
- Resource management: Managing resources (e.g., memory, CPU) for the Spark application.
- Job scheduling: Scheduling jobs and tasks on executor nodes.
- Monitoring: Tracking application progress and performance.
NameNode (HDFS):
- Data storage: Storing data in HDFS.
- Metadata management: Managing file system metadata.
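When a path points at HDFS, the driver asks the NameNode only for metadata (block locations), and the executors read the blocks themselves when an action runs. The NameNode host and path below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HDFSDemo").getOrCreate()

# Hypothetical HDFS location; the NameNode serves only the metadata for this path
df = spark.read.parquet("hdfs://namenode-host:8020/data/events")

# The actual block reads happen inside executor tasks when an action runs
print(df.count())
```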
To identify which parts of your script run on each component, follow these steps:
Step 1: Enable Spark UI
Add the following configuration to your SparkSession:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Your App Name") \
    .config("spark.ui.enabled", True) \
    .config("spark.ui.port", 4040) \
    .getOrCreate()
```
Step 2: Analyze Spark UI
Access the Spark UI at http://driver-node-ip:4040:
- Jobs: View job execution details, including task execution times.
- Stages: Analyze stage execution, including shuffle operations.
- Tasks: Examine task execution on individual executor nodes.
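If you are unsure which host the driver is on, the UI address can be printed from the running application (reusing the SparkSession created in Step 1):

```python
# Reuses the SparkSession from Step 1
print(spark.sparkContext.uiWebUrl)   # e.g. http://<driver-host>:4040
```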
Step 3: Use Debugging Tools
- Spark logs: Inspect logs for driver, executor, and master nodes.
- Print statements: Add print statements to your script to track execution.
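Log verbosity can also be adjusted from inside the script (again reusing the Step 1 session); accepted levels include WARN, INFO, and DEBUG:

```python
# Reuses the SparkSession from Step 1
spark.sparkContext.setLogLevel("INFO")
```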
Real-Life Use Case:
Suppose you have a PySpark script that does the following:
```python
from pyspark.sql import SparkSession

# Driver: initialize the SparkSession and define the data reads (lazy)
spark = SparkSession.builder.appName("ETL").getOrCreate()
data = spark.read.parquet("input_data")
data2 = spark.read.parquet("input_data2")   # hypothetical second input for the join

# Driver: define data transformations (builds the DAG; nothing executes yet)
transformed_data = data.filter(data["age"] > 18).join(data2, "id")

# Executors: will run the filter, join, and the shuffle caused by repartition
transformed_data = transformed_data.repartition(10)

# Driver: triggers the action; executors execute the plan and write the output files
transformed_data.write.parquet("output_data")
```
In this example:
- The driver initializes the SparkSession, defines the reads and transformations, and triggers the action.
- The executors run the tasks assigned by the driver, including the filtering, joining, repartition shuffle, and the write.
- The master/YARN allocates resources, schedules the job's tasks onto executors, and monitors application progress.
By combining the Spark UI, the logs, and simple debugging aids, you can see exactly which parts of your script run on each component.