PySpark Architecture Cheat Sheet
1. Core Components of PySpark
Component | Description | Key Features |
---|---|---|
Spark Core | The foundational Spark component for scheduling, memory management, and fault tolerance. | Task scheduling, data partitioning, RDD APIs. |
Spark SQL | Enables interaction with structured data via SQL, DataFrames, and Datasets. | Supports SQL queries, schema inference, integration with Hive. |
Spark Streaming | Allows real-time data processing through micro-batching. | DStreams, integration with Kafka, Flume, etc. |
Spark MLlib | Provides scalable machine learning algorithms. | Algorithms for classification, clustering, regression, etc. |
Spark GraphX | Supports graph processing and analysis for complex networks. | Graph algorithms, graph-parallel computation. |
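A minimal sketch of how two of these components surface in PySpark code, assuming a local session (the app name and `local[*]` master are illustrative placeholders): the SparkContext exposes Spark Core's RDD API, while the SparkSession exposes Spark SQL.

```python
from pyspark.sql import SparkSession

# Illustrative local session; app name and master URL are placeholders.
spark = (
    SparkSession.builder
    .appName("core-components-demo")
    .master("local[*]")
    .getOrCreate()
)

# Spark Core: low-level RDD API through the underlying SparkContext.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())          # [2, 4, 6, 8]

# Spark SQL: structured data through DataFrames and SQL queries.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("demo")
spark.sql("SELECT id FROM demo WHERE label = 'a'").show()

spark.stop()
```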
2. PySpark Layered Architecture
Layer | Description | Key Functions |
---|---|---|
Application Layer | Contains user applications and custom PySpark code. | Custom data workflows and business logic. |
Spark API Layer | Provides PySpark APIs to interact with Spark Core and other components. | High-level abstractions for data manipulation, SQL, streaming. |
Spark Core Layer | Provides core functionalities including task scheduling and fault tolerance. | Data locality, memory management, RDD transformations. |
Execution Layer | Manages the execution of Spark tasks across the cluster. | Task scheduling, load balancing, error handling. |
Storage Layer | Manages data storage with distributed storage solutions. | Supports HDFS, S3, Cassandra, and other storage integrations. |
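A small illustration of the layering, again assuming a local session with placeholder names: code written against the API layer (a DataFrame) is backed by Spark Core's RDDs, whose partitions the execution layer schedules as tasks.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layers-demo").master("local[*]").getOrCreate()

# Application layer: user logic written against the high-level API layer.
df = spark.range(0, 1000).withColumnRenamed("id", "value")

# The API layer compiles down to the Core layer: the DataFrame is backed by
# an RDD of rows, whose partitions the execution layer schedules as tasks.
print(df.rdd.getNumPartitions())           # partitions processed in parallel
print(df.filter("value % 2 = 0").count())  # 500

spark.stop()
```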
3. Spark Cluster Architecture
Component | Description | Key Role |
---|---|---|
Driver Node | Runs the Spark application, creates the SparkContext, and coordinates task execution. | Manages jobs, scheduling, and resource allocation. |
Executor Nodes | Run tasks assigned by the driver node and process data on each partition. | Execute tasks, store intermediate results, and return output. |
Cluster Manager | Manages the resources of the Spark cluster. | Allocates resources, manages executor lifecycle. |
Distributed File System | Stores data across the cluster, ensuring high availability. | HDFS, S3, or other compatible storage for data sharing. |
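Roughly how these pieces are configured from the driver side. The master URL, memory, and core counts below are placeholder values, and in practice they are usually passed to `spark-submit` rather than hard-coded.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-demo")
    # The master URL selects the cluster manager: "yarn", "spark://host:7077",
    # a k8s:// URL, or "local[*]" to keep everything in one JVM.
    .master("local[*]")
    # Resources the cluster manager grants each executor (placeholder values).
    .config("spark.executor.memory", "2g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

# The driver coordinates the job; executors process partitions of this data.
print(spark.range(1_000_000).selectExpr("sum(id)").first()[0])

spark.stop()
```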
4. Spark Execution Process
Step | Description | Key Points |
---|---|---|
1. Application Submission | User submits a Spark application from the command line (e.g. with spark-submit) or a UI. | Starts the driver and initializes the SparkContext. |
2. Job Creation | Each action triggers a job, which Spark splits into stages and tasks based on the transformations. | Directed Acyclic Graph (DAG) creation based on data flow. |
3. Task Assignment | Driver assigns tasks to executors based on data locality and resource availability. | Utilizes data partitioning for parallelism. |
4. Task Execution | Executors run assigned tasks on data partitions. | Processes data in parallel across the cluster. |
5. Result Collection | Driver collects results from all executors, aggregates, and returns the final output. | Outputs final results to the user or designated storage. |
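A sketch of this flow, assuming a local session: transformations only build the DAG, `explain()` shows the physical plan the driver will cut into stages at shuffle boundaries, and the final action triggers task execution and result collection.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("execution-demo").master("local[*]").getOrCreate()

df = spark.range(0, 100_000)

# Transformations only build the DAG; no job is submitted yet.
grouped = df.withColumn("bucket", F.col("id") % 10).groupBy("bucket").count()

# The shuffle introduced by groupBy marks a stage boundary in the physical plan.
grouped.explain()

# The action triggers job creation, task assignment, parallel execution on
# the executors, and result collection back at the driver.
print(grouped.collect())

spark.stop()
```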
5. Spark RDD Architecture
Component | Description | Key Characteristics |
---|---|---|
RDD (Resilient Distributed Dataset) | Immutable, distributed collection of objects across the cluster. | Fault-tolerant, lineage-based recovery, in-memory processing. |
Partition | Subset of data within an RDD stored on a single node. | Enables parallel processing of data. |
Task | Smallest unit of work that operates on a single partition. | Executes transformations or actions on data. |
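A short RDD example, assuming a local session; the partition count is arbitrary. Each partition is processed by one task, and `toDebugString()` prints the lineage Spark would use to recompute lost partitions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# An RDD split into 4 partitions; each partition is processed by one task.
rdd = sc.parallelize(range(12), numSlices=4)
print(rdd.getNumPartitions())                 # 4

# Transformations are recorded as lineage, used to recompute lost partitions.
evens = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(evens.toDebugString().decode())         # prints the lineage graph
print(evens.collect())                        # [0, 4, 16, 36, 64, 100]

spark.stop()
```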
6. Spark DataFrame Architecture
Component | Description | Key Characteristics |
---|---|---|
DataFrame | Distributed collection of data organized into named columns. | Schema-based data handling, optimized storage, SQL compatibility. |
Dataset | Strongly-typed distributed collection of data (Java/Scala only; not exposed in PySpark). | Type safety, combines features of RDDs and DataFrames. |
Encoder | Converts data between JVM objects and Spark’s internal format for optimized serialization. | Efficient serialization/deserialization for faster processing. |
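In PySpark you work with DataFrames; Datasets and Encoders live on the JVM side. A minimal DataFrame sketch with an explicit schema, assuming a local session and made-up column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("dataframe-demo").master("local[*]").getOrCreate()

# Explicit schema: named, typed columns backing the DataFrame.
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
])

df = spark.createDataFrame([("Ada", 36), ("Grace", 45)], schema)
df.printSchema()
df.filter(df.age > 40).show()

spark.stop()
```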
7. Spark SQL Architecture
Component | Description | Key Functions |
---|---|---|
Catalyst Optimizer | Optimizes Spark SQL queries for enhanced performance. | Rule-based and cost-based optimization of logical and physical plans. |
Query Planner | Plans the execution of SQL queries by selecting the best execution strategy. | Converts optimized logical plan into physical execution plan. |
Execution Engine | Executes SQL queries using Spark’s distributed computing framework. | Leverages cluster resources for parallel query execution. |
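One way to peek at this pipeline, assuming a local session (the view name and query are illustrative): `explain(extended=True)` prints the parsed, analyzed, and optimized logical plans produced by Catalyst, plus the physical plan chosen by the query planner for the execution engine to run.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()

spark.range(0, 1000).selectExpr("id", "id % 3 AS grp").createOrReplaceTempView("numbers")

query = spark.sql("SELECT grp, count(*) AS n FROM numbers WHERE id > 10 GROUP BY grp")

# Parsed, analyzed, and optimized logical plans (Catalyst) and the selected
# physical plan, followed by distributed execution of the query itself.
query.explain(extended=True)
query.show()

spark.stop()
```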
8. Spark Streaming Architecture
Component | Description | Key Features |
---|---|---|
DStream (Discretized Stream) | Continuous data stream split into micro-batches for processing. | Batch processing in near real-time. |
Receiver | Ingests data from external sources like Kafka, Flume, etc. | Acts as data source for streaming jobs. |
Processor | Processes each micro-batch of a DStream by applying transformations and actions. | Provides transformations similar to RDDs. |
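A minimal sketch of the classic DStream API (availability depends on your Spark version). It assumes a text source on localhost:9999, for example `nc -lk 9999`; the host, port, and 5-second batch interval are placeholders.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")  # >= 2 cores: receiver + processing
ssc = StreamingContext(sc, batchDuration=5)      # 5-second micro-batches

# Receiver: ingests text from a socket source (placeholder host/port).
lines = ssc.socketTextStream("localhost", 9999)

# RDD-like transformations applied to each micro-batch of the DStream.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTerminationOrTimeout(30)                # run ~30 s for the demo
ssc.stop(stopSparkContext=True, stopGraceFully=True)
```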
9. Spark Master-Slave Architecture
Component | Description | Key Role |
---|---|---|
Master Node | Coordinates resource allocation, task scheduling, and overall job management. | Central controller for Spark cluster. |
Worker Nodes | Execute tasks on assigned data partitions as directed by the Master. | Run computations, store data, and handle intermediate results. |
Executor | Process running on each worker node, responsible for executing tasks. | Runs task code, caches data, and sends results to the driver. |
Task | Smallest unit of work on data partitions assigned to executors by the driver. | Executes transformations or actions on data partitions. |
Driver Program | Initiates the Spark application and coordinates overall job execution. | Requests resources from the master, schedules tasks on executors, and receives results. |
Cluster Manager | Manages resources for the entire Spark cluster (YARN, Mesos, Kubernetes). | Manages the lifecycle of executors and resource distribution. |
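A driver-side sketch, assuming a local master; on a real cluster the master URL would point at the cluster manager (a standalone master, YARN, or Kubernetes), and the properties printed below reflect that topology.

```python
from pyspark.sql import SparkSession

# "local[*]" keeps the whole master/worker topology inside one machine; on a
# cluster the URL would be e.g. "spark://master-host:7077" or "yarn".
spark = SparkSession.builder.appName("topology-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

print(sc.master)               # where the driver registered
print(sc.applicationId)        # id assigned to this driver program
print(sc.defaultParallelism)   # default number of tasks per stage

# The driver splits this computation into tasks and ships them to executors.
print(sc.parallelize(range(1000)).sum())   # 499500

spark.stop()
```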
10. Key Concepts in PySpark
Concept | Description | Benefits |
---|---|---|
Lazy Evaluation | Transformations are not executed until an action is called. | Optimizes query execution by grouping operations. |
Fault Tolerance | Spark recovers lost RDDs using lineage information when nodes fail. | Increases reliability in distributed environments. |
In-Memory Processing | Stores intermediate data in memory instead of writing to disk. | Enables faster data processing by avoiding I/O overhead. |
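A small sketch of lazy evaluation and in-memory caching, assuming a local session; the dataset size and filter threshold are arbitrary.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("concepts-demo").master("local[*]").getOrCreate()

df = spark.range(0, 2_000_000).withColumn("x", F.rand(seed=42))

# Lazy evaluation: these lines only record transformations in the DAG.
expensive = df.withColumn("y", F.col("x") * F.col("x")).filter("y > 0.5")

# In-memory processing: cache() keeps computed partitions in executor memory,
# so the second action is served from memory instead of being recomputed.
expensive.cache()
print(expensive.count())   # first action materializes the cache
print(expensive.count())   # second action reads cached partitions

spark.stop()
```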
11. Common Use Cases for PySpark
- Batch Processing: Large-scale ETL (Extract, Transform, Load) jobs.
- Stream Processing: Real-time analytics, monitoring systems.
- Machine Learning: Training models at scale using Spark MLlib.
- Graph Processing: Social network analysis, recommendation systems.
- Data Warehousing: Leveraging Spark SQL for querying structured datasets.