Here’s a PySpark cheat sheet designed for a quick yet comprehensive revision of PySpark concepts, architecture, optimizations, commands, and common operations. This guide is structured to cover the essentials from architecture to data processing and Spark SQL.
1. PySpark Architecture Basics
Component | Description
Driver Program | Main program; responsible for creating the SparkContext, defining transformations, and coordinating actions.
SparkContext | Core connection to the Spark cluster; the entry point for creating RDDs (DataFrames are created through the SparkSession built on top of it).
Cluster Manager | Manages resources across nodes (e.g., YARN, Mesos, or Standalone mode).
Executors | Processes on worker nodes that execute tasks, store data, and return results.
Tasks | Individual units of work sent to executors by the driver program.
Job | Each action in Spark triggers a job, consisting of multiple stages and tasks.
Stage | Jobs are broken into stages at shuffle boundaries.
DAG Scheduler | Builds a DAG (Directed Acyclic Graph) of stages for each job and submits them as task sets for execution.
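The sketch below shows how these pieces come together in code: the driver creates a SparkSession (which wraps the SparkContext), and calling an action triggers a job that the DAG scheduler splits into stages and tasks. The application name and local master URL are placeholders.
from pyspark.sql import SparkSession

# The driver program creates a SparkSession, the modern entry point that
# wraps the SparkContext. "local[*]" runs Spark locally on all cores;
# on a cluster, point the master at YARN, Mesos, or a standalone master.
spark = (
    SparkSession.builder
    .appName("cheat-sheet-demo")   # placeholder application name
    .master("local[*]")
    .getOrCreate()
)

sc = spark.sparkContext            # the underlying SparkContext

# Calling an action (count) triggers a job, which is broken into stages
# at shuffle boundaries and executed as tasks on the executors.
df = spark.range(10)
print(df.count())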
Optimization | Technique / Command | Description
Use Broadcast Joins | broadcast(small_df) in a join | Optimize joins for small tables using a broadcast join.
Avoid Using Too Many Partitions | spark.sql.shuffle.partitions = 200 | Set a reasonable number of partitions for shuffling.
Partitioning on Keys | df.write.partitionBy("col").parquet("path") | Partition data by column to improve reading efficiency.
Avoid Data Skew | Use salting or avoid uneven distribution in join columns. | Prevent performance issues caused by skewed data.
Filter Early | Apply filters early to reduce data size for downstream ops. | Reduces data processing and memory usage.
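A short sketch tying these tips together. It assumes an existing SparkSession named spark and two illustrative DataFrames, orders (large) and countries (small); the column names and output path are placeholders, not part of the original cheat sheet.
from pyspark.sql.functions import broadcast, col

# Hint that the small dimension table should be broadcast to every executor,
# avoiding a full shuffle of the large table during the join.
joined = orders.join(broadcast(countries), on="country_code")

# Filter early so downstream operations see less data.
recent = joined.filter(col("order_date") >= "2024-01-01")

# Keep the shuffle partition count reasonable for the data volume.
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Partition the output by a commonly filtered column to speed up later reads.
recent.write.partitionBy("country_code").parquet("/tmp/orders_by_country")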
8. Handling Missing Data
Operation | Command | Description
Drop Missing Values | df.na.drop() | Drop rows containing any null or NaN values.
Fill Missing Values | df.na.fill(value) | Fill all null values with a specific value.
Fill Specific Columns | df.na.fill({'col1': 0, 'col2': 'unknown'}) | Fill null values for specific columns.
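A minimal runnable sketch of these calls; the SparkSession, column names, and sample data are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("na-demo").getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("b", None), (None, 3)],
    ["col1", "col2"],
)

df.na.drop().show()                                  # keep only fully populated rows
df.na.fill(0).show()                                 # fill numeric nulls with 0
df.na.fill({"col1": "unknown", "col2": 0}).show()    # per-column defaults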
9. Window Functions
Syntax for Window Functions
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank
window_spec = Window.partitionBy("col1").orderBy("col2")
# Using row_number
df.withColumn("row_num", row_number().over(window_spec)).show()
# Using rank
df.withColumn("rank", rank().over(window_spec)).show()
Function | Description
row_number() | Assigns a unique sequential number to each row within a window.
rank() | Ranks rows within a window; ties get the same rank, leaving gaps in the sequence.
dense_rank() | Ranks rows with no gaps in the ranking.
lag() | Accesses a previous row's value within a window.
lead() | Accesses a following row's value within a window.
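The remaining functions follow the same .over(window_spec) pattern as the snippet above; a brief sketch reusing window_spec and the same assumed df:
from pyspark.sql.functions import dense_rank, lag, lead

df.withColumn("dense_rank", dense_rank().over(window_spec)) \
  .withColumn("prev_val", lag("col2", 1).over(window_spec)) \
  .withColumn("next_val", lead("col2", 1).over(window_spec)) \
  .show()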
10. Common PySpark Configuration Settings
Configuration | Description
spark.executor.memory | Allocates memory per executor (e.g., 2g).
spark.driver.memory | Allocates memory for the driver program.
spark.sql.shuffle.partitions | Sets the number of partitions used when shuffling data for joins and aggregations.
spark.executor.cores | Number of cores per executor.
spark.default.parallelism | Default number of partitions in RDDs.
spark.sql.autoBroadcastJoinThreshold | Maximum table size (in bytes) below which a join side is automatically broadcast.
spark.sql.cache.serializer | Serializer to use for cached data.
spark.dynamicAllocation.enabled | Enables dynamic allocation of executors.
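Cluster-level settings such as executor memory and cores must be supplied when the application starts, while many SQL settings can also be changed at runtime. A hedged sketch follows; the values are illustrative, not tuning recommendations.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("config-demo")                                      # placeholder name
    .config("spark.executor.memory", "2g")                       # memory per executor
    .config("spark.executor.cores", "2")                         # cores per executor
    .config("spark.sql.shuffle.partitions", "200")               # shuffle partition count
    .config("spark.sql.autoBroadcastJoinThreshold", "10485760")  # 10 MB broadcast threshold
    .getOrCreate()
)

# SQL settings can be inspected and adjusted at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))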
This cheat sheet offers a high-level overview of PySpark essentials, covering core concepts, commands, and common operations. It serves as a quick reference for PySpark development and optimizations.
Let me know if you need more specific examples or deeper explanations on any section!