Here’s a PySpark cheat sheet designed for a quick yet comprehensive revision of PySpark concepts, architecture, optimizations, commands, and common operations. This guide is structured to cover the essentials from architecture to data processing and Spark SQL.
1. PySpark Architecture Basics
Component | Description
Driver Program | Main program; responsible for creating the SparkContext, defining transformations, and coordinating actions.
SparkContext | Core connection to the Spark cluster; the entry point for creating RDDs (DataFrames are created through the SparkSession built on top of it).
Cluster Manager | Manages resources across nodes (e.g., YARN, Mesos, or Standalone mode).
Executors | Processes on worker nodes that execute tasks, store data, and return results.
Tasks | Individual units of work sent to executors by the driver program.
Job | Each action in Spark triggers a job, consisting of multiple stages and tasks.
Stage | Jobs are broken into stages at shuffle boundaries.
DAG Scheduler | Builds a DAG (Directed Acyclic Graph) of stages for each job and submits them as task sets for execution.
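The sketch below shows how these pieces come together in code: the driver creates a SparkSession (which wraps the SparkContext), and calling an action triggers a job that the DAG scheduler splits into stages and tasks. The application name and local master URL are placeholders.
from pyspark.sql import SparkSession

# The driver program creates a SparkSession, the modern entry point that
# wraps the SparkContext. "local[*]" runs Spark locally on all cores;
# on a cluster, point the master at YARN, Mesos, or a standalone master.
spark = (
    SparkSession.builder
    .appName("cheat-sheet-demo")   # placeholder application name
    .master("local[*]")
    .getOrCreate()
)

sc = spark.sparkContext            # the underlying SparkContext

# Calling an action (count) triggers a job, which is broken into stages
# at shuffle boundaries and executed as tasks on the executors.
df = spark.range(10)
print(df.count())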
Optimization | Technique / Command | Description
Use Broadcast Joins | broadcast(small_df) in a join | Optimize joins for small tables using a broadcast join.
Avoid Using Too Many Partitions | spark.sql.shuffle.partitions = 200 | Set a reasonable number of partitions for shuffling.
Partitioning on Keys | df.write.partitionBy("col").parquet("path") | Partition data by column to improve reading efficiency.
Avoid Data Skew | Use salting or avoid uneven distribution in join columns. | Prevent performance issues caused by skewed data.
Filter Early | Apply filters early to reduce data size for downstream ops. | Reduces data processing and memory usage.
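A short sketch tying these tips together. It assumes an existing SparkSession named spark and two illustrative DataFrames, orders (large) and countries (small); the column names and output path are placeholders, not part of the original cheat sheet.
from pyspark.sql.functions import broadcast, col

# Hint that the small dimension table should be broadcast to every executor,
# avoiding a full shuffle of the large table during the join.
joined = orders.join(broadcast(countries), on="country_code")

# Filter early so downstream operations see less data.
recent = joined.filter(col("order_date") >= "2024-01-01")

# Keep the shuffle partition count reasonable for the data volume.
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Partition the output by a commonly filtered column to speed up later reads.
recent.write.partitionBy("country_code").parquet("/tmp/orders_by_country")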
8. Handling Missing Data
Operation | Command | Description
Drop Missing Values | df.na.drop() | Drop rows containing any null or NaN values.
Fill Missing Values | df.na.fill(value) | Fill all null values with a specific value.
Fill Specific Columns | df.na.fill({'col1': 0, 'col2': 'unknown'}) | Fill null values for specific columns.
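A minimal runnable sketch of these calls; the SparkSession, column names, and sample data are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("na-demo").getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("b", None), (None, 3)],
    ["col1", "col2"],
)

df.na.drop().show()                                  # keep only fully populated rows
df.na.fill(0).show()                                 # fill numeric nulls with 0
df.na.fill({"col1": "unknown", "col2": 0}).show()    # per-column defaults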
9. Window Functions
Syntax for Window Functions
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank
window_spec = Window.partitionBy("col1").orderBy("col2")
# Using row_number
df.withColumn("row_num", row_number().over(window_spec)).show()
# Using rank
df.withColumn("rank", rank().over(window_spec)).show()
Function | Description
row_number() | Assigns a unique sequential number to each row within a window.
rank() | Ranks rows within a window; ties get the same rank, leaving gaps in the sequence.
dense_rank() | Ranks rows with no gaps in the ranking.
lag() | Accesses a previous row's value within a window.
lead() | Accesses a following row's value within a window.
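The remaining functions follow the same .over(window_spec) pattern as the snippet above; a brief sketch reusing window_spec and the same assumed df:
from pyspark.sql.functions import dense_rank, lag, lead

df.withColumn("dense_rank", dense_rank().over(window_spec)) \
  .withColumn("prev_val", lag("col2", 1).over(window_spec)) \
  .withColumn("next_val", lead("col2", 1).over(window_spec)) \
  .show()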
10. Common PySpark Configuration Settings
Configuration | Description
spark.executor.memory | Allocates memory per executor (e.g., 2g).
spark.driver.memory | Allocates memory for the driver program.
spark.sql.shuffle.partitions | Sets the number of partitions used when shuffling data for joins and aggregations.
spark.executor.cores | Number of cores per executor.
spark.default.parallelism | Default number of partitions in RDDs.
spark.sql.autoBroadcastJoinThreshold | Maximum table size (in bytes) below which a join side is automatically broadcast.
spark.sql.cache.serializer | Serializer to use for cached data.
spark.dynamicAllocation.enabled | Enables dynamic allocation of executors.
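Cluster-level settings such as executor memory and cores must be supplied when the application starts, while many SQL settings can also be changed at runtime. A hedged sketch follows; the values are illustrative, not tuning recommendations.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("config-demo")                                      # placeholder name
    .config("spark.executor.memory", "2g")                       # memory per executor
    .config("spark.executor.cores", "2")                         # cores per executor
    .config("spark.sql.shuffle.partitions", "200")               # shuffle partition count
    .config("spark.sql.autoBroadcastJoinThreshold", "10485760")  # 10 MB broadcast threshold
    .getOrCreate()
)

# SQL settings can be inspected and adjusted at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))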
This cheat sheet offers a high-level overview of PySpark essentials, covering core concepts, commands, and common operations. It serves as a quick reference for PySpark development and optimizations.
Let me know if you need more specific examples or deeper explanations on any section!