PySpark Join Operations
In PySpark, join operations allow you to combine two DataFrames based on a specified condition. There are various types of joins, such as inner join, left join, right join, full outer join, semi join, and anti join, each serving specific use cases.
Syntax
DataFrame.join(other, on=None, how=None)
other: The DataFrame to join with.
on: The column(s) to join on. It can be a column name, a list of column names, or a join expression (Column).
how: The type of join to perform. Default is "inner".
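To illustrate the two main forms of the on parameter, here is a minimal sketch (the SparkSession setup, the app name, and the emp/dept names are illustrative, not part of the original example). With a column name, the key appears once in the result; with an expression, both key columns are kept, so one is usually dropped afterwards.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JoinSyntaxSketch").getOrCreate()

emp = spark.createDataFrame([("Alice", "HR"), ("Eve", "Sales")], ["Name", "Department"])
dept = spark.createDataFrame([("HR", 3000)], ["Department", "Salary"])

# on as a column name: the key column appears once, at the front of the result.
emp.join(dept, on="Department", how="inner").show()

# on as an expression (Column): both Department columns are kept,
# so one is typically dropped after the join.
cond = emp["Department"] == dept["Department"]
emp.join(dept, on=cond, how="inner").drop(dept["Department"]).show()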
Types of Joins
inner: Returns rows with matching keys in both DataFrames.
left or left_outer: Returns all rows from the left DataFrame and the matching rows from the right.
right or right_outer: Returns all rows from the right DataFrame and the matching rows from the left.
outer, full, or full_outer: Returns all rows from both DataFrames, with nulls where there is no match.
left_semi: Returns rows from the left DataFrame that have a match in the right DataFrame (left columns only).
left_anti: Returns rows from the left DataFrame that do not have a match in the right DataFrame.
Example DataFrames
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("JoinExamples").getOrCreate()

data1 = [
    ("Alice", 34, "HR"),
    ("Bob", 45, "IT"),
    ("Catherine", 29, "HR"),
    ("David", 36, "IT")
]
columns1 = ["Name", "Age", "Department"]

data2 = [
    ("HR", 3000),
    ("IT", 4000),
    ("Sales", 2800)
]
columns2 = ["Department", "Salary"]

df1 = spark.createDataFrame(data1, schema=columns1)
df2 = spark.createDataFrame(data2, schema=columns2)
df1.show()
df2.show()
Output for df1:
+---------+---+----------+
|     Name|Age|Department|
+---------+---+----------+
|    Alice| 34|        HR|
|      Bob| 45|        IT|
|Catherine| 29|        HR|
|    David| 36|        IT|
+---------+---+----------+
Output for df2:
+----------+------+
|Department|Salary|
+----------+------+
|        HR|  3000|
|        IT|  4000|
|     Sales|  2800|
+----------+------+
Join Operations Examples
1. Inner Join
Returns rows with matching Department values in both DataFrames. Since the join key is passed as a column name, Department appears only once, as the first column of the result:
inner_join_df = df1.join(df2, on="Department", how="inner")
inner_join_df.show()
Output:
+----------+---------+---+------+
|Department|     Name|Age|Salary|
+----------+---------+---+------+
|        HR|    Alice| 34|  3000|
|        HR|Catherine| 29|  3000|
|        IT|      Bob| 45|  4000|
|        IT|    David| 36|  4000|
+----------+---------+---+------+
2. Left Outer Join
Includes all rows from df1 and the matching rows from df2 where possible:
left_join_df = df1.join(df2, on="Department", how="left")
left_join_df.show()
Output:
+----------+---------+---+------+
|Department|     Name|Age|Salary|
+----------+---------+---+------+
|        HR|    Alice| 34|  3000|
|        HR|Catherine| 29|  3000|
|        IT|      Bob| 45|  4000|
|        IT|    David| 36|  4000|
+----------+---------+---+------+
3. Right Outer Join
Includes all rows from df2 and the matching rows from df1 where possible:
right_join_df = df1.join(df2, on="Department", how="right")
right_join_df.show()
Output:
+----------+---------+----+------+
|Department|     Name| Age|Salary|
+----------+---------+----+------+
|        HR|    Alice|  34|  3000|
|        HR|Catherine|  29|  3000|
|        IT|      Bob|  45|  4000|
|        IT|    David|  36|  4000|
|     Sales|     null|null|  2800|
+----------+---------+----+------+
4. Full Outer Join
Includes all rows from both DataFrames:
outer_join_df = df1.join(df2, on="Department", how="outer")
outer_join_df.show()
Output:
+----------+---------+----+------+
|Department|     Name| Age|Salary|
+----------+---------+----+------+
|        HR|    Alice|  34|  3000|
|        HR|Catherine|  29|  3000|
|        IT|      Bob|  45|  4000|
|        IT|    David|  36|  4000|
|     Sales|     null|null|  2800|
+----------+---------+----+------+
5. Left Semi Join
Returns rows from df1 where there is a match in df2; only the columns of df1 are kept:
semi_join_df = df1.join(df2, on="Department", how="left_semi")
semi_join_df.show()
Output:
+----------+---------+---+
|Department|     Name|Age|
+----------+---------+---+
|        HR|    Alice| 34|
|        HR|Catherine| 29|
|        IT|      Bob| 45|
|        IT|    David| 36|
+----------+---------+---+
6. Left Anti Join
Returns rows from df1 where there is no match in df2. Every Department in df1 also appears in df2, so the result here is empty:
anti_join_df = df1.join(df2, on="Department", how="left_anti")
anti_join_df.show()
Output:
+----------+----+---+
|Department|Name|Age|
+----------+----+---+
+----------+----+---+
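To see the anti join return rows, the right side has to be missing some keys. A minimal sketch, assuming the df1 and df2 defined above (the df2_no_it name is illustrative): filtering the IT department out of df2 leaves Bob and David unmatched.

# Hypothetical variant: drop the IT department from df2 so that
# the IT rows of df1 no longer have a match on the right side.
df2_no_it = df2.filter(df2["Department"] != "IT")
df1.join(df2_no_it, on="Department", how="left_anti").show()
# Expected: the Bob and David rows (row order may vary).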
7. Join on Multiple Columns
Pass a list of column names to join on multiple columns:
data3 = [
    ("Alice", 34, "HR", 3000),
    ("Bob", 45, "IT", 4000),
    ("Catherine", 29, "HR", 5000),
    ("David", 36, "IT", 2500)
]
columns3 = ["Name", "Age", "Department", "Salary"]

data4 = [
    ("HR", 34, 3000, "Manager"),
    ("IT", 45, 4000, "Developer")
]
columns4 = ["Department", "Age", "Salary", "Role"]

df3 = spark.createDataFrame(data3, schema=columns3)
df4 = spark.createDataFrame(data4, schema=columns4)
multi_col_join_df = df3.join(df4, on=["Department", "Age", "Salary"], how="inner")
multi_col_join_df.show()
Output:
+----------+---+------+-----+---------+
|Department|Age|Salary| Name|     Role|
+----------+---+------+-----+---------+
|        HR| 34|  3000|Alice|  Manager|
|        IT| 45|  4000|  Bob|Developer|
+----------+---+------+-----+---------+
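If the key columns had different names in the two DataFrames, a list of column names would not work; the on parameter also accepts a list of equality expressions. A sketch using the df3 and df4 above (the cond and expr_join_df names are illustrative):

# The same three-key join written as a list of equality conditions.
# Unlike the column-name form, both sides keep their key columns,
# so the duplicates may need to be dropped afterwards.
cond = [
    df3["Department"] == df4["Department"],
    df3["Age"] == df4["Age"],
    df3["Salary"] == df4["Salary"],
]
expr_join_df = df3.join(df4, on=cond, how="inner")
expr_join_df.show()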
Summary
PySpark provides flexible options for joining DataFrames. Choose the appropriate join type based on your data requirements:
- Use inner for common data between DataFrames.
- Use left, right, or outer for inclusive joins.
- Use left_semi or left_anti for filtering rows based on matching conditions.
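To see how the join types differ at a glance, a small sketch (assuming the df1 and df2 from the examples above) can compare row counts across every how value:

# Compare result sizes for each join type on the same key.
for how in ["inner", "left", "right", "outer", "left_semi", "left_anti"]:
    n = df1.join(df2, on="Department", how=how).count()
    print(f"{how:>10}: {n} rows")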