PySpark Join Operations

In PySpark, join operations allow you to combine two DataFrames based on a specified condition. There are various types of joins, such as inner join, left join, right join, full outer join, semi join, and anti join, each serving specific use cases.


Syntax

DataFrame.join(other, on=None, how=None)
  • other: The DataFrame to join with.
  • on: The column(s) to join on. It can be a column name, a list of column names, or a condition.
  • how: The type of join to perform. Default is inner.

Types of Joins

  • inner: Returns rows with matching keys in both DataFrames.
  • left or left_outer: Returns all rows from the left DataFrame and matching rows from the right.
  • right or right_outer: Returns all rows from the right DataFrame and matching rows from the left.
  • outer, full, or full_outer: Returns all rows from both DataFrames.
  • left_semi: Returns rows from the left DataFrame that have a match in the right DataFrame.
  • left_anti: Returns rows from the left DataFrame that do not have a match in the right DataFrame.

Example DataFrames

data1 = [
    ("Alice", 34, "HR"),
    ("Bob", 45, "IT"),
    ("Catherine", 29, "HR"),
    ("David", 36, "IT")
]
columns1 = ["Name", "Age", "Department"]

data2 = [
    ("HR", 3000),
    ("IT", 4000),
    ("Sales", 2800)
]
columns2 = ["Department", "Salary"]

df1 = spark.createDataFrame(data1, schema=columns1)
df2 = spark.createDataFrame(data2, schema=columns2)

df1.show()
df2.show()

Output: df1:

+---------+---+----------+
|     Name|Age|Department|
+---------+---+----------+
|    Alice| 34|        HR|
|      Bob| 45|        IT|
| Catherine| 29|        HR|
|     David| 36|        IT|
+---------+---+----------+

df2:

+----------+------+
|Department|Salary|
+----------+------+
|        HR|  3000|
|        IT|  4000|
|     Sales|  2800|
+----------+------+

Join Operations Examples

1. Inner Join

Returns rows with matching Department values in both DataFrames:

inner_join_df = df1.join(df2, on="Department", how="inner")
inner_join_df.show()

Output:

+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|    Alice| 34|        HR|  3000|
| Catherine| 29|        HR|  3000|
|      Bob| 45|        IT|  4000|
|     David| 36|        IT|  4000|
+---------+---+----------+------+

2. Left Outer Join

Includes all rows from df1 and matches from df2 where possible:

left_join_df = df1.join(df2, on="Department", how="left")
left_join_df.show()

Output:

+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|    Alice| 34|        HR|  3000|
| Catherine| 29|        HR|  3000|
|      Bob| 45|        IT|  4000|
|     David| 36|        IT|  4000|
+---------+---+----------+------+

3. Right Outer Join

Includes all rows from df2 and matches from df1 where possible:

right_join_df = df1.join(df2, on="Department", how="right")
right_join_df.show()

Output:

+---------+----+----------+------+
|     Name| Age|Department|Salary|
+---------+----+----------+------+
|    Alice|  34|        HR|  3000|
| Catherine|  29|        HR|  3000|
|      Bob|  45|        IT|  4000|
|     David|  36|        IT|  4000|
|     null|null|     Sales|  2800|
+---------+----+----------+------+

4. Full Outer Join

Includes all rows from both DataFrames:

outer_join_df = df1.join(df2, on="Department", how="outer")
outer_join_df.show()

Output:

+---------+----+----------+------+
|     Name| Age|Department|Salary|
+---------+----+----------+------+
|    Alice|  34|        HR|  3000|
| Catherine|  29|        HR|  3000|
|      Bob|  45|        IT|  4000|
|     David|  36|        IT|  4000|
|     null|null|     Sales|  2800|
+---------+----+----------+------+

5. Left Semi Join

Returns rows from df1 where there is a match in df2:

semi_join_df = df1.join(df2, on="Department", how="left_semi")
semi_join_df.show()

Output:

+---------+---+----------+
|     Name|Age|Department|
+---------+---+----------+
|    Alice| 34|        HR|
| Catherine| 29|        HR|
|      Bob| 45|        IT|
|     David| 36|        IT|
+---------+---+----------+

6. Left Anti Join

Returns rows from df1 where there is no match in df2:

anti_join_df = df1.join(df2, on="Department", how="left_anti")
anti_join_df.show()

Output:

+----+---+----------+
|Name|Age|Department|
+----+---+----------+
+----+---+----------+

7. Join on Multiple Columns

Join based on multiple columns:

data3 = [
    ("Alice", 34, "HR", 3000),
    ("Bob", 45, "IT", 4000),
    ("Catherine", 29, "HR", 5000),
    ("David", 36, "IT", 2500)
]
columns3 = ["Name", "Age", "Department", "Salary"]

data4 = [
    ("HR", 34, 3000, "Manager"),
    ("IT", 45, 4000, "Developer")
]
columns4 = ["Department", "Age", "Salary", "Role"]

df3 = spark.createDataFrame(data3, schema=columns3)
df4 = spark.createDataFrame(data4, schema=columns4)

multi_col_join_df = df3.join(df4, on=["Department", "Age", "Salary"], how="inner")
multi_col_join_df.show()

Output:

+-----+---+----------+------+--------+
| Name|Age|Department|Salary|    Role|
+-----+---+----------+------+--------+
|Alice| 34|        HR|  3000| Manager|
|  Bob| 45|        IT|  4000|Developer|
+-----+---+----------+------+--------+

Summary

PySpark provides flexible options for joining DataFrames. Choose the appropriate join type based on your data requirements:

  • Use inner for common data between DataFrames.
  • Use left, right, or outer for inclusive joins.
  • Use semi or anti for filtering rows based on matching conditions.


Discover more from HintsToday

Subscribe to get the latest posts sent to your email.

Pages ( 6 of 8 ): « Previous1 ... 45 6 78Next »

Discover more from HintsToday

Subscribe now to keep reading and get access to the full archive.

Continue reading