Joins, Union, Intersect, and Except in Pandas and PySpark
Below are explanations and examples of these operations in both Pandas and PySpark.
1. Joins
Pandas
import pandas as pd
# Example DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [85, 90, 95]})
# Inner Join
inner_join = pd.merge(df1, df2, on="id", how="inner")
print("Inner Join:\n", inner_join)
# Left Join
left_join = pd.merge(df1, df2, on="id", how="left")
print("Left Join:\n", left_join)
# Right Join
right_join = pd.merge(df1, df2, on="id", how="right")
print("Right Join:\n", right_join)
# Outer Join
outer_join = pd.merge(df1, df2, on="id", how="outer")
print("Outer Join:\n", outer_join)
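The examples above assume the key column has the same name in both DataFrames. When the names differ, `merge` accepts `left_on`/`right_on` instead of `on`. A minimal sketch, using hypothetical column names `cust_id` and `customer`:

```python
import pandas as pd

# Hypothetical frames whose key columns have different names
customers = pd.DataFrame({'cust_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'customer': [2, 3, 4], 'score': [85, 90, 95]})

# left_on/right_on join on differently named keys; note that
# both key columns are kept in the result
joined = pd.merge(customers, orders, left_on='cust_id', right_on='customer', how='inner')
print(joined)
```

Because the keys are not merged into a single column, you may want to drop one of them afterwards.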
PySpark
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder.appName("Joins").getOrCreate()
# Example DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
df2 = spark.createDataFrame([(2, 85), (3, 90), (4, 95)], ["id", "score"])
# Inner Join
inner_join = df1.join(df2, on="id", how="inner")
inner_join.show()
# Left Join
left_join = df1.join(df2, on="id", how="left")
left_join.show()
# Right Join
right_join = df1.join(df2, on="id", how="right")
right_join.show()
# Outer Join
outer_join = df1.join(df2, on="id", how="outer")
outer_join.show()
2. Union
Pandas
# Example DataFrames
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [3, 4], 'name': ['Charlie', 'David']})
# Union
union_df = pd.concat([df1, df2]).reset_index(drop=True)
print("Union:\n", union_df)
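Note that `pd.concat` behaves like SQL `UNION ALL`: duplicate rows survive. If you want SQL `UNION` semantics (distinct rows), chain `drop_duplicates()`. A sketch with overlapping data:

```python
import pandas as pd

# df2 repeats one of df1's rows to show the difference
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [2, 3], 'name': ['Bob', 'Charlie']})

# UNION ALL: the duplicate (2, 'Bob') row appears twice
union_all = pd.concat([df1, df2], ignore_index=True)

# UNION: drop_duplicates() collapses identical rows
union_distinct = union_all.drop_duplicates().reset_index(drop=True)
print(union_distinct)
```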
PySpark
# Example DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(3, "Charlie"), (4, "David")], ["id", "name"])
# Union
union_df = df1.union(df2)
union_df.show()
3. Intersect
Pandas
Pandas has no direct intersection method, but the same result can be achieved with an inner merge on all common columns.
# Example DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'name': ['Bob', 'Charlie', 'David']})
# Intersection
intersect_df = pd.merge(df1, df2)
print("Intersect:\n", intersect_df)
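One caveat with the merge-based approach: if either frame contains duplicate rows, the merge multiplies them, so the result is not a true set intersection. A sketch of the issue and a `drop_duplicates()` fix:

```python
import pandas as pd

# df1 contains a duplicated row to show how the merge multiplies matches
df1 = pd.DataFrame({'id': [2, 2, 3], 'name': ['Bob', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3], 'name': ['Bob', 'Charlie']})

# Raw merge keeps both copies of (2, 'Bob')
raw = pd.merge(df1, df2)

# drop_duplicates() restores set semantics
intersect_df = raw.drop_duplicates().reset_index(drop=True)
print(intersect_df)
```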
PySpark
# Example DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
df2 = spark.createDataFrame([(2, "Bob"), (3, "Charlie"), (4, "David")], ["id", "name"])
# Intersection
intersect_df = df1.intersect(df2)
intersect_df.show()
4. Except
Pandas
To find rows that appear in one DataFrame but not in another:
# Example DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'name': ['Bob', 'Charlie', 'David']})
# Except (Difference)
except_df = df1.merge(df2, how='outer', indicator=True).query('_merge == "left_only"').drop('_merge', axis=1)
print("Except:\n", except_df)
PySpark
# Example DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
df2 = spark.createDataFrame([(2, "Bob"), (3, "Charlie"), (4, "David")], ["id", "name"])
# Except
except_df = df1.exceptAll(df2)
except_df.show()
Key Differences Between Pandas and PySpark
| Feature | Pandas | PySpark |
|---|---|---|
| Scale | Best for small, in-memory datasets | Scales to big data (distributed) |
| Performance | Single-machine, largely single-threaded | Distributed, optimized for large datasets |
| Ease of Use | Simple, Pythonic syntax | Requires a SparkSession and more setup |
| Direct Methods | No built-in intersect/except; composed from `merge` | Direct `intersect` and `exceptAll` methods |
These examples demonstrate how to perform the same operations in Pandas and PySpark, highlighting their syntax and best use cases.