Joins, Union, Intersect, and Except in Pandas and PySpark
Below are explanations and examples of these operations in both Pandas and PySpark.
1. Joins
Pandas
import pandas as pd
# Example DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [85, 90, 95]})
# Inner Join
inner_join = pd.merge(df1, df2, on="id", how="inner")
print("Inner Join:\n", inner_join)
# Left Join
left_join = pd.merge(df1, df2, on="id", how="left")
print("Left Join:\n", left_join)
# Right Join
right_join = pd.merge(df1, df2, on="id", how="right")
print("Right Join:\n", right_join)
# Outer Join
outer_join = pd.merge(df1, df2, on="id", how="outer")
print("Outer Join:\n", outer_join)
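The examples above assume the key column has the same name in both DataFrames. When the names differ, `merge` accepts `left_on`/`right_on` instead of `on`. A minimal sketch, using hypothetical column names `cust_id` and `customer`:

```python
import pandas as pd

# Hypothetical frames whose key columns have different names
customers = pd.DataFrame({'cust_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'customer': [2, 3, 4], 'score': [85, 90, 95]})

# left_on/right_on join on differently named keys; note that
# both key columns are kept in the result
joined = pd.merge(customers, orders, left_on='cust_id', right_on='customer', how='inner')
print(joined)
```

Because the keys are not merged into a single column, you may want to drop one of them afterwards.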
PySpark
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder.appName("Joins").getOrCreate()
# Example DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
df2 = spark.createDataFrame([(2, 85), (3, 90), (4, 95)], ["id", "score"])
# Inner Join
inner_join = df1.join(df2, on="id", how="inner")
inner_join.show()
# Left Join
left_join = df1.join(df2, on="id", how="left")
left_join.show()
# Right Join
right_join = df1.join(df2, on="id", how="right")
right_join.show()
# Outer Join
outer_join = df1.join(df2, on="id", how="outer")
outer_join.show()
2. Union
Pandas
# Example DataFrames
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [3, 4], 'name': ['Charlie', 'David']})
# Union
union_df = pd.concat([df1, df2]).reset_index(drop=True)
print("Union:\n", union_df)
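Note that `pd.concat` behaves like SQL `UNION ALL`: duplicate rows survive. If you want SQL `UNION` semantics (distinct rows), chain `drop_duplicates()`. A sketch with overlapping data:

```python
import pandas as pd

# df2 repeats one of df1's rows to show the difference
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [2, 3], 'name': ['Bob', 'Charlie']})

# UNION ALL: the duplicate (2, 'Bob') row appears twice
union_all = pd.concat([df1, df2], ignore_index=True)

# UNION: drop_duplicates() collapses identical rows
union_distinct = union_all.drop_duplicates().reset_index(drop=True)
print(union_distinct)
```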
PySpark
# Example DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(3, "Charlie"), (4, "David")], ["id", "name"])
# Union
union_df = df1.union(df2)
union_df.show()
3. Intersect
Pandas
Pandas has no direct intersection method, but the same result can be achieved with an inner merge on all common columns.
# Example DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'name': ['Bob', 'Charlie', 'David']})
# Intersection
intersect_df = pd.merge(df1, df2)
print("Intersect:\n", intersect_df)
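One caveat with the merge-based approach: if either frame contains duplicate rows, the merge multiplies them, so the result is not a true set intersection. A sketch of the issue and a `drop_duplicates()` fix:

```python
import pandas as pd

# df1 contains a duplicated row to show how the merge multiplies matches
df1 = pd.DataFrame({'id': [2, 2, 3], 'name': ['Bob', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3], 'name': ['Bob', 'Charlie']})

# Raw merge keeps both copies of (2, 'Bob')
raw = pd.merge(df1, df2)

# drop_duplicates() restores set semantics
intersect_df = raw.drop_duplicates().reset_index(drop=True)
print(intersect_df)
```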
PySpark
# Example DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
df2 = spark.createDataFrame([(2, "Bob"), (3, "Charlie"), (4, "David")], ["id", "name"])
# Intersection
intersect_df = df1.intersect(df2)
intersect_df.show()
4. Except
Pandas
To find rows that appear in one DataFrame but not in another:
# Example DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'name': ['Bob', 'Charlie', 'David']})
# Except (Difference)
except_df = df1.merge(df2, how='outer', indicator=True).query('_merge == "left_only"').drop('_merge', axis=1)
print("Except:\n", except_df)
PySpark
# Example DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
df2 = spark.createDataFrame([(2, "Bob"), (3, "Charlie"), (4, "David")], ["id", "name"])
# Except
except_df = df1.exceptAll(df2)
except_df.show()
Key Differences Between Pandas and PySpark
| Feature | Pandas | PySpark |
|---|---|---|
| Scale | Best for small, in-memory datasets | Scales to big data (distributed) |
| Performance | Single-machine, largely single-threaded | Distributed, optimized for large datasets |
| Ease of Use | Simple, Pythonic syntax | Requires a SparkSession and more setup |
| Direct Methods | No built-in intersect/except; composed from `merge` | Direct `intersect` and `exceptAll` methods |
These examples demonstrate how to perform the same operations in Pandas and PySpark, highlighting their syntax and best use cases.