Comparison Between Pandas and PySpark for Data Analysis



Joins, Union, Intersect, and Except in Pandas and PySpark

Below are explanations and examples of each of these operations in Pandas and PySpark.


1. Joins

Pandas

import pandas as pd

# Example DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [85, 90, 95]})

# Inner Join
inner_join = pd.merge(df1, df2, on="id", how="inner")
print("Inner Join:\n", inner_join)

# Left Join
left_join = pd.merge(df1, df2, on="id", how="left")
print("Left Join:\n", left_join)

# Right Join
right_join = pd.merge(df1, df2, on="id", how="right")
print("Right Join:\n", right_join)

# Outer Join
outer_join = pd.merge(df1, df2, on="id", how="outer")
print("Outer Join:\n", outer_join)
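One pitfall worth noting: if both DataFrames had a non-key column with the same name, merge would disambiguate the clash with the suffixes _x and _y by default; the suffixes parameter lets you choose clearer names. A minimal sketch (the column names here are illustrative):

```python
import pandas as pd

# Both frames carry a non-key column with the same name ("value")
left = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
right = pd.DataFrame({'id': [1, 2], 'value': [100, 200]})

# Without suffixes=..., pandas would name the clashing columns value_x / value_y
merged = pd.merge(left, right, on="id", suffixes=("_left", "_right"))
print(merged.columns.tolist())  # ['id', 'value_left', 'value_right']
```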

PySpark

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("Joins").getOrCreate()

# Example DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
df2 = spark.createDataFrame([(2, 85), (3, 90), (4, 95)], ["id", "score"])

# Inner Join
inner_join = df1.join(df2, on="id", how="inner")
inner_join.show()

# Left Join
left_join = df1.join(df2, on="id", how="left")
left_join.show()

# Right Join
right_join = df1.join(df2, on="id", how="right")
right_join.show()

# Outer Join
outer_join = df1.join(df2, on="id", how="outer")
outer_join.show()

2. Union

Pandas

# Example DataFrames
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [3, 4], 'name': ['Charlie', 'David']})

# Union
union_df = pd.concat([df1, df2]).reset_index(drop=True)
print("Union:\n", union_df)
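One caveat: pd.concat behaves like SQL's UNION ALL and keeps duplicate rows; chain drop_duplicates() when you want UNION (distinct) semantics. A small sketch with overlapping data (the rows are illustrative):

```python
import pandas as pd

# The (2, 'Bob') row appears in both frames
df_a = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df_b = pd.DataFrame({'id': [2, 3], 'name': ['Bob', 'Charlie']})

# concat alone behaves like UNION ALL: the duplicate row appears twice
union_all = pd.concat([df_a, df_b]).reset_index(drop=True)

# drop_duplicates() gives UNION semantics: each distinct row appears once
union_distinct = pd.concat([df_a, df_b]).drop_duplicates().reset_index(drop=True)

print(len(union_all), len(union_distinct))  # 4 3
```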

PySpark

# Example DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(3, "Charlie"), (4, "David")], ["id", "name"])

# Union
union_df = df1.union(df2)
union_df.show()

3. Intersect

Pandas

Pandas does not have a direct intersect method, but the same result can be achieved with merge using an inner join on all columns.

# Example DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'name': ['Bob', 'Charlie', 'David']})

# Intersection
intersect_df = pd.merge(df1, df2)
print("Intersect:\n", intersect_df)
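A caveat with this approach: merge has multiset semantics, so rows repeated in one DataFrame are repeated in the result; add drop_duplicates() to match the set-style behavior of PySpark's intersect(). An illustrative sketch:

```python
import pandas as pd

# df_a contains a repeated row
df_a = pd.DataFrame({'id': [1, 1, 2], 'name': ['Alice', 'Alice', 'Bob']})
df_b = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})

# merge joins on all shared columns; repeated matches stay repeated
raw = pd.merge(df_a, df_b)

# drop_duplicates() gives a set-style intersection instead
distinct = raw.drop_duplicates().reset_index(drop=True)

print(len(raw), len(distinct))  # 3 2
```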

PySpark

# Example DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
df2 = spark.createDataFrame([(2, "Bob"), (3, "Charlie"), (4, "David")], ["id", "name"])

# Intersection
intersect_df = df1.intersect(df2)
intersect_df.show()

4. Except

Pandas

To find rows that appear in one DataFrame but not in the other, use merge with indicator=True and keep only the left-side rows:

# Example DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'name': ['Bob', 'Charlie', 'David']})

# Except (Difference)
except_df = df1.merge(df2, how='outer', indicator=True).query('_merge == "left_only"').drop('_merge', axis=1)
print("Except:\n", except_df)

PySpark

# Example DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
df2 = spark.createDataFrame([(2, "Bob"), (3, "Charlie"), (4, "David")], ["id", "name"])

# Except
except_df = df1.exceptAll(df2)
except_df.show()

Key Differences Between Pandas and PySpark

Feature        | Pandas                                  | PySpark
Scale          | Best for small, in-memory datasets      | Scales to big data (distributed)
Performance    | Single-threaded, not distributed        | Distributed, optimized for large datasets
Ease of Use    | Simple, Pythonic syntax                 | Requires a SparkSession and more setup
Direct Methods | Limited (no built-in intersect/except)  | Built-in intersect() and exceptAll()

These examples demonstrate how to perform the same operations in Pandas and PySpark, highlighting the syntax differences and the best use cases for each library.
