Comparison Between Pandas and PySpark for Data Analysis

Selecting Columns & Other Data Transformation Operations

Selecting columns is a fundamental operation in data manipulation, and both Pandas and PySpark provide a variety of ways to do it. Below, I detail the main options and functions available for selecting columns in each library.

Selecting Columns in Pandas

1. Selecting a Single Column:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

pdf = pd.DataFrame(data)

# Selecting a single column
print(pdf['Name'])
print(pdf.Name)  # Alternative syntax; works only when the column name is a valid Python identifier

2. Selecting Multiple Columns:

# Selecting multiple columns
print(pdf[['Name', 'Age']])

3. Using loc[]:

# Selecting columns using loc
print(pdf.loc[:, ['Name', 'Age']])

4. Using iloc[]:

# Selecting columns by index using iloc
print(pdf.iloc[:, [0, 1]]) # First two columns

5. Using Column Index:

# Selecting columns by index
print(pdf.iloc[:, [0, 2]]) # First and third columns

6. Selecting Columns Based on Data Type:

# Selecting columns based on data type
print(pdf.select_dtypes(include=['int64']))
print(pdf.select_dtypes(include=['object']))
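
select_dtypes also accepts the generic 'number' alias and an exclude= argument, which is often more robust than naming an exact dtype. A small sketch:

# All numeric columns, regardless of the exact integer/float width
print(pdf.select_dtypes(include='number'))

# Everything except numeric columns
print(pdf.select_dtypes(exclude='number'))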

7. Using Filter Method:

# Using filter method to select columns
print(pdf.filter(['Name', 'City']))
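
filter is more flexible than the items list above: it also takes like= and regex= arguments that match on column names. A short sketch (the patterns here are illustrative):

# Columns whose name contains the substring 'am'
print(pdf.filter(like='am'))  # matches 'Name'

# Columns whose name matches a regular expression
print(pdf.filter(regex='^[NC]'))  # matches 'Name' and 'City'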

Selecting Columns in PySpark

1. Selecting a Single Column:

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

data = [('Alice', 25, 'New York'), ('Bob', 30, 'Los Angeles'), ('Charlie', 35, 'Chicago')]
columns = ['Name', 'Age', 'City']

sdf = spark.createDataFrame(data, columns)

# Selecting a single column
sdf.select('Name').show()

2. Selecting Multiple Columns:

# Selecting multiple columns
sdf.select('Name', 'Age').show()

3. Using selectExpr:

# Using selectExpr to select columns
sdf.selectExpr('Name', 'Age').show()

4. Using Column Object:

from pyspark.sql.functions import col

# Selecting columns using col function
sdf.select(col('Name'), col('Age')).show()
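
Beyond col(), PySpark also accepts attribute and bracket access on the DataFrame itself; for simple selections the forms below are equivalent to the col() version:

# Attribute access and bracket access on the DataFrame
sdf.select(sdf.Name, sdf.Age).show()
sdf.select(sdf['Name'], sdf['Age']).show()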

5. Using Column Index:

# Selecting columns by index (using df.columns)
selected_columns = sdf.select([sdf.columns[i] for i in [0, 2]])
selected_columns.show()

6. Selecting Columns Based on Condition:

# Selecting columns whose names meet a condition (example: starting with 'A')
# Note: the loop variable is named c to avoid shadowing the imported col function
selected_columns = sdf.select([c for c in sdf.columns if c.startswith('A')])
selected_columns.show()
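
PySpark has no direct equivalent of pandas' select_dtypes, but df.dtypes, a list of (name, type-string) pairs, supports the same idea. A minimal sketch:

# Select only string-typed columns, analogous to select_dtypes(include=['object'])
string_cols = sdf.select([c for c, t in sdf.dtypes if t == 'string'])
string_cols.show()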

7. Dropping Columns:

# Dropping columns and selecting the rest
sdf.drop('Age').show()

Summary

  • Pandas:
    • Single Column: pdf['column'], pdf.column
    • Multiple Columns: pdf[['col1', 'col2']]
    • Using loc[]: pdf.loc[:, ['col1', 'col2']]
    • Using iloc[]: pdf.iloc[:, [index1, index2]]
    • Column Index: pdf.iloc[:, [index1, index2]]
    • Data Type: pdf.select_dtypes(include=['dtype'])
    • Filter: pdf.filter(['col1', 'col2'])
  • PySpark:
    • Single Column: sdf.select('column')
    • Multiple Columns: sdf.select('col1', 'col2')
    • Using selectExpr: sdf.selectExpr('col1', 'col2')
    • Using Column Object: sdf.select(col('col1'), col('col2'))
    • Column Index: sdf.select([sdf.columns[index1], sdf.columns[index2]])
    • Condition: sdf.select([c for c in sdf.columns if condition])
    • Dropping Columns: sdf.drop('column')

Other Data Transformation Operations

Operation | Pandas | PySpark
Rename Columns | df.rename(columns={'old_name': 'new_name'}, inplace=True) | df = df.withColumnRenamed("old_name", "new_name")
Drop Columns | df.drop(['col1', 'col2'], axis=1, inplace=True) | df = df.drop("col1", "col2")
Collect Columns as List | ", ".join(df['column_name']) | df.selectExpr("collect_list(column_name)").collect()
List String Type Columns | [col for col in df.columns if df[col].dtype == 'object'] | [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
List Numeric Type Columns | [col for col in df.columns if pd.api.types.is_numeric_dtype(df[col])] | [f.name for f in df.schema.fields if isinstance(f.dataType, (IntegerType, DoubleType))]
Handle Missing Values | df.fillna({'col1': 0, 'col2': 'unknown'}, inplace=True) | df = df.fillna({"col1": 0, "col2": "unknown"})
Data Type Conversion | df['col'] = df['col'].astype('int') | df = df.withColumn("col", df["col"].cast("int"))

Detailed Examples

1. Rename Columns

Pandas:

import pandas as pd 
df = pd.DataFrame({'old_name': [1, 2, 3]}) 
df.rename(columns={'old_name': 'new_name'}, inplace=True)
print(df)

PySpark:

from pyspark.sql import SparkSession 
spark = SparkSession.builder.appName("example").getOrCreate() 
df = spark.createDataFrame([(1,), (2,), (3,)], ["old_name"]) 
df = df.withColumnRenamed("old_name", "new_name") 
df.show()

2. Drop Columns

Pandas:

df.drop(['col1', 'col2'], axis=1, inplace=True)

PySpark:

df = df.drop("col1", "col2")
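
A runnable sketch of both (the column names here are illustrative):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Pandas: drop modifies the frame in place because of inplace=True
pdf = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b'], 'col3': [True, False]})
pdf.drop(['col1', 'col2'], axis=1, inplace=True)
print(pdf)  # only col3 remains

# PySpark: drop returns a new DataFrame; the original is unchanged
sdf = spark.createDataFrame([(1, 'a', True), (2, 'b', False)], ['col1', 'col2', 'col3'])
sdf.drop('col1', 'col2').show()  # only col3 remains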

3. Collect Columns as a List

Pandas:

df = pd.DataFrame({'column_name': ['Alice', 'Bob', 'Charlie']}) 
result = " ".join(df['column_name']) 
print(result) # Output: "Alice Bob Charlie"
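
Note that the join above produces a single string rather than a Python list; if an actual list is what you want, tolist() is the more direct pandas equivalent:

result = df['column_name'].tolist()
print(result)  # Output: ['Alice', 'Bob', 'Charlie']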

PySpark:

from pyspark.sql.functions import collect_list 
df = spark.createDataFrame([("Alice",), ("Bob",), ("Charlie",)], ["column_name"]) result = df.select(collect_list("column_name")).collect()[0][0] 
print(result) # Output: ['Alice', 'Bob', 'Charlie']

4. List String Type Columns

Pandas:

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]}) 
string_columns = [col for col in df.columns if df[col].dtype == 'object']
print(string_columns)  # Output: ['name']

PySpark:

from pyspark.sql.types import StringType 
string_columns = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)] 
print(string_columns)

5. List Numeric Type Columns

Pandas:

import pandas as pd
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
numeric_columns = [col for col in df.columns if pd.api.types.is_numeric_dtype(df[col])]
print(numeric_columns) # Output: ['age']

PySpark:

from pyspark.sql.types import IntegerType, DoubleType 
numeric_columns = [f.name for f in df.schema.fields if isinstance(f.dataType, (IntegerType, DoubleType))] 
print(numeric_columns)
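
Checking only IntegerType and DoubleType misses other numeric columns (LongType, FloatType, DecimalType, and so on); in particular, PySpark infers Python ints as LongType, so the check above can come up empty on inferred schemas. PySpark's NumericType base class covers all of them, so a broader sketch is:

from pyspark.sql.types import NumericType

# Matches every numeric column type (integers, floats, decimals, ...)
numeric_columns = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
print(numeric_columns)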

6. Handle Missing Values

Pandas:

df = pd.DataFrame({'col1': [1, None], 'col2': [None, 'data']})
df.fillna({'col1': 0, 'col2': 'unknown'}, inplace=True) 
print(df)

PySpark:

df = spark.createDataFrame([(1, None), (None, "data")], ["col1", "col2"]) 
df = df.fillna({"col1": 0, "col2": "unknown"}) 
df.show()
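
One behavioral note: PySpark's fillna only applies a replacement to columns whose type matches the value, so a single scalar fill silently skips mismatched columns. A quick sketch against the same df:

# 0 is numeric, so only col1 (a numeric column) is filled; col2 keeps its null
df.fillna(0).show()

# "unknown" is a string, so only col2 is filled
df.fillna("unknown").show()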

7. Data Type Conversion

Pandas:

df = pd.DataFrame({'col': ['1', '2', '3']}) 
df['col'] = df['col'].astype('int') 
print(df)

PySpark:

df = spark.createDataFrame([("1",), ("2",), ("3",)], ["col"]) 
df = df.withColumn("col", df["col"].cast("int")) 
df.show()
