Selecting Columns & Other Data Transformation Operations
Selecting columns is a fundamental operation in data manipulation. Both Pandas and PySpark provide a variety of ways to select columns. Below, I detail the most common options and functions for selecting columns in both libraries.
Selecting Columns in Pandas
1. Selecting a Single Column:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
pdf = pd.DataFrame(data)
# Selecting a single column
print(pdf['Name'])
print(pdf.Name) # Attribute-style alternative; works only when the column name is a valid Python identifier
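Note that single-bracket selection returns a pandas Series, while double brackets return a one-column DataFrame. A minimal sketch of the difference:
# Single brackets -> Series; double brackets -> DataFrame
print(type(pdf['Name']))    # <class 'pandas.core.series.Series'>
print(type(pdf[['Name']]))  # <class 'pandas.core.frame.DataFrame'>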
2. Selecting Multiple Columns:
# Selecting multiple columns
print(pdf[['Name', 'Age']])
3. Using loc[]:
# Selecting columns using loc
print(pdf.loc[:, ['Name', 'Age']])
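loc[] also accepts label-based slices, which is handy for selecting a contiguous range of columns. Unlike positional slicing, the end label is inclusive:
# Label-based column slice; the end label 'Age' is included
print(pdf.loc[:, 'Name':'Age'])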
4. Using iloc[]:
# Selecting columns by index using iloc
print(pdf.iloc[:, [0, 1]]) # First two columns
5. Using Column Index:
# Selecting columns by index
print(pdf.iloc[:, [0, 2]]) # First and third columns
6. Selecting Columns Based on Data Type:
# Selecting columns based on data type
print(pdf.select_dtypes(include=['int64']))
print(pdf.select_dtypes(include=['object']))
7. Using Filter Method:
# Using filter method to select columns
print(pdf.filter(['Name', 'City']))
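filter() can also match column names by substring or regular expression via its like and regex parameters:
# Select columns whose names contain 'am' or start with 'C'
print(pdf.filter(like='am'))   # matches 'Name'
print(pdf.filter(regex='^C'))  # matches 'City'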
Selecting Columns in PySpark
1. Selecting a Single Column:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
data = [('Alice', 25, 'New York'), ('Bob', 30, 'Los Angeles'), ('Charlie', 35, 'Chicago')]
columns = ['Name', 'Age', 'City']
sdf = spark.createDataFrame(data, columns)
# Selecting a single column
sdf.select('Name').show()
2. Selecting Multiple Columns:
# Selecting multiple columns
sdf.select('Name', 'Age').show()
3. Using selectExpr:
# Using selectExpr to select columns
sdf.selectExpr('Name', 'Age').show()
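selectExpr is most useful when you want SQL expressions rather than plain column names, for example computing a derived column inline (the alias AgeNextYear below is just an illustrative name):
# SQL expressions with aliases inside selectExpr
sdf.selectExpr('Name', 'Age + 1 AS AgeNextYear').show()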
4. Using Column Object:
from pyspark.sql.functions import col
# Selecting columns using col function
sdf.select(col('Name'), col('Age')).show()
5. Using Column Index:
# Selecting columns by position via sdf.columns (first and third)
selected_columns = sdf.select([sdf.columns[i] for i in [0, 2]])
selected_columns.show()
6. Selecting Columns Based on Condition:
# Selecting columns whose names satisfy a condition (example: starting with 'A')
# The loop variable is named c to avoid shadowing the imported col function
selected_columns = sdf.select([c for c in sdf.columns if c.startswith('A')])
selected_columns.show()
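There is no direct PySpark equivalent of select_dtypes, but a similar effect can be achieved from df.dtypes, which returns (name, type-string) pairs. A minimal sketch:
# Select only string-typed columns, mimicking pandas select_dtypes
string_cols = [c for c, t in sdf.dtypes if t == 'string']
sdf.select(string_cols).show()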
7. Dropping Columns:
# Dropping columns and selecting the rest
sdf.drop('Age').show()
Summary of Column Selection Options
- Pandas:
  - Single Column: pdf['column'], pdf.column
  - Multiple Columns: pdf[['col1', 'col2']]
  - Using loc[]: pdf.loc[:, ['col1', 'col2']]
  - Using iloc[] / Column Index: pdf.iloc[:, [index1, index2]]
  - Data Type: pdf.select_dtypes(include=['dtype'])
  - Filter: pdf.filter(['col1', 'col2'])
- PySpark:
  - Single Column: sdf.select('column')
  - Multiple Columns: sdf.select('col1', 'col2')
  - Using selectExpr: sdf.selectExpr('col1', 'col2')
  - Using Column Object: sdf.select(col('col1'), col('col2'))
  - Column Index: sdf.select([sdf.columns[index1], sdf.columns[index2]])
  - Condition: sdf.select([c for c in sdf.columns if condition])
  - Dropping Columns: sdf.drop('column')
Other Data Transformation Operations

| Operation | Pandas | PySpark |
|---|---|---|
| Rename Columns | df.rename(columns={'old_name': 'new_name'}, inplace=True) | df = df.withColumnRenamed("old_name", "new_name") |
| Drop Columns | df.drop(['col1', 'col2'], axis=1, inplace=True) | df = df.drop("col1", "col2") |
| Collect Column as List | df['column_name'].tolist() | df.select(collect_list("column_name")).collect()[0][0] |
| List String Type Columns | [col for col in df.columns if df[col].dtype == 'object'] | [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)] |
| List Numeric Type Columns | [col for col in df.columns if pd.api.types.is_numeric_dtype(df[col])] | [f.name for f in df.schema.fields if isinstance(f.dataType, (IntegerType, LongType, DoubleType))] |
| Handle Missing Values | df.fillna({'col1': 0, 'col2': 'unknown'}, inplace=True) | df = df.fillna({"col1": 0, "col2": "unknown"}) |
| Data Type Conversion | df['col'] = df['col'].astype('int') | df = df.withColumn("col", df["col"].cast("int")) |
Detailed Examples
1. Rename Columns
Pandas:
import pandas as pd
df = pd.DataFrame({'old_name': [1, 2, 3]})
df.rename(columns={'old_name': 'new_name'}, inplace=True)
print(df)
PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["old_name"])
df = df.withColumnRenamed("old_name", "new_name")
df.show()
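To rename several columns at once, pandas rename accepts a dict with multiple entries in one call, while in PySpark you can chain withColumnRenamed calls or replace all names positionally with toDF. A minimal sketch continuing from the frame above (the new names are illustrative):
# Chain renames, or replace every column name positionally with toDF
df = df.withColumnRenamed("new_name", "value")
df = df.toDF("value_renamed")  # supplies one new name per existing column
df.show()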
2. Drop Columns
Pandas:
df.drop(['col1', 'col2'], axis=1, inplace=True)
PySpark:
df = df.drop("col1", "col2")
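Unlike the other examples, the snippets above are not self-contained; a minimal runnable sketch (with hypothetical columns col1, col2, col3) could look like this:
df = pd.DataFrame({'col1': [1], 'col2': [2], 'col3': [3]})
df.drop(['col1', 'col2'], axis=1, inplace=True)
print(df)  # only col3 remains

df = spark.createDataFrame([(1, 2, 3)], ["col1", "col2", "col3"])
df = df.drop("col1", "col2")
df.show()  # only col3 remains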
3. Collect Columns as a List
Pandas:
df = pd.DataFrame({'column_name': ['Alice', 'Bob', 'Charlie']})
result = " ".join(df['column_name'])
print(result) # Output: "Alice Bob Charlie"
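If you want an actual Python list rather than a joined string (matching the PySpark output below), tolist() does that directly:
# Collect the column values as a Python list
result = df['column_name'].tolist()
print(result) # Output: ['Alice', 'Bob', 'Charlie']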
PySpark:
from pyspark.sql.functions import collect_list
df = spark.createDataFrame([("Alice",), ("Bob",), ("Charlie",)], ["column_name"]) result = df.select(collect_list("column_name")).collect()[0][0]
print(result) # Output: ['Alice', 'Bob', 'Charlie']
4. List String Type Columns
Pandas:
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
string_columns = [col for col in df.columns if df[col].dtype == 'object']
print(string_columns) # Output: ['name']
PySpark:
from pyspark.sql.types import StringType
df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])
string_columns = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
print(string_columns) # Output: ['name']
5. List Numeric Type Columns
Pandas:
import pandas as pd
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
numeric_columns = [col for col in df.columns if pd.api.types.is_numeric_dtype(df[col])]
print(numeric_columns) # Output: ['age']
PySpark:
from pyspark.sql.types import IntegerType, LongType, DoubleType
df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])
# Python ints are inferred as LongType, so include it in the check
numeric_columns = [f.name for f in df.schema.fields if isinstance(f.dataType, (IntegerType, LongType, DoubleType))]
print(numeric_columns) # Output: ['age']
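Rather than enumerating concrete types, you can also test against the NumericType base class, which covers all integer, floating-point, and decimal types:
from pyspark.sql.types import NumericType
numeric_columns = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
print(numeric_columns) # Output: ['age']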
6. Handle Missing Values
Pandas:
df = pd.DataFrame({'col1': [1, None], 'col2': [None, 'data']})
df.fillna({'col1': 0, 'col2': 'unknown'}, inplace=True)
print(df)
PySpark:
df = spark.createDataFrame([(1, None), (None, "data")], ["col1", "col2"])
df = df.fillna({"col1": 0, "col2": "unknown"})
df.show()
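Both libraries can also drop rows containing missing values instead of filling them; a minimal sketch:
df = pd.DataFrame({'col1': [1, None], 'col2': [None, 'data']})
print(df.dropna())  # every row here has a missing value, so the result is empty

df = spark.createDataFrame([(1, None), (None, "data")], ["col1", "col2"])
df.na.drop().show()  # same idea in PySpark; df.dropna() also works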
7. Data Type Conversion
Pandas:
df = pd.DataFrame({'col': ['1', '2', '3']})
df['col'] = df['col'].astype('int')
print(df)
PySpark:
df = spark.createDataFrame([("1",), ("2",), ("3",)], ["col"])
df = df.withColumn("col", df["col"].cast("int"))
df.show()
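One caveat worth noting: pandas astype('int') raises on unparseable strings, whereas Spark's cast silently yields null for them (with ANSI mode off, the default in older Spark versions). For pandas, pd.to_numeric with errors='coerce' gives similar null-producing behavior:
df = pd.DataFrame({'col': ['1', 'two', '3']})
df['col'] = pd.to_numeric(df['col'], errors='coerce')
print(df)  # 'two' becomes NaN instead of raising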