Filtering Data
Filtering data is a crucial step in data analysis and manipulation. Both Pandas and PySpark offer a variety of methods and functions for filtering DataFrames. Below are detailed examples of the most commonly used options in each library.
Filtering Data in Pandas
1. Filtering with a Single Condition:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
pdf = pd.DataFrame(data)
# Filtering with a single condition
filtered_pdf = pdf[pdf['Age'] > 30]
print(filtered_pdf)
2. Filtering with Multiple Conditions:
# Filtering with multiple conditions combined with & (AND)
filtered_pdf = pdf[(pdf['Age'] > 25) & (pdf['City'] == 'Los Angeles')]
print(filtered_pdf)
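The same boolean operators extend to OR and NOT: | combines conditions with OR and ~ negates a condition, with each condition kept in parentheses. A quick sketch on the same pdf DataFrame:
# OR: rows matching either condition
filtered_pdf = pdf[(pdf['Age'] > 30) | (pdf['City'] == 'New York')]
# NOT: rows that do not match the condition
filtered_pdf = pdf[~(pdf['City'] == 'Chicago')]
print(filtered_pdf)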
3. Using the query() Method:
# Filtering using query method
filtered_pdf = pdf.query('Age > 25 and City == "Los Angeles"')
print(filtered_pdf)
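query() can also reference Python variables with the @ prefix. A small sketch using a hypothetical min_age variable:
# Referencing a local variable inside query with @
min_age = 25
filtered_pdf = pdf.query('Age > @min_age')
print(filtered_pdf)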
4. Filtering with isin():
# Filtering using isin method
filtered_pdf = pdf[pdf['City'].isin(['New York', 'Chicago'])]
print(filtered_pdf)
5. Filtering with str.contains():
# Filtering using str.contains method for string operations
filtered_pdf = pdf[pdf['City'].str.contains('New')]
print(filtered_pdf)
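str.contains() treats the pattern as a regular expression by default. A short sketch showing case-insensitive matching (case=False) and treating missing values as non-matches (na=False):
# Case-insensitive match; NaN values count as non-matches
filtered_pdf = pdf[pdf['City'].str.contains('new', case=False, na=False)]
print(filtered_pdf)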
6. Filtering with between():
# Filtering using between method
filtered_pdf = pdf[pdf['Age'].between(25, 30)]
print(filtered_pdf)
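between() includes both endpoints by default. In pandas 1.3 and later the inclusive parameter can change that; a sketch assuming such a version:
# Include the lower bound only (pandas >= 1.3)
filtered_pdf = pdf[pdf['Age'].between(25, 35, inclusive='left')]
print(filtered_pdf)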
7. Filtering with loc[]:
# Filtering using loc
filtered_pdf = pdf.loc[pdf['Age'] > 30]
print(filtered_pdf)
8. Filtering with mask():
# mask replaces values where the condition is True with NaN (rows are not dropped)
masked_pdf = pdf.mask(pdf['Age'] > 30)
print(masked_pdf)
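The complement of mask() is where(), which keeps rows where the condition is True and replaces the rest with NaN; chaining dropna() turns it into a row filter. A minimal sketch on the same pdf:
# Keep matching rows, blank out the rest, then drop the blanked rows
filtered_pdf = pdf.where(pdf['Age'] > 30).dropna()
print(filtered_pdf)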
9. Filtering with filter():
# Filtering rows by index label using the filter method
filtered_pdf = pdf.filter(items=[0, 2], axis=0)  # keep the rows labeled 0 and 2
print(filtered_pdf)
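filter() can also select columns by name with the like and regex arguments (for a DataFrame, the default axis is the columns). A brief sketch:
# Columns whose name contains 'Na'
filtered_cols = pdf.filter(like='Na', axis=1)
# Columns whose name matches a regular expression
filtered_cols = pdf.filter(regex='^C', axis=1)
print(filtered_cols)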
Filtering Data in PySpark
1. Filtering with a Single Condition:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
data = [('Alice', 25, 'New York'), ('Bob', 30, 'Los Angeles'), ('Charlie', 35, 'Chicago')]
columns = ['Name', 'Age', 'City']
sdf = spark.createDataFrame(data, columns)
# Filtering with a single condition
filtered_sdf = sdf.filter(sdf['Age'] > 30)
filtered_sdf.show()
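An equivalent, commonly used form builds the condition with the col() function; a quick sketch:
from pyspark.sql.functions import col
# Same filter expressed with col()
filtered_sdf = sdf.filter(col('Age') > 30)
filtered_sdf.show()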
2. Filtering with Multiple Conditions:
# Filtering with multiple conditions combined with & (AND)
filtered_sdf = sdf.filter((sdf['Age'] > 25) & (sdf['City'] == 'Los Angeles'))
filtered_sdf.show()
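As in Pandas, | expresses OR and ~ expresses NOT, with each condition parenthesized. A short sketch:
# OR: rows matching either condition
filtered_sdf = sdf.filter((sdf['Age'] > 30) | (sdf['City'] == 'New York'))
# NOT: rows that do not match the condition
filtered_sdf = sdf.filter(~(sdf['City'] == 'Chicago'))
filtered_sdf.show()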
3. Using the where() Method:
# Filtering using where method
filtered_sdf = sdf.where(sdf['Age'] > 30)
filtered_sdf.show()
4. Filtering with isin():
# Filtering using isin method
filtered_sdf = sdf.filter(sdf['City'].isin(['New York', 'Chicago']))
filtered_sdf.show()
5. Filtering with contains():
# Filtering using contains method for string operations
filtered_sdf = sdf.filter(sdf['City'].contains('New'))
filtered_sdf.show()
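Related column methods cover other string matches: like() for SQL wildcards, rlike() for regular expressions, and startswith()/endswith() for prefixes and suffixes. A brief sketch:
# SQL-style wildcard match
filtered_sdf = sdf.filter(sdf['City'].like('New%'))
# Regular-expression match
filtered_sdf = sdf.filter(sdf['City'].rlike('^New'))
# Prefix match
filtered_sdf = sdf.filter(sdf['City'].startswith('New'))
filtered_sdf.show()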
6. Filtering with between():
# Filtering using between method
filtered_sdf = sdf.filter(sdf['Age'].between(25, 30))
filtered_sdf.show()
7. Filtering with SQL Expression:
# Register the DataFrame as a SQL temporary view
sdf.createOrReplaceTempView("people")
# SQL query
filtered_sdf = spark.sql("SELECT * FROM people WHERE Age > 30")
filtered_sdf.show()
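filter() (and where()) also accept a SQL expression string directly, without registering a temporary view; a minimal sketch:
# SQL expression string passed straight to filter
filtered_sdf = sdf.filter("Age > 30 AND City = 'Chicago'")
filtered_sdf.show()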
8. Using selectExpr() with a Conditional Expression:
# Selecting columns with selectExpr, then filtering with a SQL expression string
filtered_sdf = sdf.selectExpr("Name", "Age", "City").filter("Age > 30")
filtered_sdf.show()
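The same SQL expression syntax can be wrapped in expr() from pyspark.sql.functions when a Column object is needed; a short sketch:
from pyspark.sql.functions import expr
# SQL expression wrapped in expr()
filtered_sdf = sdf.filter(expr("Age BETWEEN 25 AND 30"))
filtered_sdf.show()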
Summary
- Pandas:
  - Single Condition: pdf[pdf['column'] condition]
  - Multiple Conditions: pdf[(pdf['col1'] condition) & (pdf['col2'] condition)]
  - Using the query() Method: pdf.query('condition')
  - Using isin(): pdf[pdf['column'].isin(['value1', 'value2'])]
  - Using str.contains(): pdf[pdf['column'].str.contains('pattern')]
  - Using between(): pdf[pdf['column'].between(start, end)]
  - Using loc[]: pdf.loc[pdf['column'] condition]
  - Using mask(): pdf.mask(pdf['column'] condition)
  - Using filter(): pdf.filter(items=[index1, index2], axis=0)
- PySpark:
  - Single Condition: sdf.filter(sdf['column'] condition)
  - Multiple Conditions: sdf.filter((sdf['col1'] condition) & (sdf['col2'] condition))
  - Using the where() Method: sdf.where(sdf['column'] condition)
  - Using isin(): sdf.filter(sdf['column'].isin(['value1', 'value2']))
  - Using contains(): sdf.filter(sdf['column'].contains('pattern'))
  - Using between(): sdf.filter(sdf['column'].between(start, end))
  - Using a SQL Expression: spark.sql("SELECT * FROM view WHERE condition")
  - Using selectExpr() with a Conditional Expression: sdf.selectExpr("col1", "col2").filter("condition")