In PySpark, select(), selectExpr(), and expr() are all used to select and manipulate columns of a DataFrame, but they have different use cases. Let's break them down with examples.
1️⃣ select()
- Used to select specific columns using column expressions (F.col()).
- You can also apply transformations using F functions.
🔹 Example:
from pyspark.sql import functions as F
df.select(F.col("name"), F.col("salary") * 2).show()
✅ Best when selecting columns and applying column-based transformations.
2️⃣ selectExpr()
- Used to apply SQL expressions as strings.
- Useful when performing operations that would be easier in SQL syntax.
🔹 Example:
df.selectExpr("name", "salary * 2 as double_salary").show()
✅ Best when you want to use SQL-like expressions without F functions.
3️⃣ expr()
- Converts a SQL expression into a PySpark Column object.
- Can be used inside select(), withColumn(), etc.
🔹 Example:
df.select(F.expr("salary * 2")).show()
It is equivalent to:
df.selectExpr("salary * 2")
✅ Best when embedding SQL expressions inside column operations.
🔥 Key Differences:

| Method | Input Type | SQL Syntax | Column Expressions |
|---|---|---|---|
| select() | F.col() or raw column names | ❌ No | ✅ Yes |
| selectExpr() | SQL strings | ✅ Yes | ❌ No |
| expr() | SQL string (converted to a Column) | ✅ Yes | ✅ Yes |
🔥 When to Use What?

| Scenario | Best Method |
|---|---|
| Simple column selection | select("col1", "col2") |
| Column transformations using F functions | select(F.col("salary") * 2) |
| Using SQL-like expressions | selectExpr("salary * 2 as double_salary") |
| Embedding SQL expressions in column operations | expr("salary * 2") |
🚀