Useful Code Snippets in Python and Pyspark

by lochan2014 | Jan 7, 2025 | Pyspark, Python | 0 comments

In PySpark, select(), selectExpr(), and expr() are all used to manipulate and select columns from a DataFrame, but they have different use cases. Let’s break them down with examples.


1️⃣ select()

  • Used to select specific columns, either by raw column names or via column expressions (F.col()).
  • You can also apply transformations using F functions.

🔹 Example:

from pyspark.sql import functions as F

df.select(F.col("name"), F.col("salary") * 2).show()

Best when selecting columns and applying column-based transformations.


2️⃣ selectExpr()

  • Used to apply SQL expressions as strings.
  • Useful when performing operations that would be easier in SQL syntax.

🔹 Example:

df.selectExpr("name", "salary * 2 as double_salary").show()

Best when you want to use SQL-like expressions without F functions.


3️⃣ expr()

  • Converts a SQL expression into a PySpark column object.
  • Can be used inside select(), withColumn(), etc.

🔹 Example:

df.select(F.expr("salary * 2")).show()

It is equivalent to:

df.selectExpr("salary * 2")

Best when embedding SQL expressions inside column operations.


🔥 Key Differences:

| Method | Input Type | SQL Syntax | Column Expressions |
| --- | --- | --- | --- |
| select() | F.col() or raw column names | ❌ No | ✅ Yes |
| selectExpr() | SQL strings | ✅ Yes | ❌ No |
| expr() | SQL string (converted to a column) | ✅ Yes | ✅ Yes |

🔥 When to Use What?

| Scenario | Best Method |
| --- | --- |
| Simple column selection | select("col1", "col2") |
| Column transformations using F functions | select(F.col("salary") * 2) |
| Using SQL-like expressions | selectExpr("salary * 2 as double_salary") |
| Embedding SQL expressions in column operations | expr("salary * 2") |

🚀
