PySpark orderBy() and sort() Operations
In PySpark, both orderBy() and sort() sort the rows of a DataFrame. orderBy() is an alias for sort(), so the two are functionally identical and can be used interchangeably. Sorting can be applied to one or multiple columns, with each column in ascending or descending order.
Syntax
DataFrame.orderBy(*cols, ascending=True)
DataFrame.sort(*cols, ascending=True)
cols: A list of column names or expressions to sort by.
ascending:
- A boolean (True for ascending, False for descending) that applies to all columns if given as a single value.
- A list of booleans specifying sort order for each column if given as a list.
Example Dataset
data = [
("Alice", 34, "HR", 3000),
("Bob", 45, "IT", 4000),
("Catherine", 29, "HR", 5000),
("David", 36, "IT", 2500),
("Eve", 28, "Sales", 2800)
]
columns = ["Name", "Age", "Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)
df.show()
Output:
+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|    Alice| 34|        HR|  3000|
|      Bob| 45|        IT|  4000|
|Catherine| 29|        HR|  5000|
|    David| 36|        IT|  2500|
|      Eve| 28|     Sales|  2800|
+---------+---+----------+------+
Sorting Examples
1. Sort by Single Column (Ascending)
Sort the DataFrame by the Age column in ascending order:
df.orderBy("Age").show()
# Equivalent:
df.sort("Age").show()
Output:
+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|      Eve| 28|     Sales|  2800|
|Catherine| 29|        HR|  5000|
|    Alice| 34|        HR|  3000|
|    David| 36|        IT|  2500|
|      Bob| 45|        IT|  4000|
+---------+---+----------+------+
2. Sort by Single Column (Descending)
Sort by the Age column in descending order:
from pyspark.sql.functions import col
df.orderBy(col("Age").desc()).show()
# Equivalent:
df.sort(col("Age").desc()).show()
Output:
+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|      Bob| 45|        IT|  4000|
|    David| 36|        IT|  2500|
|    Alice| 34|        HR|  3000|
|Catherine| 29|        HR|  5000|
|      Eve| 28|     Sales|  2800|
+---------+---+----------+------+
3. Sort by Multiple Columns
Sort by Department (ascending) and Salary (descending):
df.orderBy("Department", col("Salary").desc()).show()
Output:
+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|Catherine| 29|        HR|  5000|
|    Alice| 34|        HR|  3000|
|      Bob| 45|        IT|  4000|
|    David| 36|        IT|  2500|
|      Eve| 28|     Sales|  2800|
+---------+---+----------+------+
4. Sort by Multiple Columns with Custom Sort Orders
Pass a list to ascending to set the direction of each column. Here Department is sorted ascending and Age descending:
df.orderBy(["Department", "Age"], ascending=[True, False]).show()
Output:
+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|    Alice| 34|        HR|  3000|
|Catherine| 29|        HR|  5000|
|      Bob| 45|        IT|  4000|
|    David| 36|        IT|  2500|
|      Eve| 28|     Sales|  2800|
+---------+---+----------+------+
Additional Use Cases
5. Sort by Expression
Sort by a calculated value such as Salary + Age:
from pyspark.sql.functions import expr
df.orderBy(expr("Salary + Age").desc()).show()
Output:
+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|Catherine| 29|        HR|  5000|
|      Bob| 45|        IT|  4000|
|    Alice| 34|        HR|  3000|
|      Eve| 28|     Sales|  2800|
|    David| 36|        IT|  2500|
+---------+---+----------+------+
6. Sorting Large Datasets
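A global sort shuffles data across the cluster. The number of partitions used for that shuffle is controlled by spark.sql.shuffle.partitions (200 by default). Lowering it for small data or raising it for large data can help, but the right value depends on data size and cluster resources: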
spark.conf.set("spark.sql.shuffle.partitions", 8)
df.orderBy("Salary").show()
7. Temporary View and SQL Sorting
Create a temporary view and use SQL for sorting:
df.createOrReplaceTempView("employees")
spark.sql("SELECT * FROM employees ORDER BY Salary DESC").show()
Output:
+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|Catherine| 29|        HR|  5000|
|      Bob| 45|        IT|  4000|
|    Alice| 34|        HR|  3000|
|      Eve| 28|     Sales|  2800|
|    David| 36|        IT|  2500|
+---------+---+----------+------+
Summary
- orderBy() and sort() are interchangeable ways to sort a DataFrame.
- You can sort by single or multiple columns with different sort orders.
- Custom expressions and SQL integration expand the use cases for sorting.