3. PySpark orderBy() and sort() Operations
In PySpark, both orderBy() and sort() sort the rows of a DataFrame. They can be used interchangeably: orderBy() is an alias for sort(), so both accept the same arguments and return the same result.
Syntax
DataFrame.orderBy(*cols, ascending=True)
DataFrame.sort(*cols, ascending=True)
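cols: Column names (strings) or Column expressions to sort by, passed either as separate arguments or as a single list.
ascending: Boolean or list of booleans (default True). If a single boolean is provided, it applies to all columns. If a list is provided, it specifies the sort order for each corresponding column, so its length should match the number of sort columns.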
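from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

# Reuse the active session if one exists (for example in a notebook), otherwise start one.
# The app name is arbitrary.
spark = SparkSession.builder.appName("OrderByExample").getOrCreate()

data = [
("Alice", 34, "HR", 3000),
("Bob", 45, "IT", 4000),
("Catherine", 29, "HR", 5000),
("David", 36, "IT", 2500),
("Eve", 28, "Sales", 2800)
]
columns = ["Name", "Age", "Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)
df.show()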
1. Order By Single Column Ascending
df.orderBy("Age").show()
Equivalent:
df.sort("Age").show()
Output:
+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|      Eve| 28|     Sales|  2800|
|Catherine| 29|        HR|  5000|
|    Alice| 34|        HR|  3000|
|    David| 36|        IT|  2500|
|      Bob| 45|        IT|  4000|
+---------+---+----------+------+
2. Order By Single Column Descending
df.orderBy(col("Age").desc()).show()
df.sort(col("Age").desc()).show()
Output:
+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|      Bob| 45|        IT|  4000|
|    David| 36|        IT|  2500|
|    Alice| 34|        HR|  3000|
|Catherine| 29|        HR|  5000|
|      Eve| 28|     Sales|  2800|
+---------+---+----------+------+
3. Order By Multiple Columns
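# Both columns ascending (the default)
df.sort("Department", "Age").show()
# Explicit direction per column: Department ascending, Age descending
df.sort(col("Department").asc(), col("Age").desc()).show()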
4. Order By Multiple Columns with Different Sort Orders
df.orderBy(["Department", "Age"], ascending=[True, False]).show()
df.sort(["Department", "Age"], ascending=[True, False]).show()
Output:
+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|    Alice| 34|        HR|  3000|
|Catherine| 29|        HR|  5000|
|      Bob| 45|        IT|  4000|
|    David| 36|        IT|  2500|
|      Eve| 28|     Sales|  2800|
+---------+---+----------+------+
Useful Examples
Sorting by Salary in Descending Order
df.orderBy(col("Salary").desc()).show()
Output:
+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|Catherine| 29|        HR|  5000|
|      Bob| 45|        IT|  4000|
|    Alice| 34|        HR|  3000|
|      Eve| 28|     Sales|  2800|
|    David| 36|        IT|  2500|
+---------+---+----------+------+
Sorting by Department and then by Salary (Descending) within each Department
df.orderBy("Department", col("Salary").desc()).show()
Output:
+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|Catherine| 29|        HR|  5000|
|    Alice| 34|        HR|  3000|
|      Bob| 45|        IT|  4000|
|    David| 36|        IT|  2500|
|      Eve| 28|     Sales|  2800|
+---------+---+----------+------+
Sorting with Expression
df.orderBy(expr("Salary + Age").desc()).show()
Output:
+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|Catherine| 29|        HR|  5000|
|      Bob| 45|        IT|  4000|
|    Alice| 34|        HR|  3000|
|      Eve| 28|     Sales|  2800|
|    David| 36|        IT|  2500|
+---------+---+----------+------+