PySpark orderBy() and sort() Operations

In PySpark, orderBy() and sort() both sort the rows of a DataFrame. orderBy() is an alias for sort(), so the two can be used interchangeably. Each returns a new DataFrame and leaves the original unchanged. You can sort by one or more columns, each in ascending or descending order.


Syntax

DataFrame.orderBy(*cols, ascending=True)
DataFrame.sort(*cols, ascending=True)
  • cols: One or more column names (strings) or Column expressions to sort by, passed as separate arguments or as a single list.
  • ascending (default True):
    • A single boolean (True for ascending, False for descending) applies to all sort columns.
    • A list of booleans sets the sort order for each column; its length must match the number of sort columns.

Example Dataset

data = [
    ("Alice", 34, "HR", 3000),
    ("Bob", 45, "IT", 4000),
    ("Catherine", 29, "HR", 5000),
    ("David", 36, "IT", 2500),
    ("Eve", 28, "Sales", 2800)
]

columns = ["Name", "Age", "Department", "Salary"]

df = spark.createDataFrame(data, schema=columns)
df.show()

Output:

+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|    Alice| 34|        HR|  3000|
|      Bob| 45|        IT|  4000|
|Catherine| 29|        HR|  5000|
|    David| 36|        IT|  2500|
|      Eve| 28|     Sales|  2800|
+---------+---+----------+------+

Sorting Examples

1. Sort by Single Column (Ascending)

Sort the DataFrame by the Age column in ascending order:

df.orderBy("Age").show()
# Equivalent:
df.sort("Age").show()

Output:

+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|      Eve| 28|     Sales|  2800|
|Catherine| 29|        HR|  5000|
|    Alice| 34|        HR|  3000|
|    David| 36|        IT|  2500|
|      Bob| 45|        IT|  4000|
+---------+---+----------+------+

2. Sort by Single Column (Descending)

Sort by the Age column in descending order:

from pyspark.sql.functions import col

df.orderBy(col("Age").desc()).show()
# Equivalent:
df.sort(col("Age").desc()).show()

Output:

+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|      Bob| 45|        IT|  4000|
|    David| 36|        IT|  2500|
|    Alice| 34|        HR|  3000|
|Catherine| 29|        HR|  5000|
|      Eve| 28|     Sales|  2800|
+---------+---+----------+------+

3. Sort by Multiple Columns

Sort by Department (ascending) and Salary (descending):

df.orderBy("Department", col("Salary").desc()).show()

Output:

+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|Catherine| 29|        HR|  5000|
|    Alice| 34|        HR|  3000|
|      Bob| 45|        IT|  4000|
|    David| 36|        IT|  2500|
|      Eve| 28|     Sales|  2800|
+---------+---+----------+------+

4. Sort by Multiple Columns with Custom Sort Orders

Pass a list to ascending to set the direction per column, here Department ascending and Age descending:

df.orderBy(["Department", "Age"], ascending=[True, False]).show()

Output:

+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|    Alice| 34|        HR|  3000|
|Catherine| 29|        HR|  5000|
|      Bob| 45|        IT|  4000|
|    David| 36|        IT|  2500|
|      Eve| 28|     Sales|  2800|
+---------+---+----------+------+

Additional Use Cases

5. Sort by Expression

Sort by a calculated value such as Salary + Age:

from pyspark.sql.functions import expr

df.orderBy(expr("Salary + Age").desc()).show()

Output:

+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|Catherine| 29|        HR|  5000|
|      Bob| 45|        IT|  4000|
|    Alice| 34|        HR|  3000|
|      Eve| 28|     Sales|  2800|
|    David| 36|        IT|  2500|
+---------+---+----------+------+

6. Sorting Large Datasets

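orderBy() performs a global sort, which shuffles data across the cluster. The spark.sql.shuffle.partitions setting (default 200) controls how many partitions that shuffle targets, so tuning it to your data volume and cluster size can help on large datasets: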

spark.conf.set("spark.sql.shuffle.partitions", 8)
df.orderBy("Salary").show()

7. Temporary View and SQL Sorting

Create a temporary view and use SQL for sorting:

df.createOrReplaceTempView("employees")
spark.sql("SELECT * FROM employees ORDER BY Salary DESC").show()

Output:

+---------+---+----------+------+
|     Name|Age|Department|Salary|
+---------+---+----------+------+
|Catherine| 29|        HR|  5000|
|      Bob| 45|        IT|  4000|
|    Alice| 34|        HR|  3000|
|      Eve| 28|     Sales|  2800|
|    David| 36|        IT|  2500|
+---------+---+----------+------+

Summary

  • orderBy() and sort() are interchangeable; each returns a new DataFrame sorted by the given columns.
  • You can sort by single or multiple columns with different sort orders.
  • Custom expressions and SQL integration expand the use cases for sorting.

