How do you find the second-highest salary for each department in Spark SQL and in the PySpark DataFrame API?
Here are the solutions:
Spark SQL
You can use the ROW_NUMBER() window function to assign a rank to each salary within a department, then select the rows with a rank of 2.
SELECT department, salary
FROM (
SELECT department, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank
FROM employees
) AS subquery
WHERE salary_rank = 2
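To sanity-check the query on a small dataset without a Spark cluster, here is a sketch using Python's built-in sqlite3 module (SQLite supports window functions since version 3.25); the table and sample rows are made up for illustration:

```python
import sqlite3

# Hypothetical sample data; table and column names mirror the query above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("sales", 100), ("sales", 90), ("sales", 80),
     ("hr", 70), ("hr", 60)],
)

# Rank salaries within each department, then keep only rank 2.
rows = conn.execute("""
    SELECT department, salary
    FROM (
        SELECT department, salary,
               ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
        FROM employees
    ) AS subquery
    WHERE salary_rank = 2
    ORDER BY department
""").fetchall()

print(rows)  # [('hr', 60), ('sales', 90)]
```

The same SQL string should run unchanged via spark.sql() against a registered employees table.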
PySpark DataFrame
You can use the Window class to build a window specification, then apply the row_number function to rank each salary within its department.
from pyspark.sql import functions as F
from pyspark.sql import Window
# create a window specification
window_spec = Window.partitionBy("department").orderBy(F.col("salary").desc())
# assign a ranking to each salary within a department
df_with_rank = df.withColumn("salary_rank", F.row_number().over(window_spec))
# select the rows with a ranking of 2
second_highest_salaries = df_with_rank.filter(F.col("salary_rank") == 2).select("department", "salary")
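As a quick cross-check of what the DataFrame pipeline should produce, the same logic can be written in plain Python; the sample rows here are hypothetical:

```python
from collections import defaultdict

# Hypothetical sample rows: (department, salary), mirroring the DataFrame above.
rows = [("sales", 100), ("sales", 90), ("sales", 80),
        ("hr", 70), ("hr", 60)]

# Group salaries by department.
by_dept = defaultdict(list)
for dept, salary in rows:
    by_dept[dept].append(salary)

# Second-highest salary per department: sort descending, take index 1.
# Departments with fewer than two employees have no second-highest salary.
second_highest = {
    dept: sorted(salaries, reverse=True)[1]
    for dept, salaries in by_dept.items()
    if len(salaries) >= 2
}

print(second_highest)  # {'sales': 90, 'hr': 60}
```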
Note that both solutions assume there are no duplicate salaries within a department. If two employees tie for the top salary, ROW_NUMBER() will arbitrarily rank one of them second, so the query returns the tied top salary rather than the second-highest value. If you want the second-highest distinct salary, use the DENSE_RANK() function instead of ROW_NUMBER().
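The difference between the two ranking functions can be seen on a small tied dataset; this sketch again uses stdlib sqlite3 with made-up data in place of Spark:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, salary INTEGER)")
# Two employees tie for the top salary in "sales".
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("sales", 100), ("sales", 100), ("sales", 90)])

# Same query shape, parameterized on the ranking function.
q = """
    SELECT salary FROM (
        SELECT salary, {fn}() OVER (ORDER BY salary DESC) AS r
        FROM employees WHERE department = 'sales'
    ) WHERE r = 2
"""
row_number_result = conn.execute(q.format(fn="ROW_NUMBER")).fetchall()
dense_rank_result = conn.execute(q.format(fn="DENSE_RANK")).fetchall()

print(row_number_result)  # [(100,)] -- the tied top salary, not the second-highest value
print(dense_rank_result)  # [(90,)]  -- the second-highest distinct salary
```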