Coding Questions in Spark SQL, Pyspark, and Python

How do you find the second-highest salary for each department, in Spark SQL and in the PySpark DataFrame API?

Here are the solutions:

Spark SQL

You can use the ROW_NUMBER() window function to rank salaries within each department in descending order, then keep only the rows with rank 2.

SELECT department, salary
FROM (
  SELECT department, salary,
         ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank
  FROM employees
) AS subquery
WHERE salary_rank = 2
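You can sanity-check this query locally without a Spark cluster: SQLite (3.25+) supports the same ROW_NUMBER() OVER (...) syntax, and Python ships with the sqlite3 module. The table name and columns below match the query above; the sample rows are made up for illustration.

```python
import sqlite3

# In-memory database with the same schema the query expects
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        ("Sales", 90000), ("Sales", 80000), ("Sales", 70000),
        ("HR", 60000), ("HR", 55000),
    ],
)

# The same query as above, run unchanged
rows = conn.execute("""
    SELECT department, salary
    FROM (
      SELECT department, salary,
             ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
      FROM employees
    ) AS subquery
    WHERE salary_rank = 2
""").fetchall()

print(sorted(rows))  # second-highest salary per department
```

With the sample data above, this prints `[('HR', 55000), ('Sales', 80000)]`.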

PySpark DataFrame

You can use the Window class to define a window specification, then apply the row_number function over it to rank each salary within its department.

from pyspark.sql import functions as F
from pyspark.sql import Window

# create a window specification
window_spec = Window.partitionBy("department").orderBy(F.col("salary").desc())

# assign a ranking to each salary within a department
df_with_rank = df.withColumn("salary_rank", F.row_number().over(window_spec))

# select the rows with a ranking of 2
second_highest_salaries = df_with_rank.filter(F.col("salary_rank") == 2).select("department", "salary")

Note that both solutions assume there are no duplicate salaries within a department: ROW_NUMBER() breaks ties arbitrarily, so if the top salary is shared by two employees, one of the tied rows gets rank 2. If duplicate salaries are possible, use DENSE_RANK() instead, which assigns rank 2 to the second-highest distinct salary.
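The tie-handling difference can be illustrated in plain Python, without Spark. The salary list below is a made-up single department with a tie for the top salary; the two loops mimic what ROW_NUMBER() and DENSE_RANK() would assign over the same descending ordering.

```python
# One department's salaries, already sorted descending, with a tie at the top
salaries = [95000, 95000, 88000, 80000]

# ROW_NUMBER(): every row gets a distinct sequential rank, ties broken arbitrarily
row_number = list(range(1, len(salaries) + 1))

# DENSE_RANK(): tied values share a rank; the rank increases only on a new value
dense_rank, rank, prev = [], 0, None
for s in salaries:
    if s != prev:
        rank += 1
        prev = s
    dense_rank.append(rank)

print(row_number)  # [1, 2, 3, 4] -> rank 2 is one of the tied 95000 rows
print(dense_rank)  # [1, 1, 2, 3] -> rank 2 is 88000, the second-highest distinct salary
```

So with duplicates present, filtering on ROW_NUMBER() = 2 returns a copy of the highest salary, while DENSE_RANK() = 2 returns the true second-highest value.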
