๐งฌ To find Nth-highest instead of second-highest
To find the Nth-highest salary in each department in:
- โ PySpark SQL
- โ PySpark DataFrame API
- โ Pandas
we generalize the logic used for 2nd-highest by using a parameterized rank
(like DENSE_RANK()
or ROW_NUMBER()
).
๐งฑ Sample DataFrame Schema
+----+----------+--------+------------+
| id | name | salary | department |
+----+----------+--------+------------+
โ 1. PySpark SQL: Nth-highest salary
๐น Replace 2
with a parameter like :n
n = 3 # Example: 3rd highest
query = f"""
SELECT *
FROM (
SELECT *,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rnk
FROM employees
) ranked
WHERE rnk = {n}
"""
df.createOrReplaceTempView("employees")
spark.sql(query).show()
โ 2. PySpark DataFrame API: Nth-highest salary
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank, col
n = 3 # Example: 3rd highest
window_spec = Window.partitionBy("department").orderBy(col("salary").desc())
df_with_rank = df.withColumn("rnk", dense_rank().over(window_spec))
nth_highest_df = df_with_rank.filter(col("rnk") == n)
nth_highest_df.show()
โ 3. Pandas: Nth-highest salary
import pandas as pd
data = [
{"id": 1, "name": "Alice", "salary": 70000, "department": "HR"},
{"id": 2, "name": "Bob", "salary": 50000, "department": "HR"},
{"id": 3, "name": "Charlie", "salary": 60000, "department": "HR"},
{"id": 4, "name": "David", "salary": 80000, "department": "IT"},
{"id": 5, "name": "Eve", "salary": 70000, "department": "IT"},
{"id": 6, "name": "Frank", "salary": 60000, "department": "IT"},
]
df = pd.DataFrame(data)
n = 3 # Example: 3rd highest
df["rnk"] = df.groupby("department")["salary"].rank(method="dense", ascending=False)
nth_highest = df[df["rnk"] == n]
print(nth_highest)
โ ๏ธ Choose Between:
DENSE_RANK()
โ if you want ties to be ranked the same (e.g., 70k, 70k = rank 1)ROW_NUMBER()
โ if you want unique ordering even for duplicate salaries
๐งช Output Example (3rd-highest)
name | salary | department | rnk |
---|---|---|---|
Charlie | 60000 | HR | 3 |
Frank | 60000 | IT | 3 |
โ BONUS: Reusable PySpark Function
def get_nth_highest_salary(df, n, col_dept="department", col_salary="salary"):
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank
window_spec = Window.partitionBy(col_dept).orderBy(col(col_salary).desc())
return (
df.withColumn("rank", dense_rank().over(window_spec))
.filter(col("rank") == n)
)