Employees Earning More than Their Managers – in PySpark SQL and DataFrame API
To find employees earning more than their managers in PySpark, we’ll need a table that includes both employees and their managers, along with their salaries. Here’s how you can solve this using PySpark SQL and the DataFrame API.
Sample Data
Assume the following schema for the employees DataFrame:
- id (int): Employee ID
- name (string): Employee name
- salary (int): Employee salary
- manager_id (int): Manager ID (refers to the id of the manager in the same table)
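A note on the schema: the code below lets Spark infer the column types from the sample data, which works here because every column has non-null values in at least some rows. If you prefer an explicit schema (for instance, to mark manager_id as nullable), a minimal sketch could look like the following; employees_schema is an illustrative name and is not used later in the post:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Optional explicit schema; manager_id is nullable because top-level
# employees (like John) have no manager
employees_schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=False),
    StructField("salary", IntegerType(), nullable=False),
    StructField("manager_id", IntegerType(), nullable=True),
])
# If used, it would be passed as the second argument to spark.createDataFrame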
Sample Data:
+----+-------+--------+------------+
| id | name  | salary | manager_id |
+----+-------+--------+------------+
|  1 | John  |  70000 |       NULL |
|  2 | Alice |  60000 |          1 |
|  3 | Bob   |  80000 |          1 |
|  4 | Carol |  90000 |          2 |
|  5 | Dave  |  50000 |          2 |
+----+-------+--------+------------+
Using PySpark SQL
- Create a self-join on the employees table where the manager_id matches the id of another employee.
- Filter employees whose salary is greater than their manager’s salary.
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.master("local").appName("EmployeesMoreThanManagers").getOrCreate()
# Create DataFrame
data = [
(1, "John", 70000, None),
(2, "Alice", 60000, 1),
(3, "Bob", 80000, 1),
(4, "Carol", 90000, 2),
(5, "Dave", 50000, 2)
]
columns = ["id", "name", "salary", "manager_id"]
employees = spark.createDataFrame(data, columns)
# Register the DataFrame as a temporary SQL table
employees.createOrReplaceTempView("employees")
# PySpark SQL Query
query = """
SELECT e.name AS employee_name, e.salary AS employee_salary, m.name AS manager_name, m.salary AS manager_salary
FROM employees e
JOIN employees m
ON e.manager_id = m.id
WHERE e.salary > m.salary
"""
result_sql = spark.sql(query)
result_sql.show()
Output:
+--------------+----------------+-------------+---------------+
| employee_name| employee_salary| manager_name| manager_salary|
+--------------+----------------+-------------+---------------+
|           Bob|           80000|         John|          70000|
|         Carol|           90000|        Alice|          60000|
+--------------+----------------+-------------+---------------+
Using PySpark DataFrame API
from pyspark.sql.functions import col
# Self-join on manager_id and id
joined_df = employees.alias("e").join(
employees.alias("m"),
col("e.manager_id") == col("m.id"),
"inner"
)
# Filter employees earning more than their managers
result_df = joined_df.filter(col("e.salary") > col("m.salary")) \
.select(
col("e.name").alias("employee_name"),
col("e.salary").alias("employee_salary"),
col("m.name").alias("manager_name"),
col("m.salary").alias("manager_salary")
)
result_df.show()
Output:
+--------------+----------------+-------------+---------------+
| employee_name| employee_salary| manager_name| manager_salary|
+--------------+----------------+-------------+---------------+
|           Bob|           80000|         John|          70000|
|         Carol|           90000|        Alice|          60000|
+--------------+----------------+-------------+---------------+
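As an optional sanity check (not part of the original problem), you can confirm that the SQL and DataFrame API versions return the same rows. exceptAll returns the rows of one DataFrame that do not appear in the other, so both differences should be empty:
# Both directions should be empty if the two results are identical
assert result_sql.exceptAll(result_df).count() == 0
assert result_df.exceptAll(result_sql).count() == 0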
Explanation
- Self-Join:
  - e (alias for employees) represents the employee.
  - m (alias for employees) represents the manager.
  - Joining e.manager_id with m.id links each employee to their manager.
- Filter Condition:
  - e.salary > m.salary ensures we only keep employees earning more than their managers.
- Select Columns:
  - Display the relevant columns: the employee’s name and salary, along with their manager’s name and salary (a variant that returns only the employee’s name is sketched below).
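In some versions of this problem (for example, the classic SQL practice formulation), only the employee’s name is required in the output. Assuming the joined_df built above, a minimal sketch of that variant follows; only_names_df is just an illustrative name:
# Keep only the names of employees who earn more than their managers
only_names_df = joined_df.filter(col("e.salary") > col("m.salary")) \
    .select(col("e.name").alias("employee_name"))
only_names_df.show()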