Coding Questions in Spark SQL, Pyspark, and Python

๐Ÿงฌ To find Nth-highest instead of second-highest

To find the Nth-highest salary in each department in:

  • โœ… PySpark SQL
  • โœ… PySpark DataFrame API
  • โœ… Pandas

we generalize the logic used for 2nd-highest by using a parameterized rank (like DENSE_RANK() or ROW_NUMBER()).


๐Ÿงฑ Sample DataFrame Schema

+----+----------+--------+------------+
| id | name     | salary | department |
+----+----------+--------+------------+

โœ… 1. PySpark SQL: Nth-highest salary

๐Ÿ”น Replace 2 with a parameter like :n

n = 3  # Example: 3rd highest

query = f"""
    SELECT *
    FROM (
        SELECT *,
               DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rnk
        FROM employees
    ) ranked
    WHERE rnk = {n}
"""

df.createOrReplaceTempView("employees")
spark.sql(query).show()

โœ… 2. PySpark DataFrame API: Nth-highest salary

from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank, col

n = 3  # Example: 3rd highest

window_spec = Window.partitionBy("department").orderBy(col("salary").desc())

df_with_rank = df.withColumn("rnk", dense_rank().over(window_spec))

nth_highest_df = df_with_rank.filter(col("rnk") == n)
nth_highest_df.show()

โœ… 3. Pandas: Nth-highest salary

import pandas as pd

data = [
    {"id": 1, "name": "Alice", "salary": 70000, "department": "HR"},
    {"id": 2, "name": "Bob", "salary": 50000, "department": "HR"},
    {"id": 3, "name": "Charlie", "salary": 60000, "department": "HR"},
    {"id": 4, "name": "David", "salary": 80000, "department": "IT"},
    {"id": 5, "name": "Eve", "salary": 70000, "department": "IT"},
    {"id": 6, "name": "Frank", "salary": 60000, "department": "IT"},
]

df = pd.DataFrame(data)
n = 3  # Example: 3rd highest

df["rnk"] = df.groupby("department")["salary"].rank(method="dense", ascending=False)

nth_highest = df[df["rnk"] == n]
print(nth_highest)

โš ๏ธ Choose Between:

  • DENSE_RANK() โ€“ if you want ties to be ranked the same (e.g., 70k, 70k = rank 1)
  • ROW_NUMBER() โ€“ if you want unique ordering even for duplicate salaries

๐Ÿงช Output Example (3rd-highest)

namesalarydepartmentrnk
Charlie60000HR3
Frank60000IT3

โœ… BONUS: Reusable PySpark Function

def get_nth_highest_salary(df, n, col_dept="department", col_salary="salary"):
    from pyspark.sql.window import Window
    from pyspark.sql.functions import dense_rank

    window_spec = Window.partitionBy(col_dept).orderBy(col(col_salary).desc())
    return (
        df.withColumn("rank", dense_rank().over(window_spec))
          .filter(col("rank") == n)
    )

Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19