Coding Questions in Spark SQL, PySpark, and Python

Employees Earning More than Their Managers – in PySpark SQL and the DataFrame API

To find employees earning more than their managers in PySpark, we need an employees table in which each row stores the employee’s salary together with a manager_id that points to the manager’s row in the same table. Here’s how you can solve this using PySpark SQL and the DataFrame API.

Sample Data

Assume the following schema for the employees DataFrame:

  • id (int): Employee ID
  • name (string): Employee name
  • salary (int): Employee salary
  • manager_id (int): Manager ID (refers to the id of the manager in the same table)

Sample Data:

+----+-------+--------+------------+
| id | name  | salary | manager_id |
+----+-------+--------+------------+
| 1  | John  | 70000  | NULL       |
| 2  | Alice | 60000  | 1          |
| 3  | Bob   | 80000  | 1          |
| 4  | Carol | 90000  | 2          |
| 5  | Dave  | 50000  | 2          |
+----+-------+--------+------------+
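
Optionally, the schema can be declared explicitly instead of being inferred from the sample tuples. The sketch below is one way to do that (the exact types are assumptions based on the column descriptions above); it could be passed to the createDataFrame call shown in the next section.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Explicit schema: manager_id is nullable because top-level
# employees (e.g., John) have no manager.
employee_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("salary", IntegerType(), False),
    StructField("manager_id", IntegerType(), True),
])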

Using PySpark SQL

  1. Create a self-join on the employees table where the manager_id matches the id of another employee.
  2. Filter employees whose salary is greater than their manager’s salary.

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.master("local").appName("EmployeesMoreThanManagers").getOrCreate()

# Create DataFrame
data = [
    (1, "John", 70000, None), 
    (2, "Alice", 60000, 1), 
    (3, "Bob", 80000, 1), 
    (4, "Carol", 90000, 2), 
    (5, "Dave", 50000, 2)
]
columns = ["id", "name", "salary", "manager_id"]

employees = spark.createDataFrame(data, columns)

# Register the DataFrame as a temporary SQL table
employees.createOrReplaceTempView("employees")

# PySpark SQL Query
query = """
SELECT e.name AS employee_name, e.salary AS employee_salary, m.name AS manager_name, m.salary AS manager_salary
FROM employees e
JOIN employees m
  ON e.manager_id = m.id
WHERE e.salary > m.salary
"""

result_sql = spark.sql(query)
result_sql.show()

Output:

+--------------+----------------+-------------+---------------+
| employee_name| employee_salary| manager_name| manager_salary|
+--------------+----------------+-------------+---------------+
| Bob          | 80000          | John        | 70000         |
| Carol        | 90000          | Alice       | 60000         |
+--------------+----------------+-------------+---------------+

Using PySpark DataFrame API

from pyspark.sql.functions import col

# Self-join on manager_id and id
joined_df = employees.alias("e").join(
    employees.alias("m"),
    col("e.manager_id") == col("m.id"),
    "inner"
)

# Filter employees earning more than their managers
result_df = joined_df.filter(col("e.salary") > col("m.salary")) \
    .select(
        col("e.name").alias("employee_name"),
        col("e.salary").alias("employee_salary"),
        col("m.name").alias("manager_name"),
        col("m.salary").alias("manager_salary")
    )

result_df.show()

Output:

+--------------+----------------+-------------+---------------+
| employee_name| employee_salary| manager_name| manager_salary|
+--------------+----------------+-------------+---------------+
| Bob          | 80000          | John        | 70000         |
| Carol        | 90000          | Alice       | 60000         |
+--------------+----------------+-------------+---------------+

Explanation

  1. Self-Join:
    • e (alias for employees) represents the employee.
    • m (alias for employees) represents the manager.
    • Join on e.manager_id = m.id to link each employee to their manager; employees with a NULL manager_id (such as John) are dropped by the inner join.
  2. Filter Condition:
    • e.salary > m.salary: Ensures we only select employees earning more than their managers.
  3. Select Columns:
    • Display relevant columns, including the employee’s name and salary, along with their manager’s name and salary.
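
As a variation on the DataFrame API solution above, the salary comparison can be folded directly into the join condition instead of being applied as a separate filter. This is just a sketch of the same logic rewritten that way (result_alt is an illustrative name); it produces the same output for the sample data.

from pyspark.sql.functions import col

# Same self-join, with the salary comparison included in the join condition
result_alt = employees.alias("e").join(
    employees.alias("m"),
    (col("e.manager_id") == col("m.id")) & (col("e.salary") > col("m.salary")),
    "inner"
).select(
    col("e.name").alias("employee_name"),
    col("e.salary").alias("employee_salary"),
    col("m.name").alias("manager_name"),
    col("m.salary").alias("manager_salary")
)

result_alt.show()

Spark’s optimizer will typically push the filter into the join anyway, so both forms usually behave the same; the choice is mainly about readability.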