Employees Earning More than Their Managers – in PySpark SQL and DataFrame API
To find employees earning more than their managers in PySpark, we’ll need a table that includes both employees and their managers, along with their salaries. Here’s how you can solve this using PySpark SQL and the DataFrame API.
Sample Data
Assume the following schema for the employees DataFrame:
- id (int): Employee ID
- name (string): Employee name
- salary (int): Employee salary
- manager_id (int): Manager ID (refers to the id of the manager in the same table)
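A note on the schema: the code below lets Spark infer the column types from the sample data, which works here because every column has non-null values in at least some rows. If you prefer an explicit schema (for instance, to mark manager_id as nullable), a minimal sketch could look like the following; employees_schema is an illustrative name and is not used later in the post:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Optional explicit schema; manager_id is nullable because top-level
# employees (like John) have no manager
employees_schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=False),
    StructField("salary", IntegerType(), nullable=False),
    StructField("manager_id", IntegerType(), nullable=True),
])
# If used, it would be passed as the second argument to spark.createDataFrame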
Sample Data:
+----+-------+--------+------------+
| id | name  | salary | manager_id |
+----+-------+--------+------------+
|  1 | John  |  70000 |       NULL |
|  2 | Alice |  60000 |          1 |
|  3 | Bob   |  80000 |          1 |
|  4 | Carol |  90000 |          2 |
|  5 | Dave  |  50000 |          2 |
+----+-------+--------+------------+
Using PySpark SQL
- Create a self-join on the employees table where the manager_id matches the id of another employee.
- Filter employees whose salary is greater than their manager’s salary.
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.master("local").appName("EmployeesMoreThanManagers").getOrCreate()
# Create DataFrame
data = [
(1, "John", 70000, None),
(2, "Alice", 60000, 1),
(3, "Bob", 80000, 1),
(4, "Carol", 90000, 2),
(5, "Dave", 50000, 2)
]
columns = ["id", "name", "salary", "manager_id"]
employees = spark.createDataFrame(data, columns)
# Register the DataFrame as a temporary SQL table
employees.createOrReplaceTempView("employees")
# PySpark SQL Query
query = """
SELECT e.name AS employee_name, e.salary AS employee_salary, m.name AS manager_name, m.salary AS manager_salary
FROM employees e
JOIN employees m
ON e.manager_id = m.id
WHERE e.salary > m.salary
"""
result_sql = spark.sql(query)
result_sql.show()
Output:
+--------------+----------------+-------------+---------------+
| employee_name| employee_salary| manager_name| manager_salary|
+--------------+----------------+-------------+---------------+
|           Bob|           80000|         John|          70000|
|         Carol|           90000|        Alice|          60000|
+--------------+----------------+-------------+---------------+
Using PySpark DataFrame API
from pyspark.sql.functions import col
# Self-join on manager_id and id
joined_df = employees.alias("e").join(
employees.alias("m"),
col("e.manager_id") == col("m.id"),
"inner"
)
# Filter employees earning more than their managers
result_df = joined_df.filter(col("e.salary") > col("m.salary")) \
.select(
col("e.name").alias("employee_name"),
col("e.salary").alias("employee_salary"),
col("m.name").alias("manager_name"),
col("m.salary").alias("manager_salary")
)
result_df.show()
Output:
+--------------+----------------+-------------+---------------+
| employee_name| employee_salary| manager_name| manager_salary|
+--------------+----------------+-------------+---------------+
|           Bob|           80000|         John|          70000|
|         Carol|           90000|        Alice|          60000|
+--------------+----------------+-------------+---------------+
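As an optional sanity check (not part of the original problem), you can confirm that the SQL and DataFrame API versions return the same rows. exceptAll returns the rows of one DataFrame that do not appear in the other, so both differences should be empty:
# Both directions should be empty if the two results are identical
assert result_sql.exceptAll(result_df).count() == 0
assert result_df.exceptAll(result_sql).count() == 0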
Explanation
- Self-Join:
  - e (alias for employees) represents the employee.
  - m (alias for employees) represents the manager.
  - Joining e.manager_id with m.id links each employee to their manager.
- Filter Condition:
  - e.salary > m.salary ensures we only keep employees earning more than their managers.
- Select Columns:
  - Display the relevant columns: the employee’s name and salary, along with their manager’s name and salary (a variant that returns only the employee’s name is sketched below).
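In some versions of this problem (for example, the classic SQL practice formulation), only the employee’s name is required in the output. Assuming the joined_df built above, a minimal sketch of that variant follows; only_names_df is just an illustrative name:
# Keep only the names of employees who earn more than their managers
only_names_df = joined_df.filter(col("e.salary") > col("m.salary")) \
    .select(col("e.name").alias("employee_name"))
only_names_df.show()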