Examples
1. Rank Employees by Salary in Each Department
SELECT EmpID,
       Emp_name,
       Department,
       Salary,
       RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS salary_rank
FROM Employee;
Explanation:
- PARTITION BY Department: restarts the ranking within each department.
- ORDER BY Salary DESC: ranks employees by salary in descending order.
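Note that RANK() leaves gaps after ties: if two employees share the highest salary, both get rank 1 and the next employee gets rank 3. As an illustrative sketch against the same Employee table, the query below contrasts RANK with DENSE_RANK (no gaps after ties) and ROW_NUMBER (unique numbers, ties broken arbitrarily):
SELECT EmpID,
       Emp_name,
       Department,
       Salary,
       RANK()       OVER (PARTITION BY Department ORDER BY Salary DESC) AS salary_rank,
       DENSE_RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS salary_dense_rank,
       ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC) AS salary_row_number
FROM Employee;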
2. Calculate Cumulative Sum of Sales
SELECT CustomerID,
       OrderDate,
       ProductID,
       SUM(Sales) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS cumulative_sales
FROM Orders;
Explanation:
- PARTITION BY CustomerID: cumulative sales are calculated separately for each customer.
- ORDER BY OrderDate: sales are accumulated in chronological order.
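When ORDER BY is present and no frame is specified, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so orders sharing the same OrderDate receive the same cumulative total. If each physical row should get its own running total, spell the frame out with ROWS; a sketch of that variant, assuming the same Orders table:
SELECT CustomerID,
       OrderDate,
       ProductID,
       SUM(Sales) OVER (PARTITION BY CustomerID
                        ORDER BY OrderDate
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sales
FROM Orders;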
3. Find Running Average Salary
SELECT EmpID,
       Emp_name,
       Department,
       AVG(Salary) OVER (PARTITION BY Department
                         ORDER BY Salary
                         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS running_avg
FROM Employee;
Explanation:
- ROWS BETWEEN 2 PRECEDING AND CURRENT ROW: averages the current row's salary with the two preceding rows' salaries. For the first rows of a partition, fewer than two preceding rows exist, so the average covers only the rows that are available.
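The frame can also extend forward. As an illustrative variant on the same Employee table, a centered three-row window averages each salary with its immediate neighbors:
SELECT EmpID,
       Emp_name,
       Department,
       Salary,
       AVG(Salary) OVER (PARTITION BY Department
                         ORDER BY Salary
                         ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS centered_avg
FROM Employee;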
4. Lead and Lag Example
SELECT EmpID,
       Emp_name,
       Department,
       Salary,
       LAG(Salary, 1, 0) OVER (PARTITION BY Department ORDER BY Salary) AS previous_salary,
       LEAD(Salary, 1, 0) OVER (PARTITION BY Department ORDER BY Salary) AS next_salary
FROM Employee;
Explanation:
- LAG(Salary, 1, 0): fetches the previous salary within the partition; returns the default 0 if there is no previous row.
- LEAD(Salary, 1, 0): fetches the next salary within the partition; returns the default 0 if there is no next row.
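A common use of LAG is computing the difference between consecutive rows. As a sketch on the same Employee table, here omitting the default so the first row in each department yields NULL rather than a misleading gap:
SELECT EmpID,
       Emp_name,
       Department,
       Salary,
       Salary - LAG(Salary) OVER (PARTITION BY Department ORDER BY Salary) AS salary_gap
FROM Employee;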
5. Find Employees Above Average Salary in Their Department
SELECT EmpID, Emp_name, Department, Salary, avg_salary
FROM (
    SELECT EmpID,
           Emp_name,
           Department,
           Salary,
           AVG(Salary) OVER (PARTITION BY Department) AS avg_salary
    FROM Employee
) t
WHERE Salary > avg_salary;
Explanation:
- AVG(Salary) OVER (PARTITION BY Department): calculates the average salary for each department.
- Window functions cannot be referenced directly in a WHERE clause, because WHERE is evaluated before window functions are computed. The inner query therefore attaches avg_salary to every row, and the outer WHERE filters employees whose salary exceeds their department average.
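The same pattern reads more clearly with a common table expression, which Spark SQL supports; an equivalent sketch:
WITH dept_salaries AS (
    SELECT EmpID,
           Emp_name,
           Department,
           Salary,
           AVG(Salary) OVER (PARTITION BY Department) AS avg_salary
    FROM Employee
)
SELECT EmpID, Emp_name, Department, Salary, avg_salary
FROM dept_salaries
WHERE Salary > avg_salary;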
Notes
- Spark SQL does not support DISTINCT inside window functions. For example, COUNT(DISTINCT column) OVER (...) is not allowed; a common workaround is sketched after these notes.
- Optimized for distributed computing: unlike traditional single-node SQL engines, Spark SQL evaluates window functions over large datasets by distributing the computation across the cluster.
- Frame specification: use ROWS for row-level (physical offset) frames and RANGE for value-based (logical offset) frames; the second sketch below shows the difference.
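A widely used workaround for the missing COUNT(DISTINCT ...) OVER (...) is to collect the distinct values into a set per partition and take its size. A sketch, assuming the Orders table from example 2:
SELECT CustomerID,
       OrderDate,
       ProductID,
       SIZE(COLLECT_SET(ProductID) OVER (PARTITION BY CustomerID)) AS distinct_products
FROM Orders;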
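To see the ROWS-versus-RANGE difference concretely, the sketch below computes both running sums over the Employee table. With RANGE, rows that tie on Salary are peers and share one total; with ROWS, the frame advances one physical row at a time:
SELECT EmpID,
       Department,
       Salary,
       SUM(Salary) OVER (PARTITION BY Department
                         ORDER BY Salary
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_running_sum,
       SUM(Salary) OVER (PARTITION BY Department
                         ORDER BY Salary
                         RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_running_sum
FROM Employee;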