Spark SQL Window Functions and Best Use Cases

Examples

1. Rank Employees by Salary in Each Department

SELECT EmpID, 
       Emp_name, 
       Department, 
       Salary, 
       RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS rank
FROM Employee;

Explanation:

  • PARTITION BY Department: Restarts the ranking for each department.
  • ORDER BY Salary DESC: Ranks employees by salary in descending order, so the highest-paid employee in each department gets rank 1.
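Window syntax in Spark SQL follows the SQL standard, so the query can be tried without a cluster. A minimal runnable sketch using Python's built-in sqlite3 module as a stand-in for Spark SQL (the table contents are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (EmpID INT, Emp_name TEXT, Department TEXT, Salary INT);
INSERT INTO Employee VALUES
  (1, 'Alice', 'Sales', 5000),
  (2, 'Bob',   'Sales', 7000),
  (3, 'Cara',  'HR',    6000),
  (4, 'Dan',   'HR',    6000);
""")

# RANK() restarts per department; ties (Cara and Dan) share the same rank.
rows = conn.execute("""
SELECT EmpID, Emp_name, Department, Salary,
       RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS salary_rank
FROM Employee
""").fetchall()

for r in rows:
    print(r)
```

With this data, Bob ranks 1 in Sales and Alice ranks 2, while Cara and Dan tie at rank 1 in HR.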

2. Calculate Cumulative Sum of Sales

SELECT CustomerID, 
       OrderDate,
       ProductID,
       SUM(Sales) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS cumulative_sales
FROM Orders;

Explanation:

  • PARTITION BY CustomerID: Cumulative sales calculated per customer.
  • ORDER BY OrderDate: Sales are accumulated in chronological order.
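A runnable sketch of the same running total, again using Python's sqlite3 module in place of Spark SQL (table and order data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (CustomerID INT, OrderDate TEXT, ProductID INT, Sales INT);
INSERT INTO Orders VALUES
  (1, '2024-01-01', 10, 100),
  (1, '2024-01-05', 11, 150),
  (1, '2024-01-09', 12, 50),
  (2, '2024-01-02', 10, 200);
""")

# With ORDER BY in the window, SUM becomes a running total per customer.
rows = conn.execute("""
SELECT CustomerID, OrderDate, ProductID,
       SUM(Sales) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS cumulative_sales
FROM Orders
ORDER BY CustomerID, OrderDate
""").fetchall()

for r in rows:
    print(r)
```

Customer 1 accumulates 100, then 250, then 300; customer 2 starts its own total at 200, because the partition resets the sum.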

3. Find Running Average Salary

SELECT EmpID, 
       Emp_name, 
       Department, 
       AVG(Salary) OVER (PARTITION BY Department ORDER BY Salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS running_avg
FROM Employee;

Explanation:

  • ROWS BETWEEN 2 PRECEDING AND CURRENT ROW: Calculates the average salary of the current row and the two preceding rows.
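The sliding-frame behavior can be verified with the same sqlite3 stand-in (sample salaries invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (EmpID INT, Emp_name TEXT, Department TEXT, Salary INT);
INSERT INTO Employee VALUES
  (1, 'Alice', 'Sales', 4000),
  (2, 'Bob',   'Sales', 5000),
  (3, 'Cara',  'Sales', 6000),
  (4, 'Dan',   'Sales', 7000);
""")

# Each row averages itself with at most the two rows before it in salary order.
rows = conn.execute("""
SELECT EmpID, Salary,
       AVG(Salary) OVER (PARTITION BY Department ORDER BY Salary
                         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS running_avg
FROM Employee
ORDER BY Salary
""").fetchall()

for r in rows:
    print(r)
```

Note the frame shrinks at the start of the partition: the first row averages only itself (4000), the second averages two rows (4500), and from the third row onward the full three-row window applies (5000, then 6000).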

4. Lead and Lag Example

SELECT EmpID, 
       Emp_name, 
       Department, 
       Salary, 
       LAG(Salary, 1, 0) OVER (PARTITION BY Department ORDER BY Salary) AS previous_salary,
       LEAD(Salary, 1, 0) OVER (PARTITION BY Department ORDER BY Salary) AS next_salary
FROM Employee;

Explanation:

  • LAG(Salary, 1, 0): Fetches the previous salary within the partition. Default is 0 if there is no previous value.
  • LEAD(Salary, 1, 0): Fetches the next salary within the partition. Default is 0 if there is no next value.
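The same default-value behavior at the partition edges can be seen with the sqlite3 stand-in (one invented department of three salaries):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (EmpID INT, Emp_name TEXT, Department TEXT, Salary INT);
INSERT INTO Employee VALUES
  (1, 'Alice', 'Sales', 4000),
  (2, 'Bob',   'Sales', 5000),
  (3, 'Cara',  'Sales', 6000);
""")

# The third argument (0) replaces NULL at the first/last row of the partition.
rows = conn.execute("""
SELECT EmpID, Salary,
       LAG(Salary, 1, 0)  OVER (PARTITION BY Department ORDER BY Salary) AS previous_salary,
       LEAD(Salary, 1, 0) OVER (PARTITION BY Department ORDER BY Salary) AS next_salary
FROM Employee
ORDER BY Salary
""").fetchall()

for r in rows:
    print(r)
```

The lowest salary has no predecessor, so previous_salary falls back to 0; likewise next_salary is 0 for the highest salary in the partition.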

5. Find Employees Above Average Salary in Their Department

SELECT EmpID, 
       Emp_name, 
       Department, 
       Salary
FROM (
    SELECT EmpID, 
           Emp_name, 
           Department, 
           Salary, 
           AVG(Salary) OVER (PARTITION BY Department) AS avg_salary
    FROM Employee
) t
WHERE Salary > avg_salary;

Explanation:

  • AVG(Salary) OVER (PARTITION BY Department): Calculates the average salary for each department.
  • Window functions cannot appear directly in a WHERE clause, because WHERE is evaluated before window functions. The average is therefore computed in a subquery, and the outer query filters employees whose salary is above their department average.
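Because a window function cannot be referenced directly in WHERE, the filter must go in an outer query. A runnable sketch of that pattern with the sqlite3 stand-in (sample data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (EmpID INT, Emp_name TEXT, Department TEXT, Salary INT);
INSERT INTO Employee VALUES
  (1, 'Alice', 'Sales', 5000),
  (2, 'Bob',   'Sales', 7000),
  (3, 'Cara',  'HR',    6000),
  (4, 'Dan',   'HR',    4000);
""")

# The inner query attaches the per-department average to every row;
# the outer query can then filter on that plain column.
rows = conn.execute("""
SELECT EmpID, Emp_name, Department, Salary
FROM (
    SELECT *, AVG(Salary) OVER (PARTITION BY Department) AS avg_salary
    FROM Employee
) t
WHERE Salary > avg_salary
""").fetchall()

for r in rows:
    print(r)
```

Here the Sales average is 6000 and the HR average is 5000, so only Bob and Cara survive the filter.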

Notes

  1. Spark SQL does not support DISTINCT in window functions.
    • For example, COUNT(DISTINCT column) OVER (...) is not allowed in Spark SQL.
  2. Optimized for Distributed Computing:
    • Unlike traditional SQL, Spark SQL handles large-scale datasets by distributing computations across the cluster.
  3. Frame Specification:
    • Use ROWS for row-level computations.
    • Use RANGE for value-based computations.
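The ROWS-versus-RANGE distinction only shows up when the ORDER BY column has ties: ROWS counts physical rows, while RANGE treats all peer rows with the same value as one unit. A small sqlite3 sketch (invented values, with a duplicated 20):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T (v INT);
INSERT INTO T VALUES (10), (20), (20), (30);
""")

# ROWS accumulates one physical row at a time;
# RANGE includes every peer with the same ORDER BY value at once.
rows = conn.execute("""
SELECT v,
       SUM(v) OVER (ORDER BY v ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum,
       SUM(v) OVER (ORDER BY v RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum
FROM T
ORDER BY v
""").fetchall()

for r in rows:
    print(r)
```

Both rows with v = 20 get the same range_sum (50), because they are peers; their rows_sum values differ (30 and 50), and the order between tied rows is not deterministic.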
