Spark SQL Window Functions and Best Use Cases

Examples

1. Rank Employees by Salary in Each Department

SELECT EmpID, 
       Emp_name, 
       Department, 
       Salary, 
       RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS rank
FROM Employee;

Explanation:

  • PARTITION BY Department: Restarts the ranking for each department.
  • ORDER BY Salary DESC: Ranks employees by salary in descending order, so the highest-paid employee in each department gets rank 1.
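Window syntax in Spark SQL follows the SQL standard, so the query can be tried without a cluster. A minimal runnable sketch using Python's built-in sqlite3 module as a stand-in for Spark SQL (the table contents are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (EmpID INT, Emp_name TEXT, Department TEXT, Salary INT);
INSERT INTO Employee VALUES
  (1, 'Alice', 'Sales', 5000),
  (2, 'Bob',   'Sales', 7000),
  (3, 'Cara',  'HR',    6000),
  (4, 'Dan',   'HR',    6000);
""")

# RANK() restarts per department; ties (Cara and Dan) share the same rank.
rows = conn.execute("""
SELECT EmpID, Emp_name, Department, Salary,
       RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS salary_rank
FROM Employee
""").fetchall()

for r in rows:
    print(r)
```

With this data, Bob ranks 1 in Sales and Alice ranks 2, while Cara and Dan tie at rank 1 in HR.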

2. Calculate Cumulative Sum of Sales

SELECT CustomerID, 
       OrderDate,
       ProductID,
       SUM(Sales) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS cumulative_sales
FROM Orders;

Explanation:

  • PARTITION BY CustomerID: Cumulative sales calculated per customer.
  • ORDER BY OrderDate: Sales are accumulated in chronological order.
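A runnable sketch of the same running total, again using Python's sqlite3 module in place of Spark SQL (table and order data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (CustomerID INT, OrderDate TEXT, ProductID INT, Sales INT);
INSERT INTO Orders VALUES
  (1, '2024-01-01', 10, 100),
  (1, '2024-01-05', 11, 150),
  (1, '2024-01-09', 12, 50),
  (2, '2024-01-02', 10, 200);
""")

# With ORDER BY in the window, SUM becomes a running total per customer.
rows = conn.execute("""
SELECT CustomerID, OrderDate, ProductID,
       SUM(Sales) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS cumulative_sales
FROM Orders
ORDER BY CustomerID, OrderDate
""").fetchall()

for r in rows:
    print(r)
```

Customer 1 accumulates 100, then 250, then 300; customer 2 starts its own total at 200, because the partition resets the sum.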

3. Find Running Average Salary

SELECT EmpID, 
       Emp_name, 
       Department, 
       AVG(Salary) OVER (PARTITION BY Department ORDER BY Salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS running_avg
FROM Employee;

Explanation:

  • ROWS BETWEEN 2 PRECEDING AND CURRENT ROW: Calculates the average salary of the current row and the two preceding rows.
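The sliding-frame behavior can be verified with the same sqlite3 stand-in (sample salaries invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (EmpID INT, Emp_name TEXT, Department TEXT, Salary INT);
INSERT INTO Employee VALUES
  (1, 'Alice', 'Sales', 4000),
  (2, 'Bob',   'Sales', 5000),
  (3, 'Cara',  'Sales', 6000),
  (4, 'Dan',   'Sales', 7000);
""")

# Each row averages itself with at most the two rows before it in salary order.
rows = conn.execute("""
SELECT EmpID, Salary,
       AVG(Salary) OVER (PARTITION BY Department ORDER BY Salary
                         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS running_avg
FROM Employee
ORDER BY Salary
""").fetchall()

for r in rows:
    print(r)
```

Note the frame shrinks at the start of the partition: the first row averages only itself (4000), the second averages two rows (4500), and from the third row onward the full three-row window applies (5000, then 6000).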

4. Lead and Lag Example

SELECT EmpID, 
       Emp_name, 
       Department, 
       Salary, 
       LAG(Salary, 1, 0) OVER (PARTITION BY Department ORDER BY Salary) AS previous_salary,
       LEAD(Salary, 1, 0) OVER (PARTITION BY Department ORDER BY Salary) AS next_salary
FROM Employee;

Explanation:

  • LAG(Salary, 1, 0): Fetches the previous salary within the partition. Default is 0 if there is no previous value.
  • LEAD(Salary, 1, 0): Fetches the next salary within the partition. Default is 0 if there is no next value.
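The same default-value behavior at the partition edges can be seen with the sqlite3 stand-in (one invented department of three salaries):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (EmpID INT, Emp_name TEXT, Department TEXT, Salary INT);
INSERT INTO Employee VALUES
  (1, 'Alice', 'Sales', 4000),
  (2, 'Bob',   'Sales', 5000),
  (3, 'Cara',  'Sales', 6000);
""")

# The third argument (0) replaces NULL at the first/last row of the partition.
rows = conn.execute("""
SELECT EmpID, Salary,
       LAG(Salary, 1, 0)  OVER (PARTITION BY Department ORDER BY Salary) AS previous_salary,
       LEAD(Salary, 1, 0) OVER (PARTITION BY Department ORDER BY Salary) AS next_salary
FROM Employee
ORDER BY Salary
""").fetchall()

for r in rows:
    print(r)
```

The lowest salary has no predecessor, so previous_salary falls back to 0; likewise next_salary is 0 for the highest salary in the partition.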

5. Find Employees Above Average Salary in Their Department

SELECT EmpID, 
       Emp_name, 
       Department, 
       Salary
FROM (
    SELECT EmpID, 
           Emp_name, 
           Department, 
           Salary, 
           AVG(Salary) OVER (PARTITION BY Department) AS avg_salary
    FROM Employee
) t
WHERE Salary > avg_salary;

Explanation:

  • AVG(Salary) OVER (PARTITION BY Department): Calculates the average salary for each department.
  • Window functions cannot appear directly in a WHERE clause, because WHERE is evaluated before window functions. The average is therefore computed in a subquery, and the outer query filters employees whose salary is above their department average.
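Because a window function cannot be referenced directly in WHERE, the filter must go in an outer query. A runnable sketch of that pattern with the sqlite3 stand-in (sample data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (EmpID INT, Emp_name TEXT, Department TEXT, Salary INT);
INSERT INTO Employee VALUES
  (1, 'Alice', 'Sales', 5000),
  (2, 'Bob',   'Sales', 7000),
  (3, 'Cara',  'HR',    6000),
  (4, 'Dan',   'HR',    4000);
""")

# The inner query attaches the per-department average to every row;
# the outer query can then filter on that plain column.
rows = conn.execute("""
SELECT EmpID, Emp_name, Department, Salary
FROM (
    SELECT *, AVG(Salary) OVER (PARTITION BY Department) AS avg_salary
    FROM Employee
) t
WHERE Salary > avg_salary
""").fetchall()

for r in rows:
    print(r)
```

Here the Sales average is 6000 and the HR average is 5000, so only Bob and Cara survive the filter.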

Notes

  1. Spark SQL does not support DISTINCT in window functions.
    • For example, COUNT(DISTINCT column) OVER (...) is not allowed in Spark SQL.
  2. Optimized for Distributed Computing:
    • Unlike traditional SQL, Spark SQL handles large-scale datasets by distributing computations across the cluster.
  3. Frame Specification:
    • Use ROWS for row-level computations.
    • Use RANGE for value-based computations.
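The ROWS-versus-RANGE distinction only shows up when the ORDER BY column has ties: ROWS counts physical rows, while RANGE treats all peer rows with the same value as one unit. A small sqlite3 sketch (invented values, with a duplicated 20):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T (v INT);
INSERT INTO T VALUES (10), (20), (20), (30);
""")

# ROWS accumulates one physical row at a time;
# RANGE includes every peer with the same ORDER BY value at once.
rows = conn.execute("""
SELECT v,
       SUM(v) OVER (ORDER BY v ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum,
       SUM(v) OVER (ORDER BY v RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum
FROM T
ORDER BY v
""").fetchall()

for r in rows:
    print(r)
```

Both rows with v = 20 get the same range_sum (50), because they are peers; their rows_sum values differ (30 and 50), and the order between tied rows is not deterministic.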
