For Better understanding on Spark SQL windows Function and Best Usecases do refer our post Window functions in Oracle Pl/Sql and Hive explained and compared with examples.
Window functions in Spark SQL are powerful tools that allow you to perform calculations across a set of table rows that are somehow related to the current row. These functions are particularly useful for tasks that require a global view of the data, such as running totals, ranking, and time-series analysis.
Overview of Window Functions
In Spark SQL, window functions are used with the OVER
clause and can be categorized into several types:
- Ranking Functions:
ROW_NUMBER()
: Assigns a unique number to each row within the partition of a result set.RANK()
: Assigns a rank to each row within the partition of a result set, with gaps for ties.DENSE_RANK()
: Similar toRANK()
, but without gaps between the ranks.NTILE(n)
: Distributes the rows into a specified number of groups and assigns a group number to each row.
- Analytic Functions:
CUME_DIST()
: Computes the cumulative distribution of a value in a group of values.PERCENT_RANK()
: Computes the rank of a value in a group of values as a percentage.
- Aggregate Functions:
SUM()
,AVG()
,MIN()
,MAX()
,COUNT()
: These functions can be used as window functions to perform aggregations over a set of rows.
- Value Functions:
LAG()
: Accesses data from a previous row in the same result set.LEAD()
: Accesses data from a subsequent row in the same result set.FIRST_VALUE()
: Returns the first value in an ordered set of values.LAST_VALUE()
: Returns the last value in an ordered set of values.
Window Specification
To use a window function, you need to define a window specification that includes:
- Partitioning: Defines the subsets of data to which the function is applied.
- Ordering: Defines the order of rows within each partition.
- Frame: Defines the subset of rows relative to the current row.
Basic Syntax of Window Fuction in Spark SQL
SELECT column1, column2,
aggregate_function(column) OVER (PARTITION BY column_name ORDER BY column_name [ROWS/RANGE frame_clause]) AS alias
FROM table_name;
Components of a Window Function
- Aggregate or Analytical Function:
- Examples:
SUM()
,AVG()
,COUNT()
,ROW_NUMBER()
,RANK()
, etc.
- Examples:
OVER
Clause:- Specifies the window over which the function operates.
PARTITION BY
Clause (optional):- Divides the dataset into partitions.
- Each partition is processed independently.
ORDER BY
Clause (optional):- Specifies the ordering of rows within a partition.
- Frame Specification (optional):
- Defines the range of rows within a partition for computation.
- Options:
ROWS
,RANGE
.
1. PARTITION BY
Clause
- Divides the dataset into groups (partitions) for which the window function is applied independently.
- Syntax:
PARTITION BY column_name
- Example:
AVG(Salary) OVER (PARTITION BY Department)
This computes the average salary separately for each department.
2. ORDER BY
Clause
- Defines the order of rows within each partition.
- Syntax:
ORDER BY column_name [ASC | DESC] [NULLS FIRST | NULLS LAST]
ASC
: Ascending order (default).DESC
: Descending order.NULLS FIRST
: PlacesNULL
values at the beginning of the order.NULLS LAST
: PlacesNULL
values at the end of the order.
- Example:
RANK() OVER (PARTITION BY Department ORDER BY Salary DESC NULLS LAST)
3. Frame Specification
- Defines the range of rows to consider for the window function.
- Frame types:
ROWS
: Based on row positions.RANGE
: Based on value ranges.
Frame Syntax
[ROWS | RANGE] BETWEEN frame_start AND frame_end
Frame Options
UNBOUNDED PRECEDING
:- Includes all rows from the beginning of the partition.
- Example:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
.
CURRENT ROW
:- Includes the current row only.
- Example:
ROWS BETWEEN CURRENT ROW AND CURRENT ROW
.
UNBOUNDED FOLLOWING
:- Includes all rows until the end of the partition.
- Example:
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
.
- Specific Number of Rows:
- Includes a fixed number of preceding or following rows.
- Example:
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING
.
Examples
- Cumulative sum:
SUM(Sales) OVER (PARTITION BY CustomerID ORDER BY OrderDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
- Moving average:A
VG(Salary) OVER (PARTITION BY Department ORDER BY Salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
4. Handling NULL Values in ORDER BY
- By default:
NULLS FIRST
for ascending order.NULLS LAST
for descending order.
Example
RANK() OVER (PARTITION BY Department ORDER BY Salary DESC NULLS FIRST)
5. Supported Functions with OVER
Clause
Spark SQL supports many window functions, including:
Aggregate Functions
SUM()
,AVG()
,COUNT()
,MIN()
,MAX()
Ranking Functions
ROW_NUMBER()
,RANK()
,DENSE_RANK()
,NTILE(n)
Value Functions
FIRST_VALUE()
,LAST_VALUE()
,LEAD(column, offset, default)
,LAG(column, offset, default)
6. Combining PARTITION BY
, ORDER BY
, and Frames
You can combine all components for more control:
Example:
SELECT EmpID,
Department,
Salary,
AVG(Salary) OVER (PARTITION BY Department ORDER BY Salary ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS running_avg,
RANK() OVER (PARTITION BY Department ORDER BY Salary DESC NULLS LAST) AS salary_rank
FROM Employee;
This example:
- Computes a running average for the last three salaries within each department.
- Ranks employees by salary in descending order, placing
NULL
salaries at the end.
Summary of Options
Clause | Description | Syntax Example |
---|---|---|
PARTITION BY | Divides rows into partitions | PARTITION BY Department |
ORDER BY | Orders rows within partitions | ORDER BY Salary DESC NULLS FIRST |
ROWS | Row-based range for window frame | ROWS BETWEEN 2 PRECEDING AND CURRENT ROW |
RANGE | Value-based range for window frame | RANGE BETWEEN INTERVAL 1 DAY PRECEDING AND CURRENT ROW |
NULLS | Specifies how nulls are handled in ordering | ORDER BY Salary DESC NULLS LAST |
By combining these options, Spark SQL allows flexible and efficient computation over distributed datasets!
Leave a Reply