Window functions in PySpark on Dataframe programming

by lochan2014 | Dec 5, 2024 | Pyspark | 0 comments

There are multiple ways to apply window functions on DataFrames in PySpark. While withColumn is the most commonly used method to add a new column with a window function, there are other approaches to apply window functions, depending on the specific use case.

Here are different methods for applying window functions:

1. Using `select()` with window functions

Instead of withColumn(), you can use select() to directly apply a window function to the columns of the DataFrame. This is useful when you only want to return a subset of columns along with the windowed column.

from pyspark.sql.functions import row_number
from pyspark.sql import Window

# Define the window specification
windowSpec = Window.partitionBy("salary").orderBy("id")

# Use select to apply the window function
df.select("id", "salary", row_number().over(windowSpec).alias("row_number")).show()

2. Using `agg()` with window functions

Window functions can also be applied when performing aggregations (agg()). This is useful when you want to calculate aggregated metrics (e.g., sum, avg) over a window.

from pyspark.sql.functions import sum
from pyspark.sql import Window

# Define the window specification
windowSpec = Window.partitionBy("salary")

# Apply window function during aggregation
df.groupBy("id").agg(sum("salary").over(windowSpec).alias("total_salary")).show()

3. Using `filter()` or `where()`

Sometimes, window functions are used in conjunction with filters to extract specific rows, such as filtering the first or last row per partition.

from pyspark.sql.functions import row_number

# Define the window specification
windowSpec = Window.partitionBy("salary").orderBy("id")

# Apply window function and filter based on the rank
ranked_df = df.withColumn("row_number", row_number().over(windowSpec))
ranked_df.filter(ranked_df["row_number"] == 1).show()  # Filter to get the first row per partition

4. Using `groupBy()` with window functions

Though groupBy() is usually used for aggregations, you can combine it with window functions. Window functions won’t replace groupBy(), but you can apply them after aggregations.

from pyspark.sql.functions import rank

# Define the window specification
windowSpec = Window.partitionBy("salary").orderBy("id")

# First, group by some column and then apply a window function
grouped_df = df.groupBy("salary").count()
grouped_df.withColumn("rank", rank().over(windowSpec)).show()

5. Using `withColumnRenamed()` with window functions

You can also rename the result of a window function when adding it as a new column.

from pyspark.sql.functions import row_number

# Define the window specification
windowSpec = Window.partitionBy("salary").orderBy("id")

# Apply the window function and rename the column
df.withColumn("row_number", row_number().over(windowSpec)).withColumnRenamed("row_number", "rank").show()

6. Combining multiple window functions in one step

You can apply multiple window functions in a single step using either select() or withColumn().

from pyspark.sql.functions import row_number, rank

# Define the window specification
windowSpec = Window.partitionBy("salary").orderBy("id")

# Apply multiple window functions
df.select("id", "salary",
          row_number().over(windowSpec).alias("row_number"),
          rank().over(windowSpec).alias("rank")
         ).show()

Comparison with Spark SQL Window Functions

All the above operations can also be performed using Spark SQL. Here are the equivalent SQL queries:

1. Row Number

SELECT id, name, salary,
       ROW_NUMBER() OVER (PARTITION BY salary ORDER BY id) AS row_number
FROM df

2. Rank

SELECT id, name, salary,
       RANK() OVER (PARTITION BY salary ORDER BY id) AS rank
FROM df

3. Dense Rank

SELECT id, name, salary,
       DENSE_RANK() OVER (PARTITION BY salary ORDER BY id) AS dense_rank
FROM df

4. Percent Rank

SELECT id, name, salary,
       PERCENT_RANK() OVER (PARTITION BY salary ORDER BY id) AS percent_rank
FROM df

5. NTile

SELECT id, name, salary,
       NTILE(4) OVER (PARTITION BY salary ORDER BY id) AS ntile
FROM df

6. Lag

SELECT id, name, salary,
       LAG(salary, 1) OVER (PARTITION BY salary ORDER BY id) AS lag
FROM df

7. Lead

SELECT id, name, salary,
       LEAD(salary, 1) OVER (PARTITION BY salary ORDER BY id) AS lead
FROM df

8. Cumulative Sum

SELECT id, name, salary,
       SUM(salary) OVER (PARTITION BY salary ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum
FROM df

9. Moving Average

SELECT id, name, salary,
       AVG(salary) OVER (PARTITION BY salary ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS moving_avg
FROM df

Head to the Next Page

← Spark SQL windows Function and Best UsecasesDate and Time Functions- Pyspark Dataframes & Pyspark Sql Queries →

Written By

undefined

Python ALL Eyes on Strings- String Data Type & For Loop Combined

Feb 5, 2025 | Python

Here’s a comprehensive Python string function cheat sheet in tabular format: FunctionSyntaxDescriptionExampleReturn Typecapitalizestr.capitalize()Capitalizes the first character of the string.”hello”.capitalize() → “Hello”strcasefoldstr.casefold()Converts to…

How to Solve a Coding Problem in Python? Step to Step Guide?

Feb 1, 2025 | Python

Solving coding problems efficiently requires a structured approach. Here’s a step-by-step guide along with shortcuts and pseudocode tips. 📌 Step 1: Understand the Problem Clearly Read the problem statement carefully Identify: Input format (list, string, integer,…

Python Built-in Iterables: Complete Guide with Use Cases & Challenges

Feb 1, 2025 | Python

What are Iterables? An iterable is any object that can return an iterator, meaning it can be looped over using for loops or passed to functions like map(), filter(), etc. 🔹 List of Built-in Iterables in Python Python provides several built-in iterable objects:…

Window functions in PySpark on Dataframe programming

1. Using `select()` with window functions

2. Using `agg()` with window functions

3. Using `filter()` or `where()`

4. Using `groupBy()` with window functions

5. Using `withColumnRenamed()` with window functions

6. Combining multiple window functions in one step

Comparison with Spark SQL Window Functions

1. Row Number

2. Rank

3. Dense Rank

4. Percent Rank

5. NTile

6. Lag

7. Lead

8. Cumulative Sum

9. Moving Average

Head to the Next Page

Written By

Related Posts

Python ALL Eyes on Strings- String Data Type & For Loop Combined

How to Solve a Coding Problem in Python? Step to Step Guide?

Python Built-in Iterables: Complete Guide with Use Cases & Challenges

Submit a Comment Cancel reply

Explore Our Tutorials

Connect With Us

About HintsToday

Success!

Window functions in PySpark on Dataframe programming

1. Using select() with window functions

2. Using agg() with window functions

3. Using filter() or where()

4. Using groupBy() with window functions

5. Using withColumnRenamed() with window functions

6. Combining multiple window functions in one step

Comparison with Spark SQL Window Functions

1. Row Number

2. Rank

3. Dense Rank

4. Percent Rank

5. NTile

6. Lag

7. Lead

8. Cumulative Sum

9. Moving Average

Head to the Next Page

Written By

Related Posts

Python ALL Eyes on Strings- String Data Type & For Loop Combined

How to Solve a Coding Problem in Python? Step to Step Guide?

Python Built-in Iterables: Complete Guide with Use Cases & Challenges

Submit a Comment Cancel reply

Success!

1. Using `select()` with window functions

2. Using `agg()` with window functions

3. Using `filter()` or `where()`

4. Using `groupBy()` with window functions

5. Using `withColumnRenamed()` with window functions