A running total is the cumulative sum of values in a column, calculated progressively across rows in a specific order. In Spark SQL, you can compute this by using the SUM function as a window function with an OVER clause.
Example in Spark SQL:
Assume we have a table sales with columns date and amount. To calculate a running total of amount:
SELECT
date,
amount,
SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM
sales;
The SUM function with the OVER clause computes a cumulative total ordered by the date column; the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame makes explicit that each row's total covers everything from the first row up to and including the current row.
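For intuition, suppose sales contains the following hypothetical rows; the running total accumulates down the date order:

date        amount  running_total
2024-01-01  100     100
2024-01-02  50      150
2024-01-03  75      225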
In the PySpark DataFrame API:
from pyspark.sql import Window
import pyspark.sql.functions as F

# Define a window spec ordered by the sort column, spanning from the first row to the current row
window_spec = Window.orderBy("order_column").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Calculate the running total over that window
df = df.withColumn("running_total", F.sum("amount_column").over(window_spec))
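For context, here is a minimal self-contained sketch of the same pattern, assuming a hypothetical DataFrame with date and amount columns to mirror the SQL example above:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("running_total_example").getOrCreate()

# Hypothetical sample data matching the sales table above
df = spark.createDataFrame(
    [("2024-01-01", 100.0), ("2024-01-02", 50.0), ("2024-01-03", 75.0)],
    ["date", "amount"],
)

# Cumulative sum from the first row up to and including the current row, ordered by date
window_spec = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn("running_total", F.sum("amount").over(window_spec))
df.show()
# running_total column: 100.0, 150.0, 225.0

Note that rowsBetween(...) spells out a ROWS frame. With only orderBy and no explicit frame, Spark defaults to a RANGE frame, which would sum all rows sharing the same date together rather than advancing one row at a time.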