A running total is the cumulative sum of values in a column, calculated progressively across rows in a specific order. In Spark SQL, you can compute this by using the SUM function as a window function with an OVER clause.
Example in Spark SQL:
Assume we have a table sales with columns date and amount. To calculate a running total of amount:
SELECT
date,
amount,
SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM
sales;
The SUM function with the OVER clause computes a cumulative total ordered by the date column; the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame makes explicit that each row's total covers everything from the first row up to and including the current row.
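For intuition, suppose sales contains the following hypothetical rows; the running total accumulates down the date order:

date        amount  running_total
2024-01-01  100     100
2024-01-02  50      150
2024-01-03  75      225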
In the PySpark DataFrame API:
from pyspark.sql import Window
import pyspark.sql.functions as F

# Define a window spec ordered by the sort column, spanning from the first row to the current row
window_spec = Window.orderBy("order_column").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Calculate the running total over that window
df = df.withColumn("running_total", F.sum("amount_column").over(window_spec))
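For context, here is a minimal self-contained sketch of the same pattern, assuming a hypothetical DataFrame with date and amount columns to mirror the SQL example above:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("running_total_example").getOrCreate()

# Hypothetical sample data matching the sales table above
df = spark.createDataFrame(
    [("2024-01-01", 100.0), ("2024-01-02", 50.0), ("2024-01-03", 75.0)],
    ["date", "amount"],
)

# Cumulative sum from the first row up to and including the current row, ordered by date
window_spec = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn("running_total", F.sum("amount").over(window_spec))
df.show()
# running_total column: 100.0, 150.0, 225.0

Note that rowsBetween(...) spells out a ROWS frame. With only orderBy and no explicit frame, Spark defaults to a RANGE frame, which would sum all rows sharing the same date together rather than advancing one row at a time.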