What is running total and how to calculate in spark sql / Pyspark?


Team AHT (Keymaster)

A running total is the cumulative sum of values in a column, calculated progressively across rows in a specific order. In Spark SQL, you can compute this using the SUM function with an OVER clause that defines the window.

    Example in Spark SQL:
    Assume we have a table sales with columns date and amount. To calculate a running total of amount:

    SELECT
      date,
      amount,
      SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
    FROM sales;

    The SUM function with the OVER clause computes a cumulative total from the first row through the current row, following the ordering on the date column.
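    The same standard window-function syntax works in other SQL engines too. As an illustration you can run an equivalent query against an in-memory SQLite database (3.25+ supports window functions); the table and values below are hypothetical sample data, not from the original post:

    ```python
    import sqlite3

    # In-memory database with a hypothetical sales table (illustrative data)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (date TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("2024-01-01", 100.0), ("2024-01-02", 50.0), ("2024-01-03", 25.0)],
    )

    # The same running-total window function as in the Spark SQL example
    rows = conn.execute("""
        SELECT date, amount,
               SUM(amount) OVER (ORDER BY date
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
        FROM sales
    """).fetchall()

    for date, amount, running_total in rows:
        print(date, amount, running_total)
    ```

    Each row's running_total is the sum of its own amount plus all earlier amounts in date order (100.0, then 150.0, then 175.0).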

    In the PySpark DataFrame API:

    from pyspark.sql import Window
    import pyspark.sql.functions as F

    # Define a window spec with ordering
    window_spec = Window.orderBy("order_column").rowsBetween(Window.unboundedPreceding, Window.currentRow)

    # Calculate the running total
    df = df.withColumn("running_total", F.sum("amount_column").over(window_spec))
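    The frame rowsBetween(Window.unboundedPreceding, Window.currentRow) means each row's result is the sum of every value from the first ordered row through the current one. Outside Spark, the same semantics can be sketched in plain Python with itertools.accumulate (the amounts below are hypothetical sample values):

    ```python
    from itertools import accumulate

    # Hypothetical amounts, already sorted by the ordering column
    amounts = [100.0, 50.0, 25.0]

    # accumulate yields, at each position, the sum of all values from the
    # start through that position -- mirroring
    # ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    running_totals = list(accumulate(amounts))
    print(running_totals)  # [100.0, 150.0, 175.0]
    ```

    This is only an in-memory sketch of the window semantics; Spark performs the same computation distributed across partitions of the DataFrame.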
