Here is how PIVOT and UNPIVOT work in Spark SQL and PySpark, using the official syntax.

1. PIVOT in Spark SQL

The PIVOT operation in Spark SQL converts rows into columns based on the distinct values of a pivot column. It is useful for summarizing and reshaping data into a wide, tabular format.

Syntax:

SELECT * FROM table_name
  PIVOT (
    <aggregate_function>(<column_name>)
    FOR <pivot_column> IN (<pivot_values>)
  )

Example: Pivoting Sales Data

Consider the following sales table:

region  product    sales_amount  year
North   Product A  100           2023
North   Product B  150           2023
South   Product A  200           2023
South   Product B  250           2023

You can pivot this data by product to create columns for Product A and Product B.

SELECT * FROM (
    SELECT region, product, sales_amount FROM sales_data
  )
  PIVOT (
    SUM(sales_amount)
    FOR product IN ('Product A', 'Product B')
  )

Result:

region  Product A  Product B
North   100        150
South   200        250

In this query:

  • We pivot on the product column, creating two new columns: Product A and Product B.
  • The aggregation function SUM() summarizes sales_amount for each product.
  • Every column not referenced in the PIVOT clause becomes an implicit grouping column, which is why the subquery keeps only region, product, and sales_amount: dropping year ensures the output is grouped by region alone.
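
The same reshaping is available through the PySpark DataFrame API. Below is a minimal sketch, assuming an active SparkSession and the sales_data table shown above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes sales_data is registered as a table or view.
df = spark.table("sales_data")

# groupBy().pivot().sum() mirrors the SQL PIVOT: one output column per
# listed product value, aggregated with SUM. Passing the value list
# explicitly saves Spark a pass over the data to discover distinct values.
pivoted = (
    df.groupBy("region")
    .pivot("product", ["Product A", "Product B"])
    .sum("sales_amount")
)
pivoted.show()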

2. UNPIVOT in Spark SQL

The UNPIVOT operation in Spark SQL converts columns back into rows. It’s typically used to transform a wide table back into a long (tall) one. Note that the native UNPIVOT syntax is available from Spark 3.4 onwards; a stack()-based alternative for older releases is sketched later in this post.

Syntax:

SELECT * FROM table_name
  UNPIVOT (
    <values_column>
    FOR <name_column> IN (<columns_to_unpivot>)
  )

Here, <values_column> names the new column that receives the cell values, and <name_column> names the new column that receives the former column names.

Example: Unpivoting the Pivoted Sales Data

Let’s take the pivoted table from the previous example and unpivot it back into the original long format (minus the year column, which we dropped before pivoting).

region  Product A  Product B
North   100        150
South   200        250

We want to unpivot the Product A and Product B columns into rows.

SELECT * FROM pivoted_sales
  UNPIVOT (
    sales_amount
    FOR product IN (`Product A`, `Product B`)
  )

Result:

region  product    sales_amount
North   Product A  100
North   Product B  150
South   Product A  200
South   Product B  250

In this query:

  • We convert the columns Product A and Product B back into rows, with product as a new column holding the former column names.
  • sales_amount holds the value for each unpivoted row.
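
The DataFrame API offers the same operation via DataFrame.unpivot, added in Spark 3.4. A minimal sketch, reusing the pivoted DataFrame from the PIVOT example above:

# unpivot() turns the two product columns back into (product, sales_amount)
# rows, keeping region as an identifier column.
unpivoted = pivoted.unpivot(
    ids=["region"],
    values=["Product A", "Product B"],
    variableColumnName="product",
    valueColumnName="sales_amount",
)
unpivoted.show()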

SAS PROC TRANSPOSE and Its Equivalent in PySpark

SAS PROC TRANSPOSE is commonly used for pivoting and unpivoting in SAS. As shown above, the same functionality is available in PySpark through the PIVOT and UNPIVOT operations.

SAS PROC TRANSPOSE Example:

PROC TRANSPOSE DATA=sales_data OUT=pivoted_sales;
    BY region;        /* requires sales_data to be sorted by region */
    ID product;       /* values of product become the new column names */
    VAR sales_amount; /* values that populate the new columns */
RUN;

Equivalent Spark SQL PIVOT:

SELECT * FROM (
    SELECT region, product, sales_amount FROM sales_data
  )
  PIVOT (
    SUM(sales_amount)
    FOR product IN ('Product A', 'Product B')
  )

SAS PROC TRANSPOSE for Unpivot:

To unpivot in SAS, you typically run another PROC TRANSPOSE step to reverse the transposition. In Spark SQL, the UNPIVOT operation does it directly:

SELECT * FROM pivoted_sales
  UNPIVOT (
    sales_amount
    FOR product IN (`Product A`, `Product B`)
  )
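
The UNPIVOT syntax requires Spark 3.4 or later. On older releases, the stack() generator expresses the same transformation. A sketch, assuming the pivoted_sales table above and an active SparkSession:

# stack(n, ...) emits n rows per input row; each ('label', `column`) pair
# becomes one (product, sales_amount) row.
unpivoted = spark.sql("""
    SELECT region,
           stack(2,
                 'Product A', `Product A`,
                 'Product B', `Product B`) AS (product, sales_amount)
    FROM pivoted_sales
""")
unpivoted.show()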

  • PySpark and Spark SQL support PIVOT and UNPIVOT natively, and they behave much like their counterparts in other databases and in SAS PROC TRANSPOSE.
  • PIVOT reshapes data by turning rows into columns; UNPIVOT reverses the process, converting columns back into rows.

Refer to the official Spark SQL documentation on PIVOT and UNPIVOT for more details.

