PIVOT Clause in Spark SQL, MySQL, Oracle PL/SQL, and Hive QL
The PIVOT clause is a powerful tool in SQL that allows you to rotate rows into columns, making it easier to analyze and report data. Here’s how to use the PIVOT clause (or emulate it where it isn’t supported) in Spark SQL, MySQL, Oracle PL/SQL, and Hive QL:
Spark SQL
In Spark SQL, you can use the PIVOT clause to rotate rows into columns.
SELECT *
FROM (
SELECT column1, column2, value
FROM your_table
)
PIVOT (
SUM(value) FOR column2 IN (${list_of_values})
) AS pivot_table;
Replace ${list_of_values} with a comma-separated list of values from column2 that you want to pivot.
Example:
SELECT *
FROM (
SELECT name, year, sales
FROM sales_table
)
PIVOT (
SUM(sales) FOR year IN (2018, 2019, 2020)
) AS pivot_table;
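Each value in the IN list can also be aliased, and the alias becomes the output column name (a minimal sketch reusing the sales_table example above):
SELECT *
FROM (
  SELECT name, year, sales
  FROM sales_table
)
PIVOT (
  -- each alias becomes the name of the corresponding output column
  SUM(sales) FOR year IN (2018 AS sales_2018, 2019 AS sales_2019, 2020 AS sales_2020)
) AS pivot_table;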
MySQL
MySQL does not support the PIVOT clause directly. However, you can achieve the same result using conditional aggregation.
SELECT
column1,
SUM(IF(column2 = 'value1', value, 0)) AS value1,
SUM(IF(column2 = 'value2', value, 0)) AS value2,
...
FROM your_table
GROUP BY column1;
Replace value1, value2, etc. with the values from column2 that you want to pivot.
Example:
SELECT
name,
SUM(IF(year = 2018, sales, 0)) AS sales_2018,
SUM(IF(year = 2019, sales, 0)) AS sales_2019,
SUM(IF(year = 2020, sales, 0)) AS sales_2020
FROM sales_table
GROUP BY name;
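When the pivot values are not known in advance, a common MySQL pattern is to build the pivot query dynamically with GROUP_CONCAT and a prepared statement (a minimal sketch, assuming the same sales_table):
-- Build one SUM(IF(...)) expression per distinct year
SET @sql = NULL;
SELECT GROUP_CONCAT(DISTINCT
  CONCAT('SUM(IF(year = ', year, ', sales, 0)) AS sales_', year)
) INTO @sql
FROM sales_table;
-- Note: group_concat_max_len may need to be raised for many distinct values
SET @sql = CONCAT('SELECT name, ', @sql, ' FROM sales_table GROUP BY name');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;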
Oracle PL/SQL
In Oracle (11g and later), you can use the PIVOT clause directly.
SELECT *
FROM (
SELECT column1, column2, value
FROM your_table
)
PIVOT (
SUM(value) FOR column2 IN (${list_of_values})
);
Replace ${list_of_values} with a comma-separated list of values from column2 that you want to pivot.
Example:
SELECT *
FROM (
SELECT name, year, sales
FROM sales_table
)
PIVOT (
SUM(sales) FOR year IN (2018, 2019, 2020)
);
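Oracle’s IN list must normally be a fixed set of literals; the PIVOT XML variant accepts a subquery instead, at the cost of returning the pivoted data as an XMLType column (a sketch based on the same example):
SELECT *
FROM (
  SELECT name, year, sales
  FROM sales_table
)
PIVOT XML (
  -- a subquery is allowed here only in the XML variant
  SUM(sales) FOR year IN (SELECT DISTINCT year FROM sales_table)
);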
Hive QL
Hive QL does not support the PIVOT clause natively. As with MySQL, you can achieve the same result using conditional aggregation (CASE WHEN works across Hive versions):
SELECT
column1,
SUM(CASE WHEN column2 = 'value1' THEN value ELSE 0 END) AS value1,
SUM(CASE WHEN column2 = 'value2' THEN value ELSE 0 END) AS value2,
...
FROM your_table
GROUP BY column1;
Replace value1, value2, etc. with the values from column2 that you want to pivot.
Example:
SELECT
name,
SUM(CASE WHEN year = 2018 THEN sales ELSE 0 END) AS sales_2018,
SUM(CASE WHEN year = 2019 THEN sales ELSE 0 END) AS sales_2019,
SUM(CASE WHEN year = 2020 THEN sales ELSE 0 END) AS sales_2020
FROM sales_table
GROUP BY name;
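Hive likewise has no UNPIVOT clause; the usual workaround is to explode a map built from the wide columns (a sketch, assuming a hypothetical pivoted_sales table with sales_2018, sales_2019, and sales_2020 columns):
-- Build a map of year -> value, then explode it into one row per entry
SELECT name, year, sales
FROM pivoted_sales
LATERAL VIEW explode(map(
  '2018', sales_2018,
  '2019', sales_2019,
  '2020', sales_2020
)) t AS year, sales;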
Common considerations
- Make sure to replace column1, column2, value, and your_table with your actual column names and table name.
- The list of values for the IN clause should be enclosed in parentheses and separated by commas.
- The PIVOT clause can be combined with other SQL clauses, such as WHERE, GROUP BY, and HAVING; see the filtering sketch after this list.
- Be mindful of data types and formatting when using the PIVOT clause.
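For instance, filtering typically happens in the subquery that feeds the pivot (a sketch using the earlier sales_table example):
SELECT *
FROM (
  SELECT name, year, sales
  FROM sales_table
  WHERE year >= 2018  -- filter rows before they are pivoted
)
PIVOT (
  SUM(sales) FOR year IN (2018, 2019, 2020)
);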
When to use PIVOT:
- When you need to transform rows into columns for reporting or analysis purposes.
- When you need to create a crosstab or matrix report.
- When you need to perform aggregations on multiple categories.
When not to use PIVOT:
- When the number of distinct values in the pivot column is very large.
- When the data is too complex or nested.
- When you need to perform complex calculations or transformations.
This is how PIVOT and UNPIVOT work in Spark SQL, using the official syntax.
1. PIVOT in Spark SQL / PySpark
The PIVOT operation in Spark SQL allows you to convert rows into columns based on the values in a specific column. It helps to summarize and reshape data in a tabular format.
Syntax:
SELECT * FROM table_name
PIVOT (
<aggregate_function>(<column_name>)
FOR <pivot_column> IN (<pivot_values>)
)
Example: Pivoting Sales Data
Consider the following sales table:
| region | product | sales_amount | year |
|---|---|---|---|
| North | Product A | 100 | 2023 |
| North | Product B | 150 | 2023 |
| South | Product A | 200 | 2023 |
| South | Product B | 250 | 2023 |
You can pivot this data by product to create columns for Product A and Product B.
SELECT * FROM sales_data
PIVOT (
SUM(sales_amount)
FOR product IN ('Product A', 'Product B')
)
Result:
| region | Product A | Product B |
|---|---|---|
| North | 100 | 150 |
| South | 200 | 250 |
In this query:
- We are pivoting the product column, creating two new columns: Product A and Product B.
- The aggregation function used is SUM(), which summarizes the sales_amount for each product.
- Columns not referenced in the PIVOT clause (here, region) act as implicit grouping columns.
2. UNPIVOT in Spark SQL
The UNPIVOT operation in Spark SQL (available since Spark 3.4) allows you to convert columns into rows. It’s typically used to transform a wide table back into a tall one.
Syntax:
SELECT * FROM table_name
UNPIVOT (
<column_name>
FOR <unpivot_column> IN (<columns_to_unpivot>)
)
Example: Unpivoting the Pivoted Sales Data
Let’s take the pivoted table from the previous example and unpivot it back to its original format.
| region | Product A | Product B |
|---|---|---|
| North | 100 | 150 |
| South | 200 | 250 |
We want to unpivot the Product A and Product B columns back into rows.
SELECT * FROM pivoted_sales
UNPIVOT (
sales_amount
FOR product IN (`Product A`, `Product B`)
)
Result:
| region | product | sales_amount |
|---|---|---|
| North | Product A | 100 |
| North | Product B | 150 |
| South | Product A | 200 |
| South | Product B | 250 |
In this query:
- We’re converting the columns Product A and Product B back into rows, with product as a new column.
- The sales_amount column holds the value for each unpivoted row.
The examples that follow are adapted from the official Spark SQL documentation, starting with a small person table:
CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
(100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');
SELECT * FROM person
PIVOT (
SUM(age) AS a, AVG(class) AS c
FOR name IN ('John' AS john, 'Mike' AS mike)
);
+------+-----------+---------+---------+---------+---------+
| id | address | john_a | john_c | mike_a | mike_c |
+------+-----------+---------+---------+---------+---------+
| 200 | Street 2 | NULL | NULL | NULL | NULL |
| 100 | Street 1 | 30 | 1.0 | NULL | NULL |
| 300 | Street 3 | NULL | NULL | 80 | 3.0 |
| 400 | Street 4 | NULL | NULL | NULL | NULL |
+------+-----------+---------+---------+---------+---------+
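Multiple columns can also be pivoted together by listing them as a tuple in the FOR clause, with one alias per tuple of values: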
SELECT * FROM person
PIVOT (
SUM(age) AS a, AVG(class) AS c
FOR (name, age) IN (('John', 30) AS c1, ('Mike', 40) AS c2)
);
+------+-----------+-------+-------+-------+-------+
| id | address | c1_a | c1_c | c2_a | c2_c |
+------+-----------+-------+-------+-------+-------+
| 200 | Street 2 | NULL | NULL | NULL | NULL |
| 100 | Street 1 | 30 | 1.0 | NULL | NULL |
| 300 | Street 3 | NULL | NULL | NULL | NULL |
| 400 | Street 4 | NULL | NULL | NULL | NULL |
+------+-----------+-------+-------+-------+-------+
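The UNPIVOT examples below use a wide quarterly sales table: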
CREATE TABLE sales_quarterly (year INT, q1 INT, q2 INT, q3 INT, q4 INT);
INSERT INTO sales_quarterly VALUES
(2020, null, 1000, 2000, 2500),
(2021, 2250, 3200, 4200, 5900),
(2022, 4200, 3100, null, null);
-- column names are used as unpivot columns
SELECT * FROM sales_quarterly
UNPIVOT (
sales FOR quarter IN (q1, q2, q3, q4)
);
+------+---------+-------+
| year | quarter | sales |
+------+---------+-------+
| 2020 | q2 | 1000 |
| 2020 | q3 | 2000 |
| 2020 | q4 | 2500 |
| 2021 | q1 | 2250 |
| 2021 | q2 | 3200 |
| 2021 | q3 | 4200 |
| 2021 | q4 | 5900 |
| 2022 | q1 | 4200 |
| 2022 | q2 | 3100 |
+------+---------+-------+
-- NULL values are excluded by default, but they can be included
-- unpivot columns can be aliased
-- the unpivot result can be referenced via its alias
SELECT up.* FROM sales_quarterly
UNPIVOT INCLUDE NULLS (
sales FOR quarter IN (q1 AS Q1, q2 AS Q2, q3 AS Q3, q4 AS Q4)
) AS up;
+------+---------+-------+
| year | quarter | sales |
+------+---------+-------+
| 2020 | Q1 | NULL |
| 2020 | Q2 | 1000 |
| 2020 | Q3 | 2000 |
| 2020 | Q4 | 2500 |
| 2021 | Q1 | 2250 |
| 2021 | Q2 | 3200 |
| 2021 | Q3 | 4200 |
| 2021 | Q4 | 5900 |
| 2022 | Q1 | 4200 |
| 2022 | Q2 | 3100 |
| 2022 | Q3 | NULL |
| 2022 | Q4 | NULL |
+------+---------+-------+
-- multiple value columns can be unpivoted per row
SELECT * FROM sales_quarterly
UNPIVOT EXCLUDE NULLS (
(first_quarter, second_quarter)
FOR half_of_the_year IN (
(q1, q2) AS H1,
(q3, q4) AS H2
)
);
+------+------------------+---------------+----------------+
| year | half_of_the_year | first_quarter | second_quarter |
+------+------------------+---------------+----------------+
| 2020 | H1 | NULL | 1000 |
| 2020 | H2 | 2000 | 2500 |
| 2021 | H1 | 2250 | 3200 |
| 2021 | H2 | 4200 | 5900 |
| 2022 | H1 | 4200 | 3100 |
+------+------------------+---------------+----------------+
-- Examples adapted from the official Spark SQL documentation
SAS PROC TRANSPOSE and Its Equivalent in PySpark
SAS PROC TRANSPOSE is commonly used for pivoting and unpivoting in SAS. As we now know, we can achieve the same functionality in Spark SQL (and hence PySpark) with the PIVOT and UNPIVOT operations.
SAS PROC TRANSPOSE Example:
PROC TRANSPOSE DATA=sales_data OUT=pivoted_sales;
BY region;
ID product;
VAR sales_amount;
RUN;
Equivalent Spark SQL PIVOT:
SELECT * FROM sales_data
PIVOT (
SUM(sales_amount)
FOR product IN ('Product A', 'Product B')
)
SAS PROC TRANSPOSE for Unpivot:
To unpivot in SAS, you can reverse the process with multiple steps. In Spark SQL, you can use the UNPIVOT operation directly:
SELECT * FROM pivoted_sales
UNPIVOT (
sales_amount
FOR product IN (`Product A`, `Product B`)
)
- PySpark and Spark SQL have PIVOT and UNPIVOT functionality, and they can be used in much the same way as equivalent features in other databases or tools such as SAS PROC TRANSPOSE.
- The PIVOT operation reshapes data by turning rows into columns, while UNPIVOT reverses this process by converting columns back into rows.
We can refer to the official Spark SQL documentation for more details.