Here's an enhanced Spark SQL cheatsheet with additional details, covering join types, union types, and set operations such as `EXCEPT` and `INTERSECT`, along with table-management operations (DDL and DML statements like `CREATE`, `ALTER`, `INSERT`, `UPDATE`, `DELETE`, etc.). This comprehensive sheet is designed as a quick Spark SQL reference.
| Category | Concept | Syntax / Example | Description |
|---|---|---|---|
| Basic Statements | SELECT | `SELECT col1, col2 FROM table WHERE condition;` | Retrieves specific columns from a table based on a condition. |
| | DISTINCT | `SELECT DISTINCT col FROM table;` | Returns unique values in the specified column. |
| | LIMIT | `SELECT * FROM table LIMIT 10;` | Restricts the number of rows returned by a query. |
| Joins | INNER JOIN | `SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;` | Returns rows that have matching values in both tables. |
| | LEFT JOIN | `SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id;` | Returns all rows from the left table and matched rows from the right table; unmatched rows get NULL in columns from the right. |
| | RIGHT JOIN | `SELECT * FROM t1 RIGHT JOIN t2 ON t1.id = t2.id;` | Returns all rows from the right table and matched rows from the left table; unmatched rows get NULL in columns from the left. |
| | FULL OUTER JOIN | `SELECT * FROM t1 FULL JOIN t2 ON t1.id = t2.id;` | Returns rows with a match in either table, including unmatched rows from both sides. |
| | CROSS JOIN | `SELECT * FROM t1 CROSS JOIN t2;` | Returns the Cartesian product of the two tables. |
| Set Operations | UNION | `SELECT * FROM t1 UNION SELECT * FROM t2;` | Combines result sets from multiple queries, removing duplicates. |
| | UNION ALL | `SELECT * FROM t1 UNION ALL SELECT * FROM t2;` | Combines result sets from multiple queries without removing duplicates. |
| | INTERSECT | `SELECT * FROM t1 INTERSECT SELECT * FROM t2;` | Returns only the rows present in both queries. |
| | EXCEPT | `SELECT * FROM t1 EXCEPT SELECT * FROM t2;` | Returns rows present in the first query but not in the second. |
| | EXCEPT ALL | `SELECT * FROM t1 EXCEPT ALL SELECT * FROM t2;` | Returns all rows in the first query that aren't in the second, including duplicates. |
| Table Management | CREATE TABLE | `CREATE TABLE table_name (id INT, name STRING);` | Creates a new table with specified columns and data types. |
| | DESCRIBE | `DESCRIBE TABLE table_name;` | Shows the structure and metadata of a table. |
| | ALTER TABLE | `ALTER TABLE table_name ADD COLUMNS (age INT);` | Adds columns or modifies a table's structure. |
| | DROP TABLE | `DROP TABLE IF EXISTS table_name;` | Deletes a table if it exists. |
| | TRUNCATE TABLE | `TRUNCATE TABLE table_name;` | Removes all rows from a table without deleting the table structure. |
| | INSERT INTO | `INSERT INTO table_name VALUES (1, 'name');` | Adds new rows to a table. |
| | INSERT OVERWRITE | `INSERT OVERWRITE table_name SELECT * FROM other_table;` | Replaces existing data in a table with the results of a query. |
| | UPDATE | `UPDATE table_name SET col = 'value' WHERE condition;` | Updates specific columns based on a condition (requires a table format that supports it, such as Delta Lake). |
| | DELETE | `DELETE FROM table_name WHERE condition;` | Deletes specific rows based on a condition (requires a table format that supports it, such as Delta Lake). |
| Window Functions | row_number() | `ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2 DESC)` | Assigns a unique number to each row within a partition. |
| | rank() | `RANK() OVER (PARTITION BY col ORDER BY col2 DESC)` | Assigns a rank to rows within a partition based on the specified column(s). |
| | lead(), lag() | `LEAD(col) OVER (ORDER BY col2)` | Accesses data from the following or preceding row. |
| Data Manipulation Functions | withColumn() | `df.withColumn("newCol", df.oldCol + 1)` | Adds or replaces a column with the specified expression. |
| | withColumnRenamed() | `df.withColumnRenamed("oldName", "newName")` | Renames a column. |
| | selectExpr() | `df.selectExpr("col AS newCol", "col2 + 1")` | Selects columns or expressions using SQL syntax. |
| String Functions | concat() | `SELECT concat(col1, col2) FROM table;` | Concatenates strings from multiple columns. |
| | substring() | `SELECT substring(col, 1, 5) FROM table;` | Extracts a substring from a string column. |
| | lower() / upper() | `SELECT lower(col) FROM table;` | Converts all characters in a string to lowercase or uppercase. |
| Date and Time Functions | current_date() | `SELECT current_date();` | Returns the current date. |
| | datediff() | `SELECT datediff(end_date, start_date) FROM table;` | Returns the difference in days between two dates. |
| | year(), month(), day() | `SELECT year(col) FROM table;` | Extracts parts of a date. |
| Aggregate Functions | collect_list() | `SELECT collect_list(col) FROM table;` | Aggregates values into a list for each group. |
| | collect_set() | `SELECT collect_set(col) FROM table;` | Aggregates values into a unique set for each group. |
| | avg(), sum(), count() | `SELECT sum(col), count(col) FROM table GROUP BY group_col;` | Performs aggregations such as averaging, summing, or counting. |
| Optimization Techniques | cache() | `df.cache()` | Caches the DataFrame in memory to speed up repeated actions. |
| | repartition() | `df.repartition(4, "col")` | Redistributes data across partitions for load balancing. |
| | broadcast() | `broadcast(df)` | Optimizes joins by broadcasting the smaller DataFrame to all nodes. |
| | Predicate Pushdown | `spark.sql("SELECT * FROM table WHERE col = 'value'")` | Pushes filters down to the data source, reducing the data scanned. |
| UDFs | Register UDF | `spark.udf.register("addOne", lambda x: x + 1)` | Registers a custom Python function as a UDF. |
| | Using a UDF | `SELECT addOne(col) FROM table;` | Applies a UDF to a column in Spark SQL. |
| Schema Management | printSchema() | `df.printSchema()` | Displays the schema of a DataFrame. |
| | schema | `df.schema` | Returns the schema as a StructType object. |
| | Schema Merge | `spark.read.option("mergeSchema", "true")` | Merges schemas when reading from multiple files. |
| Complex Types | Arrays | `ARRAY<INT>` | Defines an array column type. |
| | Struct | `STRUCT<name: STRING, age: INT>` | Defines a nested structure. |
| Miscellaneous | monotonically_increasing_id() | `SELECT monotonically_increasing_id() AS id FROM table;` | Generates unique (not necessarily consecutive) IDs for rows. |
| | input_file_name() | `SELECT input_file_name() FROM table;` | Retrieves the file name associated with each row. |
| | coalesce() | `SELECT coalesce(col1, col2) FROM table;` | Returns the first non-null value from the specified columns. |
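The set operations in the table above follow standard SQL semantics, so they can be sketched outside Spark with Python's built-in sqlite3 module (the tables `t1` and `t2` and their values are invented for the demo):

```python
# Illustrates UNION vs UNION ALL vs INTERSECT vs EXCEPT on two small
# tables, using sqlite3 as a stand-in for any standard SQL engine.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1(x INT);
    CREATE TABLE t2(x INT);
    INSERT INTO t1 VALUES (1), (2), (2), (3);
    INSERT INTO t2 VALUES (2), (3), (4);
""")

union     = [r[0] for r in con.execute("SELECT x FROM t1 UNION SELECT x FROM t2 ORDER BY x")]
union_all = [r[0] for r in con.execute("SELECT x FROM t1 UNION ALL SELECT x FROM t2 ORDER BY x")]
intersect = [r[0] for r in con.execute("SELECT x FROM t1 INTERSECT SELECT x FROM t2 ORDER BY x")]
except_   = [r[0] for r in con.execute("SELECT x FROM t1 EXCEPT SELECT x FROM t2 ORDER BY x")]

print(union)      # [1, 2, 3, 4]        -- duplicates removed
print(union_all)  # [1, 2, 2, 2, 3, 3, 4]
print(intersect)  # [2, 3]
print(except_)    # [1]
```

Note that `EXCEPT ALL` is a Spark extension and is not available in sqlite, which is why it is left out of the sketch.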
| Concept | Description | Syntax / Example |
|---|---|---|
| Basic Select | Retrieves data from a table. | `SELECT column1, column2 FROM table;` |
| WHERE Clause | Filters records based on conditions. | `SELECT * FROM table WHERE condition;` |
| Aggregations | Summarizes data (e.g., SUM, COUNT, AVG). | `SELECT SUM(column) FROM table GROUP BY column2;` |
| Window Functions | Performs calculations across rows, such as cumulative sums, ranks, or row numbers. | `SELECT column, SUM(value) OVER (PARTITION BY column ORDER BY date) AS running_total FROM table;` |
| Joins | Combines rows from two or more tables based on a related column. | `SELECT * FROM table1 JOIN table2 ON table1.id = table2.id;` |
| Subqueries | Nested queries for complex operations or transformations. | `SELECT * FROM (SELECT column1 FROM table WHERE condition);` |
| CTE (WITH Clause) | Temporary result set for improved readability and reuse in complex queries. | `WITH temp AS (SELECT column FROM table) SELECT * FROM temp WHERE condition;` |
| UNION / UNION ALL | Combines results from multiple SELECT statements. | `SELECT column FROM table1 UNION SELECT column FROM table2;` |
| Pivot | Converts rows into columns for specified values (aggregates into columns). | `SELECT * FROM (SELECT * FROM table) PIVOT (SUM(value) FOR column IN ('value1', 'value2'));` |
| Unpivot | Converts columns into rows, useful for restructuring wide tables. | `SELECT * FROM table UNPIVOT (value FOR name IN (col1, col2, col3));` |
| Views | Virtual table based on a SELECT query; allows simplified access. | `CREATE VIEW view_name AS SELECT column FROM table;` |
| Temporary Tables | Session-scoped tables, dropped when the session ends. | `CREATE TEMPORARY TABLE temp_table AS SELECT * FROM table;` |
| GROUP BY with HAVING | Groups data by specified columns and applies conditions to aggregated data. | `SELECT column, COUNT(*) FROM table GROUP BY column HAVING COUNT(*) > value;` |
| Case Statements | Conditional logic within SQL queries for creating calculated columns. | `SELECT column, CASE WHEN condition THEN 'result1' ELSE 'result2' END AS new_column FROM table;` |
| Window Frame | Specifies the range for window functions, often cumulative or sliding (ROWS BETWEEN clauses). | `SUM(column) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` |
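The CTE and window-frame rows above can be combined into one runnable sketch. The syntax is standard SQL, so Python's built-in sqlite3 demonstrates it without a Spark cluster (the `sales` table is invented):

```python
# A CTE feeding a running-total window function with an explicit
# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales(day INT, amount INT);
    INSERT INTO sales VALUES (1, 10), (2, 20), (3, 5);
""")

rows = con.execute("""
    WITH s AS (SELECT day, amount FROM sales)
    SELECT day,
           SUM(amount) OVER (ORDER BY day
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
               AS running_total
    FROM s
    ORDER BY day
""").fetchall()

print(rows)  # [(1, 10), (2, 30), (3, 35)]
```

The same query runs unchanged in `spark.sql(...)`.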
Spark SQL Functions with Complex Use Cases
Function | Example Use Case |
---|---|
collect_list() | Aggregate column values into a list for each group, useful in tracking customer purchase history: SELECT customer_id, collect_list(product) FROM purchases GROUP BY customer_id; |
concat() | Concatenate multiple address fields into one formatted address: SELECT concat(street, ', ', city, ', ', zip) AS address FROM addresses; |
row_number() | Number rows within each group, useful for ranking: SELECT *, row_number() OVER (PARTITION BY category ORDER BY sales DESC) AS rank FROM sales_data; |
date_add() | Calculate future dates, such as a payment due date: SELECT order_id, date_add(order_date, 30) AS due_date FROM orders; |
CASE with coalesce() | Assign risk categories while handling nulls: SELECT customer_id, CASE WHEN age > 60 THEN 'high' WHEN age > 30 THEN 'medium' ELSE 'low' END AS risk, coalesce(income, 0) AS income FROM customers; (in the DataFrame API, the equivalent is when(...).when(...).otherwise(...)) |
array_contains() | Filter for specific tags in an array column: SELECT * FROM posts WHERE array_contains(tags, 'pyspark'); |
explode() | Expand array items into individual rows: SELECT order_id, explode(items) AS item FROM orders; |
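The `row_number()` use case from the table can be checked with Python's built-in sqlite3, which supports the same standard window syntax as Spark SQL (the `sales_data` rows are invented):

```python
# ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) ranks rows
# independently within each category.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales_data(category TEXT, product TEXT, sales INT);
    INSERT INTO sales_data VALUES
        ('toys', 'car', 50), ('toys', 'doll', 80),
        ('books', 'novel', 30), ('books', 'comic', 70);
""")

rows = con.execute("""
    SELECT category, product,
           ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS rn
    FROM sales_data
    ORDER BY category, rn
""").fetchall()

print(rows)
# [('books', 'comic', 1), ('books', 'novel', 2),
#  ('toys', 'doll', 1), ('toys', 'car', 2)]
```

Array-specific functions such as `array_contains()` and `explode()` are Spark-specific and have no sqlite counterpart, so they are left out of the sketch.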
Conditional Aggregation in Spark SQL
Here's an example of conditional aggregation:
```sql
SELECT
    SUM(CASE WHEN age > 30 THEN 1 ELSE 0 END) AS count_over_30,
    SUM(CASE WHEN age <= 30 THEN 1 ELSE 0 END) AS count_under_30
FROM
    customers;
```
In this example, a CASE expression conditionally sums the values: if the age is greater than 30 we add 1, otherwise we add 0.
Using the IF Function
Alternatively, you can use the IF function:
```sql
SELECT
    SUM(IF(age > 30, 1, 0)) AS count_over_30,
    SUM(IF(age <= 30, 1, 0)) AS count_under_30
FROM
    customers;
```
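The CASE variant above can be verified outside Spark with Python's built-in sqlite3, since conditional aggregation with CASE is standard SQL (the `customers` ages are invented; Spark's `IF(cond, a, b)` maps to sqlite's `IIF`):

```python
# Counts rows on each side of a condition inside a single aggregation pass.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers(age INT);
    INSERT INTO customers VALUES (25), (35), (45), (30);
""")

over, under = con.execute("""
    SELECT SUM(CASE WHEN age > 30 THEN 1 ELSE 0 END),
           SUM(CASE WHEN age <= 30 THEN 1 ELSE 0 END)
    FROM customers
""").fetchone()

print(over, under)  # 2 2  (35 and 45 are over 30; 25 and 30 are not)
```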
Spark SQL Commands to Manage Views
Here are the commands for creating, checking, and deleting views:
Creating Views
1. Creating a Virtual View
```sql
CREATE VIEW my_view AS SELECT * FROM my_table;
```
2. Creating a Temporary View
```sql
CREATE TEMPORARY VIEW my_temp_view AS SELECT * FROM my_table;
```
3. Creating or Replacing a Temporary View
```sql
CREATE OR REPLACE TEMPORARY VIEW my_temp_view AS SELECT * FROM my_table;
```
Deleting Views
1. Deleting a Temporary View if It Exists
```sql
DROP VIEW IF EXISTS my_temp_view;
```
Checking Views
1. Checking a View's DDL
```sql
DESCRIBE FORMATTED my_view;
```
2. Checking the Extended Query Plan
```sql
EXPLAIN EXTENDED SELECT * FROM my_view;
```
3. Checking the Spark Catalog
```sql
SHOW TABLES IN my_database; -- lists all tables and views
SHOW VIEWS IN my_database;  -- lists only views
```
Note:
- `CREATE VIEW` creates a virtual view, which is a stored query that doesn't store data.
- `CREATE TEMPORARY VIEW` creates a temporary view that is only available in the current Spark session.
- `CREATE OR REPLACE TEMPORARY VIEW` creates or replaces a temporary view.
- `DROP VIEW IF EXISTS` deletes a view if it exists.
- `DESCRIBE FORMATTED` shows the DDL of a view.
- `EXPLAIN EXTENDED` shows the extended query plan of a view.
- `SHOW TABLES` and `SHOW VIEWS` list tables and views in the Spark catalog.
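The view lifecycle (create, query, drop) is standard SQL, so it can be sketched with Python's built-in sqlite3; sqlite's `CREATE TEMP VIEW` plays the role of Spark's `CREATE TEMPORARY VIEW` here, and both are stored queries rather than stored data (table and view names are invented):

```python
# Create a temp view over a base table, query it, then drop it.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE my_table(id INT);
    INSERT INTO my_table VALUES (1), (2);
    CREATE TEMP VIEW my_temp_view AS SELECT * FROM my_table WHERE id > 1;
""")

rows = con.execute("SELECT * FROM my_temp_view").fetchall()
print(rows)  # [(2,)]  -- the view filters out id = 1

con.execute("DROP VIEW IF EXISTS my_temp_view")
# IF EXISTS makes a second drop a harmless no-op.
con.execute("DROP VIEW IF EXISTS my_temp_view")
```

Spark-specific inspection commands (`DESCRIBE FORMATTED`, `EXPLAIN EXTENDED`, `SHOW VIEWS`) have no direct sqlite equivalent and are omitted.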
Here are the different types of joins in Spark SQL, along with examples:
1. Inner Join
Returns only the rows that have a match in both tables.
```sql
SELECT *
FROM table1
INNER JOIN table2
ON table1.id = table2.id;
```
2. Left Outer Join (or Left Join)
Returns all the rows from the left table and the matching rows from the right table. If there’s no match, the result will contain NULL values.
```sql
SELECT *
FROM table1
LEFT JOIN table2
ON table1.id = table2.id;
```
3. Right Outer Join (or Right Join)
Similar to the left outer join, but returns all the rows from the right table and the matching rows from the left table.
```sql
SELECT *
FROM table1
RIGHT JOIN table2
ON table1.id = table2.id;
```
4. Full Outer Join (or Full Join)
Returns all the rows from both tables, with NULL values in the columns where there are no matches.
```sql
SELECT *
FROM table1
FULL OUTER JOIN table2
ON table1.id = table2.id;
```
5. Semi Join
Returns only the rows from the left table that have a match in the right table; only columns from the left table appear in the output.
```sql
SELECT *
FROM table1
LEFT SEMI JOIN table2
ON table1.id = table2.id;
```
6. Anti Join
Returns only the rows from the left table that do not have a match in the right table.
```sql
SELECT *
FROM table1
LEFT ANTI JOIN table2
ON table1.id = table2.id;
```
7. Cross Join
Returns the Cartesian product of both tables.
```sql
SELECT *
FROM table1
CROSS JOIN table2;
```
Note: Spark SQL also supports the USING clause to specify the join condition when both tables share the join column's name:
```sql
SELECT *
FROM table1
JOIN table2
USING (id);
```
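`LEFT SEMI JOIN` and `LEFT ANTI JOIN` are Spark-specific keywords; in engines without them the same results come from `EXISTS` / `NOT EXISTS`. A sketch using Python's built-in sqlite3 on invented tables:

```python
# Semi join: left rows WITH a match on the right.
# Anti join: left rows WITHOUT a match on the right.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1(id INT);
    CREATE TABLE table2(id INT);
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES (2), (3), (4);
""")

semi = con.execute("""
    SELECT id FROM table1 t1
    WHERE EXISTS (SELECT 1 FROM table2 t2 WHERE t2.id = t1.id)
    ORDER BY id
""").fetchall()

anti = con.execute("""
    SELECT id FROM table1 t1
    WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t2.id = t1.id)
    ORDER BY id
""").fetchall()

print(semi)  # [(2,), (3,)]  -- ids present in both tables
print(anti)  # [(1,)]        -- ids only in table1
```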