Here's an enhanced Spark SQL cheatsheet covering join types, set operations (UNION, INTERSECT, EXCEPT), window functions, and table management, including DDL statements (CREATE, ALTER, DROP) and DML statements (INSERT, UPDATE, DELETE). This comprehensive sheet is designed as a quick Spark SQL reference.


| Category | Concept | Syntax / Example | Description |
|---|---|---|---|
| Basic Statements | SELECT | SELECT col1, col2 FROM table WHERE condition; | Retrieves specific columns from a table based on a condition. |
| | DISTINCT | SELECT DISTINCT col FROM table; | Returns unique values in the specified column. |
| | LIMIT | SELECT * FROM table LIMIT 10; | Restricts the number of rows returned by a query. |
| Joins | INNER JOIN | SELECT * FROM t1 JOIN t2 ON t1.id = t2.id; | Returns rows that have matching values in both tables. |
| | LEFT JOIN | SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id; | Returns all rows from the left table and matched rows from the right; unmatched rows get NULL in the right table's columns. |
| | RIGHT JOIN | SELECT * FROM t1 RIGHT JOIN t2 ON t1.id = t2.id; | Returns all rows from the right table and matched rows from the left; unmatched rows get NULL in the left table's columns. |
| | FULL OUTER JOIN | SELECT * FROM t1 FULL JOIN t2 ON t1.id = t2.id; | Returns all rows from both tables, with NULLs where there is no match. |
| | CROSS JOIN | SELECT * FROM t1 CROSS JOIN t2; | Returns the Cartesian product of the two tables. |
| Set Operations | UNION | SELECT * FROM t1 UNION SELECT * FROM t2; | Combines result sets from multiple queries, removing duplicates. |
| | UNION ALL | SELECT * FROM t1 UNION ALL SELECT * FROM t2; | Combines result sets from multiple queries without removing duplicates. |
| | INTERSECT | SELECT * FROM t1 INTERSECT SELECT * FROM t2; | Returns only the rows present in both queries. |
| | EXCEPT | SELECT * FROM t1 EXCEPT SELECT * FROM t2; | Returns rows present in the first query but not in the second. |
| | EXCEPT ALL | SELECT * FROM t1 EXCEPT ALL SELECT * FROM t2; | Returns rows in the first query that aren't in the second, keeping duplicates. |
| Table Management | CREATE TABLE | CREATE TABLE table_name (id INT, name STRING); | Creates a new table with specified columns and data types. |
| | DESCRIBE | DESCRIBE TABLE table_name; | Shows the structure and metadata of a table. |
| | ALTER TABLE | ALTER TABLE table_name ADD COLUMNS (age INT); | Adds columns or modifies a table's structure. |
| | DROP TABLE | DROP TABLE IF EXISTS table_name; | Deletes a table if it exists. |
| | TRUNCATE TABLE | TRUNCATE TABLE table_name; | Removes all rows from a table without deleting the table itself. |
| | INSERT INTO | INSERT INTO table_name VALUES (1, 'name'); | Adds new rows to a table. |
| | INSERT OVERWRITE | INSERT OVERWRITE table_name SELECT * FROM other_table; | Replaces existing data in a table with the results of a query. |
| | UPDATE | UPDATE table_name SET col = 'value' WHERE condition; | Updates specific columns based on a condition (supported for Delta tables; not for plain Parquet/Hive tables). |
| | DELETE | DELETE FROM table_name WHERE condition; | Deletes specific rows based on a condition (supported for Delta tables). |
| Window Functions | row_number() | ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2 DESC) | Assigns a unique sequential number to each row within a partition. |
| | rank() | RANK() OVER (PARTITION BY col ORDER BY col2 DESC) | Assigns a rank to rows within a partition based on the specified ordering, with gaps after ties. |
| | lead(), lag() | LEAD(col) OVER (ORDER BY col2) | Accesses data from the following or preceding row. |
| Data Manipulation Functions | withColumn() | df.withColumn("newCol", df.oldCol + 1) | Adds or replaces a column with the specified expression. |
| | withColumnRenamed() | df.withColumnRenamed("oldName", "newName") | Renames a column. |
| | selectExpr() | df.selectExpr("col AS newCol", "col2 + 1") | Selects columns or expressions using SQL syntax. |
| String Functions | concat() | SELECT concat(col1, col2) FROM table; | Concatenates strings from multiple columns. |
| | substring() | SELECT substring(col, 1, 5) FROM table; | Extracts a substring from a string column. |
| | lower() / upper() | SELECT lower(col) FROM table; | Converts all characters in a string to lowercase or uppercase. |
| Date and Time Functions | current_date() | SELECT current_date(); | Returns the current date. |
| | datediff() | SELECT datediff(end_date, start_date) FROM table; | Returns the difference in days between two dates. |
| | year(), month(), day() | SELECT year(col) FROM table; | Extracts parts of a date. |
| Aggregate Functions | collect_list() | SELECT collect_list(col) FROM table; | Aggregates values into a list for each group. |
| | collect_set() | SELECT collect_set(col) FROM table; | Aggregates values into a set of unique values for each group. |
| | avg(), sum(), count() | SELECT sum(col), count(col) FROM table GROUP BY group_col; | Performs aggregations such as averaging, summing, or counting. |
| Optimization Techniques | cache() | df.cache() | Caches the DataFrame in memory to speed up repeated actions. |
| | repartition() | df.repartition(4, "col") | Redistributes data across partitions for load balancing. |
| | broadcast() | broadcast(df) | Optimizes joins by broadcasting the smaller DataFrame to all nodes (see the sketch after this table). |
| | Predicate Pushdown | spark.sql("SELECT * FROM table WHERE col = 'value'") | Pushes filters down to the data source, reducing the data scanned. |
| UDFs | Register UDF | spark.udf.register("addOne", lambda x: x + 1) | Registers a custom Python function as a UDF. |
| | Using a UDF | SELECT addOne(col) FROM table; | Applies a UDF to a column in Spark SQL. |
| Schema Management | printSchema() | df.printSchema() | Displays the schema of a DataFrame. |
| | schema | df.schema | Returns the schema as a StructType object. |
| | Schema Merge | spark.read.option("mergeSchema", "true") | Merges schemas when reading from multiple files. |
| Complex Types | Arrays | ARRAY<int> | Defines an array type. |
| | Struct | STRUCT<name: STRING, age: INT> | Defines a nested structure. |
| Miscellaneous | monotonically_increasing_id() | SELECT monotonically_increasing_id() AS id FROM table; | Generates unique (but not consecutive) IDs for rows. |
| | input_file_name() | SELECT input_file_name() FROM table; | Retrieves the file name associated with each row. |
| | coalesce() | SELECT coalesce(col1, col2) FROM table; | Returns the first non-null value from the specified columns. |
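
To make the optimization rows concrete, here is a minimal PySpark sketch of cache(), repartition(), and a broadcast join. The DataFrame names and toy data are illustrative assumptions, not part of the original cheatsheet:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("optimization-sketch").getOrCreate()

# Invented toy data: a fact table and a small dimension table.
orders = spark.createDataFrame(
    [(1, "US", 100), (2, "DE", 250), (3, "US", 75)],
    ["order_id", "country", "amount"],
)
countries = spark.createDataFrame(
    [("US", "United States"), ("DE", "Germany")],
    ["country", "name"],
)

orders = orders.repartition(4, "country")  # redistribute by the join key
orders.cache()                             # keep in memory for repeated actions

# Broadcasting the small table lets the join avoid shuffling the large one.
orders.join(broadcast(countries), on="country", how="inner").show()
```
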
| Concept | Description | Syntax / Example |
|---|---|---|
| Basic Select | Retrieves data from a table. | SELECT column1, column2 FROM table; |
| WHERE Clause | Filters records based on conditions. | SELECT * FROM table WHERE condition; |
| Aggregations | Summarizes data (e.g., SUM, COUNT, AVG). | SELECT SUM(column) FROM table GROUP BY column2; |
| Window Functions | Performs calculations across rows, such as cumulative sums, ranks, or row numbers. | SELECT column, SUM(value) OVER (PARTITION BY column ORDER BY date) AS running_total FROM table; |
| Joins | Combines rows from two or more tables based on a related column. | SELECT * FROM table1 JOIN table2 ON table1.id = table2.id; |
| Subqueries | Nested queries for complex operations or transformations. | SELECT * FROM (SELECT column1 FROM table WHERE condition); |
| CTE (WITH Clause) | Temporary result set for improved readability and reuse in complex queries. | WITH temp AS (SELECT column FROM table) SELECT * FROM temp WHERE condition; |
| UNION / UNION ALL | Combines results from multiple SELECT statements. | SELECT column FROM table1 UNION SELECT column FROM table2; |
| Pivot | Converts rows into columns, aggregating values into the new columns. | SELECT * FROM (SELECT * FROM table) PIVOT (SUM(value) FOR column IN ('value1', 'value2')); |
| Unpivot | Converts columns into rows, useful for restructuring wide tables. | SELECT * FROM table UNPIVOT (value FOR name IN (col1, col2, col3)); |
| Views | Virtual table based on a SELECT query; allows simplified access. | CREATE VIEW view_name AS SELECT column FROM table; |
| Temporary Tables | Session-scoped tables, deleted when the session ends. | CREATE TEMPORARY TABLE temp_table AS SELECT * FROM table; |
| Group By with HAVING | Groups data by specified columns and applies conditions to aggregated results. | SELECT column, COUNT(*) FROM table GROUP BY column HAVING COUNT(*) > value; |
| Case Statements | Conditional logic within queries for creating calculated columns. | SELECT column, CASE WHEN condition THEN 'result1' ELSE 'result2' END AS new_column FROM table; |
| Window Frame | Specifies the row range for window functions, often cumulative or sliding (see the sketch after this table). | SUM(column) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW); |
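
The window-frame row is easiest to grasp with a runnable example. A minimal PySpark sketch follows; the sales view and its toy data are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("window-frame-sketch").getOrCreate()

# Invented toy data standing in for a sales table.
spark.createDataFrame(
    [("2024-01-01", 10), ("2024-01-02", 20), ("2024-01-03", 5)],
    ["date", "amount"],
).createOrReplaceTempView("sales")

# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW yields a running total.
spark.sql("""
    SELECT date,
           amount,
           SUM(amount) OVER (
               ORDER BY date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM sales
""").show()
```
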

Spark SQL Functions with Complex Use Cases

| Function | Example Use Case |
|---|---|
| collect_list() | Aggregate column values into a list per group, useful for tracking customer purchase history: SELECT customer_id, collect_list(product) FROM purchases GROUP BY customer_id; |
| concat() | Concatenate multiple address fields into one formatted address: SELECT concat(street, ', ', city, ', ', zip) AS address FROM addresses; |
| row_number() | Number rows within each group, useful for ranking: SELECT *, row_number() OVER (PARTITION BY category ORDER BY sales DESC) AS rank FROM sales_data; |
| date_add() | Calculate future dates, such as a payment due date: SELECT order_id, date_add(order_date, 30) AS due_date FROM orders; |
| CASE WHEN and coalesce() | Assign risk categories while handling nulls: SELECT customer_id, CASE WHEN age > 60 THEN 'high' WHEN age > 30 THEN 'medium' ELSE 'low' END AS risk, coalesce(income, 0) AS income FROM customers; |
| array_contains() | Filter for specific tags in an array column: SELECT * FROM posts WHERE array_contains(tags, 'pyspark'); |
| explode() | Expand array items into individual rows (see the sketch after this table): SELECT order_id, explode(items) AS item FROM orders; |
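
As a runnable illustration of the last two rows, here is a minimal PySpark sketch; the orders data with an array column is an invented example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("array-funcs-sketch").getOrCreate()

# Invented toy data: orders with an array column of item tags.
spark.createDataFrame(
    [(1, ["spark", "pyspark"]), (2, ["sql"])],
    ["order_id", "items"],
).createOrReplaceTempView("orders")

# explode() turns each array element into its own row.
spark.sql("SELECT order_id, explode(items) AS item FROM orders").show()

# array_contains() filters rows on array membership.
spark.sql("SELECT * FROM orders WHERE array_contains(items, 'pyspark')").show()
```
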

Conditional Aggregation in Spark SQL

Here's an example of conditional aggregation using a CASE expression:

SELECT 
  SUM(CASE WHEN age > 30 THEN 1 ELSE 0 END) AS count_over_30,
  SUM(CASE WHEN age <= 30 THEN 1 ELSE 0 END) AS count_under_30
FROM 
  customers;

In this example, the CASE expression sums conditionally: each row with age greater than 30 contributes 1 to the first counter (otherwise 0), and vice versa for the second.

Using the IF Function

Alternatively, you can use the IF function:

SELECT 
  SUM(IF(age > 30, 1, 0)) AS count_over_30,
  SUM(IF(age <= 30, 1, 0)) AS count_under_30
FROM 
  customers;
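
Both queries translate directly to the DataFrame API. Here is a minimal sketch, with invented toy data standing in for the customers table:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("conditional-agg-sketch").getOrCreate()

# Invented toy data standing in for the customers table.
customers = spark.createDataFrame(
    [(1, 25), (2, 45), (3, 31), (4, 30)],
    ["customer_id", "age"],
)

# F.when(...).otherwise(...) mirrors CASE WHEN / IF inside the aggregation.
customers.agg(
    F.sum(F.when(F.col("age") > 30, 1).otherwise(0)).alias("count_over_30"),
    F.sum(F.when(F.col("age") <= 30, 1).otherwise(0)).alias("count_under_30"),
).show()
```
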

Spark SQL Commands to Manage Views

Here are the Spark SQL commands to create, drop, and inspect views:

Creating Views

1. Creating Virtual Views

CREATE VIEW my_view AS SELECT * FROM my_table;

2. Creating Temporary Views

CREATE TEMPORARY VIEW my_temp_view AS SELECT * FROM my_table;

3. Creating or Replacing Temporary Views

CREATE OR REPLACE TEMPORARY VIEW my_temp_view AS SELECT * FROM my_table;

Deleting Views

1. Dropping a View if It Exists

DROP VIEW IF EXISTS my_temp_view;

Checking Views

1. Checking a View's Definition and Metadata

DESCRIBE FORMATTED my_view;

2. Checking the Extended Query Plan

EXPLAIN EXTENDED SELECT * FROM my_view;

3. Checking in Spark Catalog

SHOW TABLES IN my_database;  -- lists all tables and views
SHOW VIEWS IN my_database;   -- lists only views

Note:

  • CREATE VIEW creates a virtual view, which is a stored query that doesn’t store data.
  • CREATE TEMPORARY VIEW creates a temporary view that is only available in the current Spark session.
  • CREATE OR REPLACE TEMPORARY VIEW creates or replaces a temporary view.
  • DROP VIEW IF EXISTS deletes a view if it exists.
  • DESCRIBE FORMATTED shows a view's metadata, including the stored view definition.
  • EXPLAIN EXTENDED shows the extended query plan of a view.
  • SHOW TABLES and SHOW VIEWS list tables and views in the Spark catalog.
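
The DataFrame API offers an equivalent route to temporary views. A minimal PySpark sketch (the DataFrame contents are an invented example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("views-sketch").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Equivalent to CREATE OR REPLACE TEMPORARY VIEW my_temp_view AS SELECT ...
df.createOrReplaceTempView("my_temp_view")

spark.sql("SELECT * FROM my_temp_view WHERE id > 1").show()

# Temporary views appear in the session's catalog listing.
print(spark.catalog.listTables())
```
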

Here are the different types of joins in Spark SQL, along with examples:

1. Inner Join

Returns only the rows that have a match in both tables.

SELECT *
FROM table1
INNER JOIN table2
ON table1.id = table2.id;

2. Left Outer Join (or Left Join)

Returns all the rows from the left table and the matching rows from the right table. If there's no match, the right table's columns contain NULL values.

SELECT *
FROM table1
LEFT JOIN table2
ON table1.id = table2.id;

3. Right Outer Join (or Right Join)

Similar to the left outer join, but returns all the rows from the right table and the matching rows from the left table.

SELECT *
FROM table1
RIGHT JOIN table2
ON table1.id = table2.id;

4. Full Outer Join (or Full Join)

Returns all the rows from both tables, with NULL values in the columns where there are no matches.

SELECT *
FROM table1
FULL OUTER JOIN table2
ON table1.id = table2.id;

5. Semi Join

Returns only the rows from the left table that have a match in the right table; columns from the right table are not included in the output.

SELECT *
FROM table1
LEFT SEMI JOIN table2
ON table1.id = table2.id;

6. Anti Join

Returns only the rows from the left table that do not have a match in the right table.

SELECT *
FROM table1
LEFT ANTI JOIN table2
ON table1.id = table2.id;

7. Cross Join

Returns the Cartesian product of both tables.

SELECT *
FROM table1
CROSS JOIN table2;

Note: Spark SQL also supports using the USING clause to specify the join condition, like this:

SELECT *
FROM table1
JOIN table2
USING (id);
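
For reference, the same join types are available in the DataFrame API via the how argument. A minimal sketch with invented toy tables:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-types-sketch").getOrCreate()

# Invented toy tables.
t1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "v1"])
t2 = spark.createDataFrame([(2, "x"), (3, "y"), (4, "z")], ["id", "v2"])

t1.join(t2, on="id", how="inner").show()
t1.join(t2, on="id", how="left").show()        # left outer
t1.join(t2, on="id", how="right").show()       # right outer
t1.join(t2, on="id", how="full").show()        # full outer
t1.join(t2, on="id", how="left_semi").show()   # matching left rows, left columns only
t1.join(t2, on="id", how="left_anti").show()   # left rows without a match
t1.crossJoin(t2).show()                        # Cartesian product
```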
