Here’s an enhanced Spark SQL cheatsheet covering join types, set operations such as UNION, EXCEPT, and INTERSECT, and table management (DDL and DML operations like CREATE, ALTER, INSERT, UPDATE, and DELETE). This comprehensive sheet is designed to serve as a quick Spark SQL reference.
| Category | Concept | Syntax / Example | Description |
|---|---|---|---|
| Basic Statements | `SELECT` | `SELECT col1, col2 FROM table WHERE condition;` | Retrieves specific columns from a table based on a condition. |
| | `DISTINCT` | `SELECT DISTINCT col FROM table;` | Returns unique values in the specified column. |
| | `LIMIT` | `SELECT * FROM table LIMIT 10;` | Restricts the number of rows returned by a query. |
| Joins | `INNER JOIN` | `SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;` | Returns rows that have matching values in both tables. |
| | `LEFT JOIN` | `SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id;` | Returns all rows from the left table and matched rows from the right table; unmatched rows get NULL in columns from the right. |
| | `RIGHT JOIN` | `SELECT * FROM t1 RIGHT JOIN t2 ON t1.id = t2.id;` | Returns all rows from the right table and matched rows from the left table; unmatched rows get NULL in columns from the left. |
| | `FULL OUTER JOIN` | `SELECT * FROM t1 FULL JOIN t2 ON t1.id = t2.id;` | Returns all rows from both tables, matching them where possible and filling in NULL where there is no match. |
| | `CROSS JOIN` | `SELECT * FROM t1 CROSS JOIN t2;` | Returns the Cartesian product of the two tables. |
| Set Operations | `UNION` | `SELECT * FROM t1 UNION SELECT * FROM t2;` | Combines result sets from multiple queries, removing duplicates. |
| | `UNION ALL` | `SELECT * FROM t1 UNION ALL SELECT * FROM t2;` | Combines result sets from multiple queries without removing duplicates. |
| | `INTERSECT` | `SELECT * FROM t1 INTERSECT SELECT * FROM t2;` | Returns only the rows present in both queries. |
| | `EXCEPT` | `SELECT * FROM t1 EXCEPT SELECT * FROM t2;` | Returns rows present in the first query but not in the second. |
| | `EXCEPT ALL` | `SELECT * FROM t1 EXCEPT ALL SELECT * FROM t2;` | Returns rows in the first query that aren’t in the second, preserving duplicates. |
| Table Management | `CREATE TABLE` | `CREATE TABLE table_name (id INT, name STRING);` | Creates a new table with the specified columns and data types. |
| | `DESCRIBE` | `DESCRIBE TABLE table_name;` | Shows the structure and metadata of a table. |
| | `ALTER TABLE` | `ALTER TABLE table_name ADD COLUMNS (age INT);` | Adds columns or modifies a table’s structure. |
| | `DROP TABLE` | `DROP TABLE IF EXISTS table_name;` | Deletes a table if it exists. |
| | `TRUNCATE TABLE` | `TRUNCATE TABLE table_name;` | Removes all rows from a table without deleting the table itself. |
| | `INSERT INTO` | `INSERT INTO table_name VALUES (1, 'name');` | Adds new rows to a table. |
| | `INSERT OVERWRITE` | `INSERT OVERWRITE table_name SELECT * FROM other_table;` | Replaces existing data in a table with the results of a query. |
| | `UPDATE` | `UPDATE table_name SET col = 'value' WHERE condition;` | Updates specific columns based on a condition (requires a table format that supports it, such as Delta Lake). |
| | `DELETE` | `DELETE FROM table_name WHERE condition;` | Deletes specific rows based on a condition (likewise supported on formats such as Delta Lake). |
| Window Functions | `row_number()` | `ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2 DESC)` | Assigns a unique sequential number to each row within a partition. |
| | `rank()` | `RANK() OVER (PARTITION BY col ORDER BY col2 DESC)` | Assigns a rank to rows within a partition; ties share a rank, leaving gaps after them. |
| | `lead()`, `lag()` | `LEAD(col) OVER (ORDER BY col2)` | Accesses data from the following or preceding row. |
| Data Manipulation Functions | `withColumn()` | `df.withColumn("newCol", df.oldCol + 1)` | Adds or replaces a column with the specified expression. |
| | `withColumnRenamed()` | `df.withColumnRenamed("oldName", "newName")` | Renames a column. |
| | `selectExpr()` | `df.selectExpr("col AS newCol", "col2 + 1")` | Selects columns or expressions using SQL syntax. |
| String Functions | `concat()` | `SELECT concat(col1, col2) FROM table;` | Concatenates strings from multiple columns. |
| | `substring()` | `SELECT substring(col, 1, 5) FROM table;` | Extracts a substring from a string column (1-based start position and length). |
| | `lower()` / `upper()` | `SELECT lower(col) FROM table;` | Converts all characters in a string to lowercase or uppercase. |
| Date and Time Functions | `current_date()` | `SELECT current_date();` | Returns the current date. |
| | `datediff()` | `SELECT datediff(end_date, start_date) FROM table;` | Returns the difference in days between two dates. |
| | `year()`, `month()`, `day()` | `SELECT year(col) FROM table;` | Extracts parts of a date. |
| Aggregate Functions | `collect_list()` | `SELECT collect_list(col) FROM table;` | Aggregates values into a list for each group (duplicates kept, order not guaranteed). |
| | `collect_set()` | `SELECT collect_set(col) FROM table;` | Aggregates values into a set of distinct values for each group. |
| | `avg()`, `sum()`, `count()` | `SELECT sum(col), count(col) FROM table GROUP BY group_col;` | Performs aggregations such as averaging, summing, or counting. |
| Optimization Techniques | `cache()` | `df.cache()` | Caches the DataFrame in memory to speed up repeated actions. |
| | `repartition()` | `df.repartition(4, "col")` | Redistributes data across partitions for load balancing. |
| | `broadcast()` | `broadcast(df)` | Optimizes joins by broadcasting a small DataFrame to all nodes. |
| | Predicate Pushdown | `spark.sql("SELECT * FROM table WHERE col = 'value'")` | Pushes filters down to the data source, reducing the data scanned. |
| UDFs | Register UDF | `spark.udf.register("addOne", lambda x: x + 1)` | Registers a custom Python function as a UDF. |
| | Using a UDF | `SELECT addOne(col) FROM table;` | Applies a registered UDF to a column in Spark SQL. |
| Schema Management | `printSchema()` | `df.printSchema()` | Displays the schema of a DataFrame. |
| | `schema` | `df.schema` | Returns the schema as a `StructType` object. |
| | Schema Merge | `spark.read.option("mergeSchema", "true")` | Merges schemas when reading from multiple files. |
| Complex Types | Arrays | `ARRAY<INT>` | Defines an array type. |
| | Struct | `STRUCT<name: STRING, age: INT>` | Defines a nested structure. |
| Miscellaneous | `monotonically_increasing_id()` | `SELECT monotonically_increasing_id() AS id FROM table;` | Generates unique, monotonically increasing (but not consecutive) IDs for rows. |
| | `input_file_name()` | `SELECT input_file_name() FROM table;` | Retrieves the source file name associated with each row. |
| | `coalesce()` | `SELECT coalesce(col1, col2) FROM table;` | Returns the first non-null value from the specified columns. |
These categories and examples serve as a detailed reference for Spark SQL operations, covering foundational concepts, complex manipulations, and optimizations, ideal for a quick refresher. Let me know if you’d like any further details on specific topics or functions!
Spark SQL Functions with Complex Use Cases
| Function | Example Use Case |
|---|---|
| `collect_list()` | Aggregate column values into a list per group, useful for tracking customer purchase history: `SELECT customer_id, collect_list(product) FROM purchases GROUP BY customer_id;` |
| `concat()` | Concatenate multiple address fields into one formatted address: `SELECT concat(street, ', ', city, ', ', zip) AS address FROM addresses;` |
| `row_number()` | Number rows within each group, useful for ranking: `SELECT *, row_number() OVER (PARTITION BY category ORDER BY sales DESC) AS rank FROM sales_data;` |
| `date_add()` | Calculate future dates, such as a payment due date: `SELECT order_id, date_add(order_date, 30) AS due_date FROM orders;` |
| `CASE WHEN` and `coalesce()` | Assign risk categories while handling nulls (the DataFrame-API equivalent is `when().otherwise()`): `SELECT customer_id, CASE WHEN age > 60 THEN 'high' WHEN age > 30 THEN 'medium' ELSE 'low' END AS risk, coalesce(income, 0) AS income FROM customers;` |
| `array_contains()` | Filter for specific tags in an array column: `SELECT * FROM posts WHERE array_contains(tags, 'pyspark');` |
| `explode()` | Expand array items into individual rows: `SELECT order_id, explode(items) AS item FROM orders;` |
These entries capture essential Spark SQL commands and operations for data handling, transformations, optimizations, and schema management, offering a quick-reference sheet for both simple and complex use cases. Let me know if you’d like further examples or explanations on specific functions!