Here’s a categorized Spark SQL function reference, which organizes common Spark SQL functions by functionality. This can help with selecting the right function based on the operation you want to perform.
1. Aggregate Functions
| Function | Description | Example | 
|---|---|---|
| avg() | Calculates the average value. | SELECT avg(age) FROM table; | 
| count() | Counts the number of rows. | SELECT count(*) FROM table; | 
| max() | Finds the maximum value. | SELECT max(salary) FROM table; | 
| min() | Finds the minimum value. | SELECT min(age) FROM table; | 
| sum() | Calculates the sum of a column. | SELECT sum(salary) FROM table; | 
| stddev() | Calculates the standard deviation. | SELECT stddev(salary) FROM table; | 
| variance() | Calculates the variance. | SELECT variance(salary) FROM table; | 
2. Analytic Functions
| Function | Description | Example | 
|---|---|---|
| row_number() | Assigns a unique number to each row in a window. | ROW_NUMBER() OVER (PARTITION BY city) | 
| rank() | Assigns a rank to each row in a partition. | RANK() OVER (ORDER BY salary DESC) | 
| dense_rank() | Similar to rank but without gaps. | DENSE_RANK() OVER (ORDER BY age ASC) | 
| ntile(n) | Divides rows into nbuckets. | NTILE(4) OVER (ORDER BY age) | 
| lead() | Accesses a row after the current row. | LEAD(salary, 1) OVER (ORDER BY age) | 
| lag() | Accesses a row before the current row. | LAG(salary, 1) OVER (ORDER BY age) | 
3. String Functions
| Function | Description | Example | 
|---|---|---|
| concat() | Concatenates multiple strings. | SELECT concat(first_name, last_name) FROM table; | 
| substring() | Extracts a substring from a string. | SELECT substring(name, 1, 3) FROM table; | 
| length() | Returns the length of a string. | SELECT length(name) FROM table; | 
| lower() | Converts string to lowercase. | SELECT lower(name) FROM table; | 
| upper() | Converts string to uppercase. | SELECT upper(name) FROM table; | 
| trim() | Trims spaces from both ends of a string. | SELECT trim(name) FROM table; | 
| replace() | Replaces a substring within a string. | SELECT replace(name, 'a', 'b') FROM table; | 
| split() | Splits a string into an array. | SELECT split(email, '@') FROM table; | 
4. Date and Time Functions
| Function | Description | Example | 
|---|---|---|
| current_date() | Returns the current date. | SELECT current_date(); | 
| current_timestamp() | Returns the current timestamp. | SELECT current_timestamp(); | 
| datediff() | Returns difference in days between two dates. | SELECT datediff(date1, date2) FROM table; | 
| year(),month(),day() | Extracts year, month, day from date. | SELECT year(birthdate) FROM table; | 
| date_add() | Adds days to a date. | SELECT date_add(date, 10) FROM table; | 
| date_sub() | Subtracts days from a date. | SELECT date_sub(date, 10) FROM table; | 
| to_date() | Converts string to date. | SELECT to_date(string_date) FROM table; | 
| to_timestamp() | Converts string to timestamp. | SELECT to_timestamp(string_timestamp) FROM table; | 
5. Mathematical Functions
| Function | Description | Example | 
|---|---|---|
| abs() | Returns absolute value. | SELECT abs(-10) FROM table; | 
| ceil() | Rounds up to the nearest integer. | SELECT ceil(salary) FROM table; | 
| floor() | Rounds down to the nearest integer. | SELECT floor(salary) FROM table; | 
| round() | Rounds to a specified number of decimal places. | SELECT round(salary, 2) FROM table; | 
| sqrt() | Returns the square root. | SELECT sqrt(age) FROM table; | 
| pow() | Returns a number raised to a power. | SELECT pow(salary, 2) FROM table; | 
| exp() | Returns e^x (exponential). | SELECT exp(age) FROM table; | 
| log() | Returns the logarithm of a number. | SELECT log(salary) FROM table; | 
6. Array Functions
| Function | Description | Example | 
|---|---|---|
| array() | Creates an array from multiple values. | SELECT array('a', 'b', 'c'); | 
| size() | Returns the number of elements in an array. | SELECT size(array_column) FROM table; | 
| array_contains() | Checks if an array contains a specified value. | SELECT array_contains(array_column, 'value') FROM table; | 
| explode() | Creates a new row for each element in the array. | SELECT explode(array_column) FROM table; | 
| sort_array() | Sorts the elements of an array in ascending order. | SELECT sort_array(array_column) FROM table; | 
| array_distinct() | Removes duplicate values from an array. | SELECT array_distinct(array_column) FROM table; | 
7. Conditional Functions
| Function | Description | Example | 
|---|---|---|
| when() | Conditional expression that works like an if-else. | SELECT when(age > 18, 'adult').otherwise('minor') FROM table; | 
| coalesce() | Returns the first non-null value. | SELECT coalesce(col1, col2) FROM table; | 
| ifnull() | Returns the second value if the first is null. | SELECT ifnull(col1, 'unknown') FROM table; | 
| nullif() | Returns NULL if the two values are equal. | SELECT nullif(col1, col2) FROM table; | 
| nvl() | Replaces NULL with a specified value. | SELECT nvl(col, 'default') FROM table; | 
8. Miscellaneous Functions
| Function | Description | Example | 
|---|---|---|
| lit() | Converts a literal value to a column. | SELECT lit(1) FROM table; | 
| rand() | Generates a random number between 0 and 1. | SELECT rand() FROM table; | 
| monotonically_increasing_id() | Returns a unique ID for each row. | SELECT monotonically_increasing_id() FROM table; | 
| input_file_name() | Returns the file name of the source data. | SELECT input_file_name() FROM table; | 
This categorized list provides a quick reference for Spark SQL functions based on what kind of operation they perform, making it useful for development and troubleshooting in Spark SQL queries.
The collect_list() function is categorized under Aggregate Functions in Spark SQL. It aggregates data by collecting values into a list within each group, without removing duplicates. Here’s a quick example and then an exploration of some interesting, complex use cases for various functions, including collect_list().
1. Advanced Use Cases of Aggregate Functions
collect_list()
| Function | Example | Description | 
|---|---|---|
| collect_list() | SELECT id, collect_list(name) FROM table GROUP BY id; | Collects all values of a column into a list for each group, preserving duplicates. | 
Complex Use Case: If you have a dataset where each customer has multiple orders, and you want to get a list of order dates for each customer:
SELECT customer_id, collect_list(order_date) AS order_dates
FROM orders
GROUP BY customer_id;
This is useful in generating lists of values within groups, such as viewing the product purchase history of customers or tracking all updates to a particular row over time.
2. String Functions with Complex Use Cases
concat() and replace()
| Function | Example | Description | 
|---|---|---|
| concat() | SELECT concat(city, ', ', state) AS location FROM table; | Joins multiple columns or strings together. | 
| replace() | SELECT replace(phone, '-', '') AS phone_no FROM table; | Replaces parts of a string based on a pattern. | 
Complex Use Case: Concatenating multiple address fields to form a single address and cleaning up data with replace():
SELECT concat(street, ', ', city, ', ', replace(zip, '-', '')) AS full_address
FROM addresses;
This would be useful for standardizing or preparing addresses for mailing systems by merging fields and removing unnecessary characters.
3. Analytic Functions with Interesting Examples
row_number(), rank(), lead(), and lag()
| Function | Example | Description | 
|---|---|---|
| row_number() | ROW_NUMBER() OVER (PARTITION BY city ORDER BY age DESC) | Assigns a unique number to each row in a window. | 
| lead() | LEAD(salary, 1) OVER (ORDER BY age) | Retrieves the next row’s value in the current row’s column. | 
| lag() | LAG(salary, 1) OVER (ORDER BY age) | Retrieves the previous row’s value in the current row’s column. | 
Complex Use Case: Track sales growth over time by calculating month-over-month difference in sales:
SELECT month, sales,
       sales - LAG(sales, 1) OVER (ORDER BY month) AS sales_diff
FROM sales_data;
This lets you analyze trends or identify dips and peaks in performance by using the previous row’s data directly.
4. Date and Time Functions for Advanced Operations
date_add(), datediff(), year()
| Function | Example | Description | 
|---|---|---|
| date_add() | SELECT date_add(order_date, 30) AS due_date FROM table; | Adds a specific number of days to a date. | 
| datediff() | SELECT datediff(due_date, order_date) AS days_to_ship FROM table; | Calculates the difference in days between two dates. | 
| year(),month() | SELECT year(birthdate) AS birth_year FROM table; | Extracts parts of a date. | 
Complex Use Case: Calculate the monthly retention rate by finding customers who ordered in consecutive months:
SELECT customer_id,
       month(order_date) AS month,
       count(*) AS orders_this_month,
       LEAD(month(order_date), 1) OVER (PARTITION BY customer_id ORDER BY order_date) - month(order_date) = 1 AS retained
FROM orders
GROUP BY customer_id, month;
This example is powerful for retention analysis, determining if customers return month after month.
5. Array Functions for Multi-Value Column Manipulation
array(), array_contains(), explode()
| Function | Example | Description | 
|---|---|---|
| array() | SELECT array(name, email) AS contact FROM table; | Combines multiple columns into an array. | 
| array_contains() | SELECT array_contains(tags, 'Spark') AS has_spark FROM table; | Checks if an array contains a value. | 
| explode() | SELECT id, explode(items) AS item FROM orders; | Expands array elements into individual rows. | 
Complex Use Case: Splitting tags into individual rows for better indexing and searching:
SELECT post_id, explode(tags) AS tag
FROM blog_posts;
This approach can help with filtering, analytics, or search functionalities in scenarios where each tag or attribute of an item needs to be analyzed individually.
6. Conditional Functions for Complex Case Logic
when(), coalesce(), ifnull()
| Function | Example | Description | 
|---|---|---|
| when() | SELECT when(age > 18, 'adult').otherwise('minor') AS category FROM table; | Implements conditional logic similar to if-else. | 
| coalesce() | SELECT coalesce(col1, col2) AS result FROM table; | Returns the first non-null value from a list of columns. | 
| ifnull() | SELECT ifnull(name, 'unknown') FROM table; | Replaces NULLvalues with a specified value. | 
Complex Use Case: Assign customers to risk categories based on age and income, while handling null values with coalesce():
SELECT customer_id,
       when(age >= 60, 'high risk')
       .when(age >= 30, 'medium risk')
       .otherwise('low risk') AS risk_category,
       coalesce(income, 0) AS income
FROM customers;
This is especially useful in customer segmentation, where you may need to categorize data based on multiple conditions and handle null values simultaneously.
7. Miscellaneous Functions for Unique Use Cases
monotonically_increasing_id(), input_file_name()
| Function | Example | Description | 
|---|---|---|
| monotonically_increasing_id() | SELECT monotonically_increasing_id() AS id FROM table; | Generates unique IDs for rows without requiring a primary key. | 
| input_file_name() | SELECT input_file_name() FROM table; | Returns the file name from which a row originated. | 
Complex Use Case: Track records to their source file in an ETL process, useful in error tracking:
SELECT *, input_file_name() AS source_file
FROM data_files;
This helps you trace errors back to their data origin, making it valuable in complex ETL jobs involving multiple data sources.
These advanced use cases highlight the flexibility and power of Spark SQL functions when applied to complex data manipulation, making it easier to handle various analytical scenarios and transformations.
Leave a Reply