PySpark Date Functions in Spark SQL, with Explanations and Examples
In PySpark, you can use built-in date functions directly within SQL queries to manipulate and extract data from timestamp, date, and string columns. PySpark's SQL API provides a wide range of date functions, similar to standard SQL, for formatting, adding or subtracting time intervals, and extracting specific components from dates.
Here are some of the most commonly used PySpark date functions in SQL queries:
1. Current Date and Time Functions
- current_date() – Returns the current date.
- current_timestamp() – Returns the current timestamp (date and time).
- now() – Alias for current_timestamp().
Example:
SELECT current_date() AS today_date,
current_timestamp() AS current_time
FROM your_table
2. Date and Time Arithmetic Functions
- date_add(date, num_days) – Adds num_days days to the given date.
- date_sub(date, num_days) – Subtracts num_days days from the given date.
- add_months(date, num_months) – Adds num_months months to the given date.
- datediff(end_date, start_date) – Returns the number of days between end_date and start_date.
- months_between(end_date, start_date) – Returns the number of months between two dates.
Example:
SELECT date_add(current_date(), 5) AS five_days_from_now,
date_sub(current_date(), 10) AS ten_days_ago,
add_months(current_date(), 3) AS three_months_later,
datediff('2024-12-31', '2024-01-01') AS days_diff
FROM your_table
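The same day and difference arithmetic can be sanity-checked with Python's standard datetime module (a plain-Python sketch, not Spark code; the dates are arbitrary and chosen only so the results are reproducible):

```python
from datetime import date, timedelta

today = date(2024, 1, 1)  # fixed date so the results are reproducible

# date_add / date_sub are plain day offsets
five_days_later = today + timedelta(days=5)   # 2024-01-06
ten_days_ago = today - timedelta(days=10)     # 2023-12-22

# datediff(end_date, start_date) counts the days between two dates
days_diff = (date(2024, 12, 31) - date(2024, 1, 1)).days
print(five_days_later, ten_days_ago, days_diff)  # 2024-01-06 2023-12-22 365
```

Note that 2024 is a leap year, so the full span from January 1 to December 31 is 365 days (366 days in the year minus the start day itself).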
3. Date Formatting and Parsing Functions
- date_format(date, format) – Formats a date according to the given format string.
- to_date(string[, format]) – Converts a string to a date, optionally using a format pattern.
- to_timestamp(string[, format]) – Converts a string to a timestamp.
- unix_timestamp([date[, format]]) – Converts a date or timestamp to the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC).
- from_unixtime(unix_time, format) – Converts Unix time to a string in the specified format.
Example:
SELECT date_format(current_timestamp(), 'yyyy-MM-dd') AS formatted_date,
to_date('2024-09-01', 'yyyy-MM-dd') AS converted_date,
unix_timestamp('2024-09-01', 'yyyy-MM-dd') AS unix_time,
from_unixtime(1693564800, 'yyyy-MM-dd') AS from_unix_time
FROM your_table
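The epoch round trip can be verified with plain Python (not Spark; note that Spark's unix_timestamp and from_unixtime use the session time zone, so the UTC results below may shift by your configured offset):

```python
from datetime import datetime, timezone

# from_unixtime(1693564800, 'yyyy-MM-dd') -> the date of that epoch second.
# We pin the zone to UTC here for a reproducible check.
ts = datetime.fromtimestamp(1693564800, tz=timezone.utc)
formatted = ts.strftime('%Y-%m-%d')
print(formatted)  # 2023-09-01

# unix_timestamp is the inverse: seconds since 1970-01-01 00:00:00 UTC
epoch = int(datetime(2023, 9, 1, tzinfo=timezone.utc).timestamp())
print(epoch)  # 1693526400 (midnight UTC on 2023-09-01)
```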
4. Extracting Components from Dates
- year(date) – Extracts the year from a date.
- month(date) – Extracts the month from a date.
- day(date) or dayofmonth(date) – Extracts the day of the month from a date.
- hour(timestamp) – Extracts the hour from a timestamp.
- minute(timestamp) – Extracts the minute from a timestamp.
- second(timestamp) – Extracts the second from a timestamp.
- dayofweek(date) – Returns the day of the week (1 = Sunday, 7 = Saturday).
- weekofyear(date) – Returns the week of the year for a given date.
Example:
SELECT year(current_date()) AS year,
month(current_date()) AS month,
day(current_date()) AS day,
dayofweek(current_date()) AS day_of_week,
weekofyear(current_date()) AS week_of_year
FROM your_table
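The component extraction maps closely onto Python's datetime attributes; the sketch below (plain Python, not Spark) also shows how Spark's Sunday-based dayofweek numbering relates to Python's Monday-based isoweekday:

```python
from datetime import date

d = date(2024, 12, 31)  # a Tuesday

print(d.year, d.month, d.day)   # 2024 12 31

# Spark's dayofweek: 1 = Sunday ... 7 = Saturday.
# Python's isoweekday(): 1 = Monday ... 7 = Sunday, so convert:
spark_dayofweek = d.isoweekday() % 7 + 1
print(spark_dayofweek)          # 3 (Tuesday)

# weekofyear corresponds to the ISO week number
print(d.isocalendar()[1])       # 1 (2024-12-31 falls in ISO week 1 of 2025)
```

The last line is a useful reminder that ISO week numbers wrap at year boundaries, which matters when grouping by weekofyear.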
5. Date Truncation Functions
- trunc(date, format) – Truncates a date to the specified unit (e.g., 'YEAR', 'MM').
- date_trunc(format, timestamp) – Truncates a timestamp to the specified unit; note the argument order is reversed relative to trunc().
Example:
SELECT trunc(current_date(), 'MM') AS truncated_to_month,
date_trunc('MM', current_timestamp()) AS timestamp_truncated_to_month
FROM your_table
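What truncation to month actually does can be sketched in plain Python (not Spark; the sample dates are hypothetical):

```python
from datetime import date, datetime

# trunc(date, 'MM') keeps the year and month and resets the day to 1
d = date(2024, 9, 17)
truncated_to_month = d.replace(day=1)
print(truncated_to_month)  # 2024-09-01

# date_trunc('MM', timestamp) does the same but also zeroes the time part
ts = datetime(2024, 9, 17, 13, 45, 30)
ts_truncated = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
print(ts_truncated)  # 2024-09-01 00:00:00
```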
6. The EXTRACT Function in PySpark SQL
In PySpark SQL (as in standard SQL), EXTRACT is a built-in function for pulling parts of a date or timestamp, such as the year, month, day, hour, or minute, out of a date or timestamp column.
Syntax:
EXTRACT(field FROM source)
- field: the part of the date/time to extract (e.g., YEAR, MONTH, DAY, HOUR, MINUTE).
- source: the column or expression containing a date or timestamp.
Example Usage of EXTRACT in PySpark SQL Queries:
If you have a PySpark DataFrame with a date column and want to extract specific components (such as the year or month), you can use EXTRACT in a SQL query.
Example 1: Extracting Year, Month, and Day from a Date:
Assume you have a PySpark DataFrame with a date column in yyyy-MM-dd format, and you want to extract the year, month, and day.
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("ExtractExample").getOrCreate()
# Sample data
data = [("2023-09-01",), ("2024-12-31",), ("2020-01-20",)]
df = spark.createDataFrame(data, ["DateColumn"])
# Create a temporary SQL table
df.createOrReplaceTempView("date_table")
# SQL query to extract year, month, and day
result = spark.sql("""
SELECT
DateColumn,
EXTRACT(YEAR FROM DateColumn) AS Year,
EXTRACT(MONTH FROM DateColumn) AS Month,
EXTRACT(DAY FROM DateColumn) AS Day
FROM date_table
""")
result.show()
Output:
+----------+----+-----+---+
|DateColumn|Year|Month|Day|
+----------+----+-----+---+
|2023-09-01|2023| 9| 1|
|2024-12-31|2024| 12| 31|
|2020-01-20|2020| 1| 20|
+----------+----+-----+---+
Example 2: Extracting Hour and Minute from a Timestamp:
If you have a timestamp column and want to extract time components such as the hour and minute, you can use EXTRACT in a SQL query.
data = [("2023-09-01 12:30:45",), ("2024-12-31 18:45:00",), ("2020-01-20 07:15:25",)]
df = spark.createDataFrame(data, ["TimestampColumn"])
# Create a temporary SQL table
df.createOrReplaceTempView("timestamp_table")
# SQL query to extract hour and minute from the timestamp
result = spark.sql("""
SELECT
TimestampColumn,
EXTRACT(HOUR FROM TimestampColumn) AS Hour,
EXTRACT(MINUTE FROM TimestampColumn) AS Minute
FROM timestamp_table
""")
result.show()
Output:
+-------------------+----+------+
| TimestampColumn |Hour|Minute|
+-------------------+----+------+
|2023-09-01 12:30:45| 12| 30|
|2024-12-31 18:45:00| 18| 45|
|2020-01-20 07:15:25| 7| 15|
+-------------------+----+------+
Fields Supported by EXTRACT:
You can extract the following parts from a date or timestamp:
- YEAR: Extracts the year.
- MONTH: Extracts the month.
- DAY: Extracts the day.
- HOUR: Extracts the hour from a timestamp.
- MINUTE: Extracts the minute from a timestamp.
- SECOND: Extracts the second from a timestamp.
Using EXTRACT in PySpark SQL Queries:
In PySpark, you typically register a temporary view with createOrReplaceTempView() and then run SQL queries that use the EXTRACT function.
Example Scenario:
If you have a dataset with timestamps and you want to run a SQL query to extract the hour and minute, you can write a PySpark SQL query like:
df.createOrReplaceTempView("log_table")
spark.sql("""
SELECT
EXTRACT(HOUR FROM timestamp_column) AS Hour,
EXTRACT(MINUTE FROM timestamp_column) AS Minute
FROM log_table
""").show()
This query will return the hour and minute extracted from the timestamp column.
- The EXTRACT function in Spark SQL lets you pull specific parts (year, month, day, hour, minute, and so on) out of date or timestamp columns.
- It is used in SQL-based queries within PySpark, just as you would use SQL functions when querying structured data.
In short, when you see EXTRACT in a PySpark SQL query, it is extracting parts of a date or timestamp column.
7. Other Useful Date Functions
- last_day(date) – Returns the last day of the month for a given date.
- next_day(date, day_of_week) – Returns the first date after the given date that falls on the specified day of the week.
- quarter(date) – Returns the quarter of the year (1, 2, 3, or 4).
Example:
SELECT last_day(current_date()) AS last_day_of_month,
next_day(current_date(), 'Monday') AS next_monday,
quarter(current_date()) AS current_quarter
FROM your_table
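These three functions have simple plain-Python equivalents (a standard-library sketch, not Spark code; the sample date is hypothetical):

```python
import calendar
from datetime import date, timedelta

d = date(2024, 2, 10)  # a Saturday in a leap-year February

# last_day: last day of the month (calendar.monthrange handles leap years)
last_dom = calendar.monthrange(d.year, d.month)[1]
last_day = d.replace(day=last_dom)
print(last_day)  # 2024-02-29

# next_day(date, 'Monday'): the first Monday strictly after the date
target = 0  # Monday in Python's weekday() numbering (Mon=0 .. Sun=6)
days_ahead = (target - d.weekday()) % 7 or 7
next_monday = d + timedelta(days=days_ahead)
print(next_monday)  # 2024-02-12

# quarter: 1..4 derived from the month
quarter = (d.month - 1) // 3 + 1
print(quarter)  # 1
```

The `or 7` in the next_day logic mirrors Spark's behavior: if the input date already falls on the requested weekday, the result is a full week later, never the input date itself.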
Complete Example of a PySpark SQL Query:
Assume you have a table events with a timestamp column named event_time. Here’s a query that selects events from the current year, formats the event date, and extracts various components from the timestamp:
SELECT event_time,
date_format(event_time, 'yyyy-MM-dd') AS formatted_date,
year(event_time) AS event_year,
month(event_time) AS event_month,
dayofmonth(event_time) AS event_day,
hour(event_time) AS event_hour,
minute(event_time) AS event_minute,
second(event_time) AS event_second,
quarter(event_time) AS event_quarter,
weekofyear(event_time) AS event_week
FROM events
WHERE year(event_time) = year(current_date())
Notes:
Integration in PySpark: You can execute SQL queries with date functions using spark.sql(). For example:
result = spark.sql("""
    SELECT date_add(current_date(), 5) AS five_days_from_now,
           year(current_timestamp()) AS current_year
    FROM your_table
""")
result.show()
Date Functions for Filtering: You can use these functions in the WHERE clause to filter data based on dates:
SELECT * FROM your_table WHERE event_time >= '2023-01-01' AND event_time <= '2023-12-31'
These built-in PySpark SQL date functions provide extensive capabilities for handling, formatting, and manipulating date and time data at scale in Spark applications.