In PySpark, you can use various built-in date functions directly within SQL queries to manipulate and extract information from timestamp, date, and string columns. PySpark’s SQL API provides a wide range of date functions similar to standard SQL, which allows you to perform operations like formatting, adding/subtracting time intervals, and extracting specific components from dates.
Here are some of the most commonly used PySpark date functions in SQL queries:
1. Current Date and Time Functions
- current_date() – Returns the current date.
- current_timestamp() – Returns the current timestamp (date and time).
- now() – Alias for current_timestamp().
Example:
SELECT current_date() AS today_date,
       current_timestamp() AS current_time
FROM your_table
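If you just want to experiment, note that Spark SQL allows a SELECT without a FROM clause, so you can evaluate these functions without any table at all. A minimal sketch, assuming an active SparkSession named spark:
# No FROM clause needed: Spark evaluates the expressions directly
spark.sql("""
    SELECT current_date() AS today_date,
           current_timestamp() AS current_time
""").show(truncate=False)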
2. Date and Time Arithmetic Functions
- date_add(date, num_days) – Adds num_days days to the given date.
- date_sub(date, num_days) – Subtracts num_days days from the given date.
- add_months(date, num_months) – Adds num_months months to the given date.
- datediff(end_date, start_date) – Returns the number of days between two dates.
- months_between(end_date, start_date) – Returns the number of months between two dates.
Example:
SELECT date_add(current_date(), 5) AS five_days_from_now,
       date_sub(current_date(), 10) AS ten_days_ago,
       add_months(current_date(), 3) AS three_months_later,
       datediff('2024-12-31', '2024-01-01') AS days_diff
FROM your_table
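The same arithmetic is also exposed through the DataFrame API in pyspark.sql.functions, which can be convenient when you are not writing raw SQL. A minimal sketch, assuming an active SparkSession named spark:
from pyspark.sql.functions import (add_months, current_date, date_add,
                                   date_sub, datediff, lit, to_date)

# spark.range(1) provides a one-row DataFrame to evaluate the expressions against
spark.range(1).select(
    date_add(current_date(), 5).alias("five_days_from_now"),
    date_sub(current_date(), 10).alias("ten_days_ago"),
    add_months(current_date(), 3).alias("three_months_later"),
    datediff(to_date(lit("2024-12-31")), to_date(lit("2024-01-01"))).alias("days_diff"),
).show()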
3. Date Formatting and Parsing Functions
- date_format(date, format) – Formats a date according to the given format string.
- to_date(string[, format]) – Converts a string to a date, optionally using an explicit format.
- to_timestamp(string[, format]) – Converts a string to a timestamp, optionally using an explicit format.
- unix_timestamp([date_or_string[, format]]) – Converts a date or timestamp (or the current time, when called with no arguments) to the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC).
- from_unixtime(unix_time, format) – Converts Unix time to a string in the specified format.
Example:
SELECT date_format(current_timestamp(), 'yyyy-MM-dd') AS formatted_date,
       to_date('2024-09-01', 'yyyy-MM-dd') AS converted_date,
       unix_timestamp('2024-09-01', 'yyyy-MM-dd') AS unix_time,
       from_unixtime(1693564800, 'yyyy-MM-dd') AS from_unix_time
FROM your_table
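One detail worth knowing: unix_timestamp() interprets its input in the session time zone (spark.sql.session.timeZone), so the same string can produce different epoch values in different sessions. A quick round-trip sketch, assuming a SparkSession named spark:
# Pin the session time zone so the epoch value is reproducible
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("""
    SELECT unix_timestamp('2024-09-01 00:00:00') AS epoch_seconds,
           from_unixtime(unix_timestamp('2024-09-01 00:00:00')) AS round_trip
""").show(truncate=False)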
4. Extracting Components from Dates
- year(date) – Extracts the year from a date.
- month(date) – Extracts the month from a date.
- day(date) or dayofmonth(date) – Extracts the day of the month from a date.
- hour(timestamp) – Extracts the hour from a timestamp.
- minute(timestamp) – Extracts the minute from a timestamp.
- second(timestamp) – Extracts the second from a timestamp.
- dayofweek(date) – Returns the day of the week (1 = Sunday, 7 = Saturday).
- weekofyear(date) – Returns the week of the year for a given date.
Example:
SELECT year(current_date()) AS year,
       month(current_date()) AS month,
       day(current_date()) AS day,
       dayofweek(current_date()) AS day_of_week,
       weekofyear(current_date()) AS week_of_year
FROM your_table
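The 1 = Sunday convention of dayofweek() is easy to forget, so it can help to verify it against a known date; 2024-01-07 fell on a Sunday, for example:
# dayofweek() returns 1 for Sunday through 7 for Saturday
spark.sql("SELECT dayofweek(DATE'2024-01-07') AS should_be_1").show()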
5. Date Truncation Functions
- trunc(date, format) – Truncates a date to the specified unit (year, month, etc.).
- date_trunc(format, timestamp) – Truncates a timestamp to the specified unit.
Example:
SELECT trunc(current_date(), 'MM') AS truncated_to_month,
       date_trunc('MM', current_timestamp()) AS timestamp_truncated_to_month
FROM your_table
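Note that the two functions take their arguments in opposite order and return different types: trunc() takes the date first and returns a date, while date_trunc() takes the unit first and returns a timestamp. A small sketch that makes the difference visible:
# trunc(date, format) -> DATE; date_trunc(format, timestamp) -> TIMESTAMP
spark.sql("""
    SELECT trunc(DATE'2024-09-15', 'MM') AS as_date,
           date_trunc('MM', TIMESTAMP'2024-09-15 12:34:56') AS as_timestamp
""").show(truncate=False)
# as_date -> 2024-09-01, as_timestamp -> 2024-09-01 00:00:00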
6. The EXTRACT Function in PySpark SQL
In PySpark SQL (as in standard SQL), extract() is a built-in function for extracting parts of a date or timestamp, such as the year, month, day, hour, or minute, from a date or timestamp column.
Syntax:
EXTRACT(field FROM source)
- field: Specifies the part of the date/time to extract (e.g., YEAR, MONTH, DAY, HOUR, MINUTE, etc.).
- source: The column containing a date or timestamp.
Example Usage of extract() in PySpark SQL Queries:
If you have a PySpark DataFrame with a date column and you want to extract specific components (such as year or month), you can use extract() in a SQL query.
Example 1: Extracting Year, Month, and Day from a Date:
Assume you have a PySpark DataFrame with a date column in yyyy-MM-dd format, and you want to extract the year, month, and day.
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("ExtractExample").getOrCreate()
# Sample data
data = [("2023-09-01",), ("2024-12-31",), ("2020-01-20",)]
df = spark.createDataFrame(data, ["DateColumn"])
# Create a temporary SQL table
df.createOrReplaceTempView("date_table")
# SQL query to extract year, month, and day
# (to_date() casts the string column to a proper date before extracting)
result = spark.sql("""
    SELECT
        DateColumn,
        EXTRACT(YEAR FROM TO_DATE(DateColumn)) AS Year,
        EXTRACT(MONTH FROM TO_DATE(DateColumn)) AS Month,
        EXTRACT(DAY FROM TO_DATE(DateColumn)) AS Day
    FROM date_table
""")
result.show()
Output:
+----------+----+-----+---+
|DateColumn|Year|Month|Day|
+----------+----+-----+---+
|2023-09-01|2023|    9|  1|
|2024-12-31|2024|   12| 31|
|2020-01-20|2020|    1| 20|
+----------+----+-----+---+
Example 2: Extracting Hour and Minute from a Timestamp:
If you have a timestamp column and you want to extract specific time components such as hour and minute, you can use EXTRACT() in a SQL query.
data = [("2023-09-01 12:30:45",), ("2024-12-31 18:45:00",), ("2020-01-20 07:15:25",)]
df = spark.createDataFrame(data, ["TimestampColumn"])
# Create a temporary SQL table
df.createOrReplaceTempView("timestamp_table")
# SQL query to extract hour and minute from the timestamp
# (to_timestamp() casts the string column to a proper timestamp first)
result = spark.sql("""
    SELECT
        TimestampColumn,
        EXTRACT(HOUR FROM TO_TIMESTAMP(TimestampColumn)) AS Hour,
        EXTRACT(MINUTE FROM TO_TIMESTAMP(TimestampColumn)) AS Minute
    FROM timestamp_table
""")
result.show()
Output:
+-------------------+----+------+
|    TimestampColumn|Hour|Minute|
+-------------------+----+------+
|2023-09-01 12:30:45|  12|    30|
|2024-12-31 18:45:00|  18|    45|
|2020-01-20 07:15:25|   7|    15|
+-------------------+----+------+
Fields Supported by EXTRACT():
You can extract the following parts from a date or timestamp:
- YEAR: Extracts the year.
- MONTH: Extracts the month.
- DAY: Extracts the day.
- HOUR: Extracts the hour from a timestamp.
- MINUTE: Extracts the minute from a timestamp.
- SECOND: Extracts the second from a timestamp.
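Spark’s EXTRACT() actually accepts more field names than the common six above, including QUARTER, WEEK, DOW, and DOY (consult the Spark SQL reference for your version for the full list). For example:
# DOY (day of year) of 2024-12-31 is 366 because 2024 is a leap year
spark.sql("""
    SELECT EXTRACT(QUARTER FROM DATE'2024-12-31') AS quarter,
           EXTRACT(DOY FROM DATE'2024-12-31') AS day_of_year
""").show()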
Using EXTRACT() in PySpark’s SQL Queries:
In PySpark, you typically define a temporary view using createOrReplaceTempView() and then run SQL queries that use the EXTRACT() function.
Example Scenario:
If you have a dataset with timestamps and you want to run a SQL query to extract the hour and minute, you can write a PySpark SQL query like:
# Assuming df has a timestamp column named timestamp_column
df.createOrReplaceTempView("log_table")
spark.sql("""
    SELECT
        EXTRACT(HOUR FROM timestamp_column) AS Hour,
        EXTRACT(MINUTE FROM timestamp_column) AS Minute
    FROM log_table
""").show()
This query will return the hour and minute extracted from the timestamp column.
- The EXTRACT() function in Spark SQL allows you to extract specific parts (like year, month, day, hour, or minute) from date or timestamp columns.
- It is used in SQL-based queries within PySpark, similar to how you’d use SQL functions for querying structured data.
In short, when extract() appears in a PySpark SQL query, it is most likely pulling parts out of a date or timestamp column.
7. Other Useful Date Functions
- last_day(date) – Returns the last day of the month for a given date.
- next_day(date, day_of_week) – Returns the first date later than the given date that falls on the specified day of the week.
- quarter(date) – Returns the quarter of the year (1, 2, 3, or 4).
Example:
SELECT last_day(current_date()) AS last_day_of_month,
       next_day(current_date(), 'Monday') AS next_monday,
       quarter(current_date()) AS current_quarter
FROM your_table
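Because these functions depend on the calendar, fixed dates make their behavior easier to check; 2024-02-10 fell on a Saturday in a leap year, for example:
# last_day() respects leap years; next_day() returns the first matching later date
spark.sql("""
    SELECT last_day(DATE'2024-02-10') AS last_day_of_feb,       -- 2024-02-29
           next_day(DATE'2024-02-10', 'Monday') AS next_monday, -- 2024-02-12
           quarter(DATE'2024-02-10') AS q                       -- 1
""").show()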
Complete Example of a PySpark SQL Query:
Assume you have a table events with a timestamp column named event_time. Here’s a query that selects events from the current year, formats the event date, and extracts various components from the timestamp:
SELECT event_time,
       date_format(event_time, 'yyyy-MM-dd') AS formatted_date,
       year(event_time) AS event_year,
       month(event_time) AS event_month,
       dayofmonth(event_time) AS event_day,
       hour(event_time) AS event_hour,
       minute(event_time) AS event_minute,
       second(event_time) AS event_second,
       quarter(event_time) AS event_quarter,
       weekofyear(event_time) AS event_week
FROM events
WHERE year(event_time) = year(current_date())
Notes:
Integration in PySpark: You can execute SQL queries with date functions using spark.sql().
For example:
result = spark.sql("""
    SELECT date_add(current_date(), 5) AS five_days_from_now,
           year(current_timestamp()) AS current_year
    FROM your_table
""")
result.show()
Date Functions for Filtering: You can use these functions in the WHERE clause to filter data based on dates:
SELECT *
FROM your_table
WHERE to_date(event_time) BETWEEN '2023-01-01' AND '2023-12-31'
Casting with to_date() keeps the upper bound inclusive even when event_time is a timestamp; a plain event_time <= '2023-12-31' comparison would silently drop most of December 31.
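Filtering also composes naturally with the component-extraction functions. For instance, a sketch that keeps only rows from the current year and quarter (assuming a registered view your_table with an event_time column):
spark.sql("""
    SELECT *
    FROM your_table
    WHERE year(event_time) = year(current_date())
      AND quarter(event_time) = quarter(current_date())
""").show()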
These built-in PySpark SQL date functions provide extensive capabilities for handling, formatting, and manipulating date and time data at scale in Spark applications.