Date and Time Functions in PySpark DataFrames and PySpark SQL Queries

by lochan2014 | Dec 8, 2024 | PySpark


PySpark date functions, with explanations and Spark SQL examples for each.

In PySpark, you can use various built-in date functions directly within SQL queries to manipulate and extract data from timestamp, date, and string columns. PySpark’s SQL API provides a wide range of date functions similar to SQL, which allows you to perform operations like formatting, adding/subtracting time intervals, and extracting specific components from dates.

Here are some of the most commonly used PySpark date functions in SQL queries:

1. Current Date and Time Functions

  • current_date() – Returns the current date.
  • current_timestamp() – Returns the current timestamp (date and time).
  • now() – Alias for current_timestamp().

Example:

SELECT current_date() AS today_date,
       current_timestamp() AS current_time
FROM your_table

2. Date and Time Arithmetic Functions

  • date_add(date, num_days) – Adds num_days days to the given date.
  • date_sub(date, num_days) – Subtracts num_days days from the given date.
  • add_months(date, num_months) – Adds num_months months to the given date.
  • datediff(end_date, start_date) – Returns the number of days from start_date to end_date.
  • months_between(end_date, start_date) – Returns the number of months between two dates (possibly fractional).

Example:

SELECT date_add(current_date(), 5) AS five_days_from_now,
       date_sub(current_date(), 10) AS ten_days_ago,
       add_months(current_date(), 3) AS three_months_later,
       datediff('2024-12-31', '2024-01-01') AS days_diff
FROM your_table

3. Date Formatting and Parsing Functions

  • date_format(date, format) – Formats a date according to the given format string.
  • to_date(string[, format]) – Converts a string to a date, optionally using the given format.
  • to_timestamp(string[, format]) – Converts a string to a timestamp, optionally using the given format.
  • unix_timestamp([string[, format]]) – Converts a date/timestamp string to the number of seconds since the Unix epoch (1970-01-01 UTC); with no arguments, returns the current Unix timestamp.
  • from_unixtime(unix_time, format) – Converts Unix time to a string in the specified format.

Example:

SELECT date_format(current_timestamp(), 'yyyy-MM-dd') AS formatted_date,
       to_date('2024-09-01', 'yyyy-MM-dd') AS converted_date,
       unix_timestamp('2024-09-01', 'yyyy-MM-dd') AS unix_time,
       from_unixtime(1693564800, 'yyyy-MM-dd') AS from_unix_time
FROM your_table

4. Extracting Components from Dates

  • year(date) – Extracts the year from a date.
  • month(date) – Extracts the month from a date.
  • day(date) or dayofmonth(date) – Extracts the day of the month from a date.
  • hour(timestamp) – Extracts the hour from a timestamp.
  • minute(timestamp) – Extracts the minute from a timestamp.
  • second(timestamp) – Extracts the second from a timestamp.
  • dayofweek(date) – Returns the day of the week (1 = Sunday, 7 = Saturday).
  • weekofyear(date) – Returns the week of the year for a given date.

Example:

SELECT year(current_date()) AS year,
       month(current_date()) AS month,
       day(current_date()) AS day,
       dayofweek(current_date()) AS day_of_week,
       weekofyear(current_date()) AS week_of_year
FROM your_table

5. Date Truncation Functions

  • trunc(date, format) – Truncates a date to the specified unit (year, month, etc.).
  • date_trunc(format, timestamp) – Truncates a timestamp to the specified unit.

Example:

SELECT trunc(current_date(), 'MM') AS truncated_to_month,
       date_trunc('MM', current_timestamp()) AS timestamp_truncated_to_month
FROM your_table

6. The EXTRACT Function in PySpark SQL

In PySpark SQL (as in standard SQL), EXTRACT() is a built-in function for pulling a specific part out of a date or timestamp column, such as the year, month, day, hour, or minute.

Syntax:

EXTRACT(field FROM source)
  • field: Specifies the part of the date/time to extract (e.g., YEAR, MONTH, DAY, HOUR, MINUTE, etc.).
  • source: The column containing a date or timestamp.

Example Usage of extract() in PySpark SQL Queries:

If you have a PySpark DataFrame with a date column and you want to extract specific components (such as year or month), you can use extract() in a SQL query.

Example 1: Extracting Year, Month, and Day from a Date:

Assume you have a PySpark DataFrame with a date column in YYYY-MM-DD format, and you want to extract the year, month, and day.


from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("ExtractExample").getOrCreate()

# Sample data
data = [("2023-09-01",), ("2024-12-31",), ("2020-01-20",)]
df = spark.createDataFrame(data, ["DateColumn"])

# Create a temporary SQL table
df.createOrReplaceTempView("date_table")

# SQL query to extract year, month, and day
result = spark.sql("""
    SELECT
        DateColumn,
        EXTRACT(YEAR FROM DateColumn) AS Year,
        EXTRACT(MONTH FROM DateColumn) AS Month,
        EXTRACT(DAY FROM DateColumn) AS Day
    FROM date_table
""")

result.show()

Output:

+----------+----+-----+---+
|DateColumn|Year|Month|Day|
+----------+----+-----+---+
|2023-09-01|2023|    9|  1|
|2024-12-31|2024|   12| 31|
|2020-01-20|2020|    1| 20|
+----------+----+-----+---+

Example 2: Extracting Hour and Minute from a Timestamp:

If you have a timestamp column and you want to extract specific time components such as hour and minute, you can use EXTRACT() in a SQL query.

data = [("2023-09-01 12:30:45",), ("2024-12-31 18:45:00",), ("2020-01-20 07:15:25",)]
df = spark.createDataFrame(data, ["TimestampColumn"])

# Create a temporary SQL table
df.createOrReplaceTempView("timestamp_table")

# SQL query to extract hour and minute from the timestamp
result = spark.sql("""
    SELECT
        TimestampColumn,
        EXTRACT(HOUR FROM TimestampColumn) AS Hour,
        EXTRACT(MINUTE FROM TimestampColumn) AS Minute
    FROM timestamp_table
""")

result.show()

Output:

+-------------------+----+------+
|    TimestampColumn|Hour|Minute|
+-------------------+----+------+
|2023-09-01 12:30:45|  12|    30|
|2024-12-31 18:45:00|  18|    45|
|2020-01-20 07:15:25|   7|    15|
+-------------------+----+------+

Fields Supported by EXTRACT():

You can extract the following parts from a date or timestamp:

  • YEAR: Extracts the year.
  • MONTH: Extracts the month.
  • DAY: Extracts the day.
  • HOUR: Extracts the hour from a timestamp.
  • MINUTE: Extracts the minute from a timestamp.
  • SECOND: Extracts the second from a timestamp.

Using EXTRACT() in PySpark’s SQL Queries:

In PySpark, you typically define a temporary view using createOrReplaceTempView() to run SQL queries with the EXTRACT() function.

Example Scenario:

If you have a dataset with timestamps and you want to run a SQL query to extract the hour and minute, you can write a PySpark SQL query like:

df.createOrReplaceTempView("log_table")

spark.sql("""
    SELECT
        EXTRACT(HOUR FROM timestamp_column) AS Hour,
        EXTRACT(MINUTE FROM timestamp_column) AS Minute
    FROM log_table
""").show()

This query will return the hour and minute extracted from the timestamp column.

  • The EXTRACT() function in Spark SQL extracts specific parts (such as year, month, day, hour, or minute) from date or timestamp columns.
  • It is used in SQL-based queries within PySpark, just as you would use SQL functions for querying structured data.

7. Other Useful Date Functions

  • last_day(date) – Returns the last day of the month for a given date.
  • next_day(date, day_of_week) – Returns the next date after the given date for the specified day of the week.
  • quarter(date) – Returns the quarter of the year (1, 2, 3, or 4).

Example:

SELECT last_day(current_date()) AS last_day_of_month,
       next_day(current_date(), 'Monday') AS next_monday,
       quarter(current_date()) AS current_quarter
FROM your_table

Complete Example of a PySpark SQL Query:

Assume you have a table events with a timestamp column named event_time. Here’s a query that selects events from the current year, formats the event date, and extracts various components from the timestamp:

SELECT event_time,
       date_format(event_time, 'yyyy-MM-dd') AS formatted_date,
       year(event_time) AS event_year,
       month(event_time) AS event_month,
       dayofmonth(event_time) AS event_day,
       hour(event_time) AS event_hour,
       minute(event_time) AS event_minute,
       second(event_time) AS event_second,
       quarter(event_time) AS event_quarter,
       weekofyear(event_time) AS event_week
FROM events
WHERE year(event_time) = year(current_date())

Notes:

Integration in PySpark: You can execute SQL queries with date functions using spark.sql().

For example:

result = spark.sql("""
    SELECT date_add(current_date(), 5) AS five_days_from_now,
           year(current_timestamp()) AS current_year
    FROM your_table
""")
result.show()

Date Functions for Filtering: You can use these functions in the WHERE clause to filter data based on dates. Note that when event_time is a timestamp, a string such as '2023-12-31' is interpreted as midnight on that day, so use an exclusive upper bound to keep the whole last day:

SELECT * FROM your_table WHERE event_time >= '2023-01-01' AND event_time < '2024-01-01'

These built-in PySpark SQL date functions provide extensive capabilities for handling, formatting, and manipulating date and time data at scale in Spark applications.

