In PySpark, you can use various built-in date functions directly within SQL queries to manipulate and extract data from timestamp, date, and string columns. PySpark’s SQL API provides a wide range of date functions similar to SQL, which allows you to perform operations like formatting, adding/subtracting time intervals, and extracting specific components from dates.

Here are some of the most commonly used PySpark date functions in SQL queries:

1. Current Date and Time Functions

  • current_date() – Returns the current date.
  • current_timestamp() – Returns the current timestamp (date and time).
  • now() – Alias for current_timestamp().

Example:

SELECT current_date() AS today_date,
       current_timestamp() AS current_time
FROM your_table

2. Date and Time Arithmetic Functions

  • date_add(date, num_days) – Adds num_days days to the given date.
  • date_sub(date, num_days) – Subtracts num_days days from the given date.
  • add_months(date, num_months) – Adds num_months months to the given date.
  • datediff(end_date, start_date) – Returns the number of days from start_date to end_date.

  • months_between(end_date, start_date) – Returns the number of months between two dates.

Example:

SELECT date_add(current_date(), 5) AS five_days_from_now,
       date_sub(current_date(), 10) AS ten_days_ago,
       add_months(current_date(), 3) AS three_months_later,
       datediff('2024-12-31', '2024-01-01') AS days_diff
FROM your_table

3. Date Formatting and Parsing Functions

  • date_format(date, format) – Formats a date according to the given format string.
  • to_date(string[, format]) – Converts a string to a date, optionally using the given format pattern.
  • to_timestamp(string[, format]) – Converts a string to a timestamp, optionally using the given format pattern.
  • unix_timestamp([string[, format]]) – Converts a date, timestamp, or string to the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC); with no argument it returns the current Unix time.
  • from_unixtime(unix_time, format) – Converts Unix time to a string in the specified format.

Example:

SELECT date_format(current_timestamp(), 'yyyy-MM-dd') AS formatted_date,
       to_date('2024-09-01', 'yyyy-MM-dd') AS converted_date,
       unix_timestamp('2024-09-01', 'yyyy-MM-dd') AS unix_time,
       from_unixtime(1693564800, 'yyyy-MM-dd') AS from_unix_time
FROM your_table

4. Extracting Components from Dates

  • year(date) – Extracts the year from a date.
  • month(date) – Extracts the month from a date.
  • day(date) or dayofmonth(date) – Extracts the day of the month from a date.
  • hour(timestamp) – Extracts the hour from a timestamp.
  • minute(timestamp) – Extracts the minute from a timestamp.
  • second(timestamp) – Extracts the second from a timestamp.
  • dayofweek(date) – Returns the day of the week (1 = Sunday, 7 = Saturday).
  • weekofyear(date) – Returns the week of the year for a given date.

Example:

SELECT year(current_date()) AS year,
       month(current_date()) AS month,
       day(current_date()) AS day,
       dayofweek(current_date()) AS day_of_week,
       weekofyear(current_date()) AS week_of_year
FROM your_table

5. Date Truncation Functions

  • trunc(date, format) – Truncates a date to the specified unit (year, month, etc.).
  • date_trunc(format, timestamp) – Truncates a timestamp to the specified unit.

Example:

SELECT trunc(current_date(), 'MM') AS truncated_to_month, 
       date_trunc('MM', current_timestamp()) AS timestamp_truncated_to_month
FROM your_table

6. EXTRACT Function in PySpark SQL

In PySpark SQL (as in standard SQL), EXTRACT() is a built-in function that pulls a single part out of a date or timestamp, such as the year, month, day, hour, or minute, from a date or timestamp column.

Syntax:

EXTRACT(field FROM source)
  • field: Specifies the part of the date/time to extract (e.g., YEAR, MONTH, DAY, HOUR, MINUTE, etc.).
  • source: The column containing a date or timestamp.

Example Usage of extract() in PySpark SQL Queries:

If you have a PySpark DataFrame with a date column and you want to extract specific components (such as year or month), you can use extract() in a SQL query.

Example 1: Extracting Year, Month, and Day from a Date:

Assume you have a PySpark DataFrame with a date column in YYYY-MM-DD format, and you want to extract the year, month, and day.


from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("ExtractExample").getOrCreate()

# Sample data
data = [("2023-09-01",), ("2024-12-31",), ("2020-01-20",)]
df = spark.createDataFrame(data, ["DateColumn"])

# Create a temporary SQL table
df.createOrReplaceTempView("date_table")

# SQL query to extract year, month, and day
result = spark.sql("""
    SELECT 
        DateColumn,
        EXTRACT(YEAR FROM DateColumn) AS Year,
        EXTRACT(MONTH FROM DateColumn) AS Month,
        EXTRACT(DAY FROM DateColumn) AS Day
    FROM date_table
""")

result.show()

Output:

+----------+----+-----+---+
|DateColumn|Year|Month|Day|
+----------+----+-----+---+
|2023-09-01|2023|    9|  1|
|2024-12-31|2024|   12| 31|
|2020-01-20|2020|    1| 20|
+----------+----+-----+---+

Example 2: Extracting Hour and Minute from a Timestamp:

If you have a timestamp column and you want to extract specific time components such as hour and minute, you can use EXTRACT() in a SQL query.

data = [("2023-09-01 12:30:45",), ("2024-12-31 18:45:00",), ("2020-01-20 07:15:25",)]
df = spark.createDataFrame(data, ["TimestampColumn"])

# Create a temporary SQL table
df.createOrReplaceTempView("timestamp_table")

# SQL query to extract hour and minute from the timestamp
result = spark.sql("""
    SELECT 
        TimestampColumn,
        EXTRACT(HOUR FROM TimestampColumn) AS Hour,
        EXTRACT(MINUTE FROM TimestampColumn) AS Minute
    FROM timestamp_table
""")

result.show()

Output:

+-------------------+----+------+
|    TimestampColumn|Hour|Minute|
+-------------------+----+------+
|2023-09-01 12:30:45|  12|    30|
|2024-12-31 18:45:00|  18|    45|
|2020-01-20 07:15:25|   7|    15|
+-------------------+----+------+

Fields Supported by EXTRACT():

You can extract the following parts from a date or timestamp:

  • YEAR: Extracts the year.
  • MONTH: Extracts the month.
  • DAY: Extracts the day.
  • HOUR: Extracts the hour from a timestamp.
  • MINUTE: Extracts the minute from a timestamp.
  • SECOND: Extracts the second from a timestamp.

Using EXTRACT() in PySpark’s SQL Queries:

In PySpark, you typically define a temporary view using createOrReplaceTempView() to run SQL queries with the EXTRACT() function.

Example Scenario:

If you have a dataset with timestamps and you want to run a SQL query to extract the hour and minute, you can write a PySpark SQL query like:

df.createOrReplaceTempView("log_table")

spark.sql("""
    SELECT 
        EXTRACT(HOUR FROM timestamp_column) AS Hour,
        EXTRACT(MINUTE FROM timestamp_column) AS Minute
    FROM log_table
""").show()

This query will return the hour and minute extracted from the timestamp column.

  • The EXTRACT() function in Spark SQL allows you to extract specific parts (like year, month, day, hour, minute) from date or timestamp columns.
  • It is used in SQL-based queries within PySpark, similar to how you’d use SQL functions for querying structured data.

In short, when EXTRACT() appears in a PySpark SQL query, it is pulling individual parts out of a date or timestamp column.

7. Other Useful Date Functions

  • last_day(date) – Returns the last day of the month for a given date.
  • next_day(date, day_of_week) – Returns the next date after the given date for the specified day of the week.
  • quarter(date) – Returns the quarter of the year (1, 2, 3, or 4).

Example:

SELECT last_day(current_date()) AS last_day_of_month,
       next_day(current_date(), 'Monday') AS next_monday,
       quarter(current_date()) AS current_quarter
FROM your_table

Complete Example of a PySpark SQL Query:

Assume you have a table events with a timestamp column named event_time. Here’s a query that selects events from the current year, formats the event date, and extracts various components from the timestamp:

SELECT event_time,
       date_format(event_time, 'yyyy-MM-dd') AS formatted_date,
       year(event_time) AS event_year,
       month(event_time) AS event_month,
       dayofmonth(event_time) AS event_day,
       hour(event_time) AS event_hour,
       minute(event_time) AS event_minute,
       second(event_time) AS event_second,
       quarter(event_time) AS event_quarter,
       weekofyear(event_time) AS event_week
FROM events
WHERE year(event_time) = year(current_date())

Notes:

Integration in PySpark: You can execute SQL queries with date functions using spark.sql().

For example:

result = spark.sql("""
    SELECT date_add(current_date(), 5) AS five_days_from_now,
           year(current_timestamp()) AS current_year
    FROM your_table
""")
result.show()

Date Functions for Filtering: You can use these functions in the WHERE clause to filter data based on dates:

SELECT *
FROM your_table
WHERE event_time >= '2023-01-01'
  AND event_time <= '2023-12-31'

These built-in PySpark SQL date functions provide extensive capabilities for handling, formatting, and manipulating date and time data at scale in Spark applications.


