Date and Time Functions- Pyspark Dataframes & Pyspark Sql Queries

by lochan2014 | Dec 8, 2024 | Pyspark | 0 comments

A quick reference for date manipulation in PySpark:–

Function	Description	Works On	Example (Spark SQL)	Example (DataFrame API)
`to_date`	Converts string to date.	String	`TO_DATE('2024-01-15', 'yyyy-MM-dd')`	`to_date(col("date_str"), "yyyy-MM-dd")`
`to_timestamp`	Converts string to timestamp.	String	`TO_TIMESTAMP('2024-01-15 12:34:56', 'yyyy-MM-dd HH:mm:ss')`	`to_timestamp(col("timestamp_str"), "yyyy-MM-dd HH:mm:ss")`
`date_format`	Formats date or timestamp as a string.	Date, Timestamp	`DATE_FORMAT(CURRENT_DATE, 'dd-MM-yyyy')`	`date_format(col("date_col"), "dd-MM-yyyy")`
`current_date`	Returns the current date.	–	`CURRENT_DATE`	`current_date()`
`current_timestamp`	Returns the current timestamp.	–	`CURRENT_TIMESTAMP`	`current_timestamp()`
`date_add`	Adds days to a date.	Date, Timestamp	`DATE_ADD('2024-01-15', 10)`	`date_add(col("date_col"), 10)`
`date_sub`	Subtracts days from a date.	Date, Timestamp	`DATE_SUB('2024-01-15', 10)`	`date_sub(col("date_col"), 10)`
`months_between`	Returns months between two dates or timestamps.	Date, Timestamp	`MONTHS_BETWEEN('2024-01-15', '2023-12-15')`	`months_between(col("date1"), col("date2"))`
`datediff`	Returns difference in days between two dates/timestamps.	Date, Timestamp	`DATEDIFF('2024-01-15', '2024-01-10')`	`datediff(col("date1"), col("date2"))`
`year`	Extracts year from a date or timestamp.	Date, Timestamp	`YEAR('2024-01-15')`	`year(col("date_col"))`
`month`	Extracts month from a date or timestamp.	Date, Timestamp	`MONTH('2024-01-15')`	`month(col("date_col"))`
`day`	Extracts day from a date or timestamp.	Date, Timestamp	`DAY('2024-01-15')`	`day(col("date_col"))`
`dayofweek`	Returns the day of the week (1 = Sunday, 7 = Saturday).	Date, Timestamp	`DAYOFWEEK('2024-01-15')`	`dayofweek(col("date_col"))`
`dayofmonth`	Returns the day of the month (1-31).	Date, Timestamp	`DAYOFMONTH('2024-01-15')`	`dayofmonth(col("date_col"))`
`dayofyear`	Returns the day of the year (1-366).	Date, Timestamp	`DAYOFYEAR('2024-01-15')`	`dayofyear(col("date_col"))`
`weekofyear`	Returns the week number of the year.	Date, Timestamp	`WEEKOFYEAR('2024-01-15')`	`weekofyear(col("date_col"))`
`quarter`	Extracts quarter of the year.	Date, Timestamp	`QUARTER('2024-01-15')`	`quarter(col("date_col"))`
`hour`	Extracts hour from a timestamp.	Timestamp	`HOUR('2024-01-15 12:34:56')`	`hour(col("timestamp_col"))`
`minute`	Extracts minute from a timestamp.	Timestamp	`MINUTE('2024-01-15 12:34:56')`	`minute(col("timestamp_col"))`
`second`	Extracts second from a timestamp.	Timestamp	`SECOND('2024-01-15 12:34:56')`	`second(col("timestamp_col"))`
`last_day`	Returns the last day of the month for a date.	Date, Timestamp	`LAST_DAY('2024-01-15')`	`last_day(col("date_col"))`
`next_day`	Returns the next day of the week after a date.	Date, Timestamp	`NEXT_DAY('2024-01-15', 'Sunday')`	`next_day(col("date_col"), "Sunday")`
`trunc`	Truncates a date to the specified part (e.g., month).	Date, Timestamp	`TRUNC('2024-01-15', 'MM')`	`trunc(col("date_col"), "MM")`
`add_months`	Adds months to a date.	Date, Timestamp	`ADD_MONTHS('2024-01-15', 2)`	`add_months(col("date_col"), 2)`
`unix_timestamp`	Converts a string to UNIX timestamp.	String	`UNIX_TIMESTAMP('2024-01-15', 'yyyy-MM-dd')`	`unix_timestamp(col("date_str"), "yyyy-MM-dd")`
`from_unixtime`	Converts UNIX timestamp to a formatted string.	Integer (UNIX timestamp)	`FROM_UNIXTIME(1673827200, 'yyyy-MM-dd')`	`from_unixtime(col("unix_col"), "yyyy-MM-dd")`

Now starts our Post:-

PySpark date functions along with detailed explanations and examples on Dataframes

Here’s a comprehensive list of some common PySpark date functions along with detailed explanations and examples on Dataframes(We will again discuss these basis Pyspark sql Queries):

1. `current_date()`

Returns the current date.

from pyspark.sql.functions import current_date

df = spark.createDataFrame([(1,)], ["dummy"])
df.select(current_date().alias("current_date")).show()

2. `current_timestamp()`

Returns the current timestamp.

from pyspark.sql.functions import current_timestamp

df.select(current_timestamp().alias("current_timestamp")).show()

3. `date_format()`

Formats a date using the specified format.

from pyspark.sql.functions import date_format

df = spark.createDataFrame([("2010-02-05",)], ["date"])
df.select(date_format("date", "MM/dd/yyyy").alias("formatted_date")).show()

4. `year()`, `month()`, `dayofmonth()`

Extracts the year, month, and day from a date.

from pyspark.sql.functions import year, month, dayofmonth

df.select(year("date").alias("year"),
          month("date").alias("month"),
          dayofmonth("date").alias("day")).show()

5. `date_add()`, `date_sub()`

Adds or subtracts a specified number of days to/from a date.

from pyspark.sql.functions import date_add, date_sub

df.select(date_add("date", 10).alias("date_add"),date_sub("date", 10).alias("date_sub")).show()

6. `datediff()`

Returns the difference in days between two dates.

from pyspark.sql.functions import datediff

df2 = spark.createDataFrame([("2010-02-15",)], ["date2"])
df.join(df2).select(datediff("date2", "date").alias("datediff")).show()

7. `add_months()`

Adds a specified number of months to a date.

from pyspark.sql.functions import add_months

df.select(add_months("date", 1).alias("add_months")).show()

8. `months_between()`

Returns the number of months between two dates.

from pyspark.sql.functions import months_between

df.join(df2).select(months_between("date2", "date").alias("months_between")).show()

9. `next_day()`

Returns the first date which is the specified day of the week after the given date.

from pyspark.sql.functions import next_day

df.select(next_day("date", "Sunday").alias("next_day")).show()

10. `last_day()`

Returns the last day of the month which the given date belongs to.

from pyspark.sql.functions import last_day

df.select(last_day("date").alias("last_day")).show()

11. `trunc()`

Truncates a date to the specified unit (‘year’, ‘month’).

from pyspark.sql.functions import trunc

df.select(trunc("date", "MM").alias("trunc_month"),
          trunc("date", "YY").alias("trunc_year")).show()

12. `date_trunc()`

Truncates a timestamp to the specified unit (‘year’, ‘month’, ‘day’, ‘hour’, ‘minute’).

from pyspark.sql.functions import date_trunc

df = spark.createDataFrame([("2010-02-05 12:34:56",)], ["timestamp"])
df.select(date_trunc("hour", "timestamp").alias("date_trunc_hour")).show()

13. `from_unixtime()`

Converts the number of seconds from the Unix epoch to a string representing the timestamp.

from pyspark.sql.functions import from_unixtime

df = spark.createDataFrame([(1234567890,)], ["unix_time"])
df.select(from_unixtime("unix_time").alias("timestamp")).show()

14. `unix_timestamp()`

Converts a string timestamp to the number of seconds from the Unix epoch.

from pyspark.sql.functions import unix_timestamp

df = spark.createDataFrame([("2010-02-05 12:34:56",)], ["timestamp"])
df.select(unix_timestamp("timestamp").alias("unix_timestamp")).show()

15. `to_date()`

Converts a string to a date.

from pyspark.sql.functions import to_date

df = spark.createDataFrame([("2010-02-05",)], ["date_str"])
df.select(to_date("date_str").alias("date")).show()

16. `to_timestamp()`

Converts a string to a timestamp.

from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame([("2010-02-05 12:34:56",)], ["timestamp_str"])
df.select(to_timestamp("timestamp_str").alias("timestamp")).show()

17. `weekofyear()`

Returns the week of the year for a date.

from pyspark.sql.functions import weekofyear

df = spark.createDataFrame([("2010-02-05",)], ["date"])
df.select(weekofyear("date").alias("weekofyear")).show()

18. `quarter()`

Returns the quarter of the year for a date.

from pyspark.sql.functions import quarter

df.select(quarter("date").alias("quarter")).show()

19. `dayofweek()`

Returns the day of the week for a date.

from pyspark.sql.functions import dayofweek

df.select(dayofweek("date").alias("dayofweek")).show()

Example Project: Combining Multiple Date Functions

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format, year, month, dayofmonth, date_add, date_sub, datediff, add_months, months_between, next_day, last_day, trunc, date_trunc, from_unixtime, unix_timestamp, to_date, to_timestamp, weekofyear, quarter, dayofweek

# Initialize Spark session
spark = SparkSession.builder.appName("date_functions_example").getOrCreate()

# Create a DataFrame with a sample date
data = [("2010-02-05", "2010-02-15 12:34:56", 1234567890)]
columns = ["date", "timestamp", "unix_time"]
df = spark.createDataFrame(data, columns)

# Applying various date functions
result = df.select(
    col("date"),
    col("timestamp"),
    col("unix_time"),
    date_format("date", "MM/dd/yyyy").alias("formatted_date"),
    year("date").alias("year"),
    month("date").alias("month"),
    dayofmonth("date").alias("day"),
    date_add("date", 10).alias("date_add"),
    date_sub("date", 10).alias("date_sub"),
    datediff("timestamp", "date").alias("datediff"),
    add_months("date", 1).alias("add_months"),
    months_between("timestamp", "date").alias("months_between"),
    next_day("date", "Sunday").alias("next_day"),
    last_day("date").alias("last_day"),
    trunc("date", "MM").alias("trunc_month"),
    trunc("date", "YY").alias("trunc_year"),
    date_trunc("hour", "timestamp").alias("date_trunc_hour"),
    from_unixtime("unix_time").alias("from_unixtime"),
    unix_timestamp("timestamp").alias("unix_timestamp"),
    to_date("timestamp").alias("to_date"),
    to_timestamp("timestamp").alias("to_timestamp"),
    weekofyear("date").alias("weekofyear"),
    quarter("date").alias("quarter"),
    dayofweek("date").alias("dayofweek")
)

# Show the results
result.show(truncate=False)

Function	Description	Example
current_date()	Returns the current date as a DateType object	`current_date()`
current_timestamp()	Returns the current timestamp (with microseconds) as a TimestampType object	`current_timestamp()`
date_add(date, days)	Adds a specified number of days to a date	`date_add('2023-07-05', 10)`
datediff(end_date, start_date)	Calculates the number of days between two dates	`datediff('2024-01-01', '2023-12-31')`
months_between(date1, date2)	Calculates the number of months between two dates	`months_between('2024-02-01', '2023-07-05')`
to_date(string)	Converts a string to a DateType object (format-dependent)	`to_date('2023-07-04', 'yyyy-MM-dd')`
to_timestamp(string)	Converts a string to a TimestampType object (format-dependent)	`to_timestamp('2023-07-04 10:20:30', 'yyyy-MM-dd HH:mm:ss')`
trunc(date, format)	Truncates a date to a specified unit (year, month, day)	`trunc('2023-07-05', 'year')`
year(date)	Extracts the year from a date	`year('2023-07-05')`
quarter(date)	Extracts the quarter (1-4) from a date	`quarter('2023-07-05')`
month(date)	Extracts the month (1-12) from a date	`month('2023-07-05')`
dayofmonth(date)	Extracts the day of the month (1-31) from a date	`dayofmonth('2023-07-05')`
dayofweek(date)	Extracts the day of the week (1-7, Monday=1) from a date	`dayofweek('2023-07-05')`
dayofyear(date)	Extracts the day of the year (1-365) from a date	`dayofyear('2023-07-05')`
last_day(date)	Returns the last day of the month for the given date	`last_day('2023-07-05')`
unix_timestamp()	Returns the number of seconds since epoch (1970-01-01 UTC) for a timestamp	`unix_timestamp()`
from_unixtime(timestamp)	Converts a unix timestamp to a TimestampType object	`from_unixtime(1656816000)`
date_format(date, format)	Formats a date or timestamp according to a specified format	`date_format('2023-07-05', 'dd-MM-yyyy')`

← Window functions in PySpark on Dataframe programmingHow to Write Perfect Pseudocode- Syntax , Standards, Terms →

Written By

undefined

PySpark SQL API Programming- How To, Approaches, Optimization

Feb 9, 2025 | Pyspark

In PySpark, DataFrame transformations and operations can be efficiently handled using two main approaches: 1️⃣ PySpark SQL API Programming (Temp Tables / Views) Each transformation step can be written as a SQL query. Intermediate results can be stored as temporary…

How the Python interpreter reads and processes a Python script and Memory Management in Python

Feb 8, 2025 | Python

I believe you read our Post https://www.hintstoday.com/i-did-python-coding-or-i-wrote-a-python-script-and-got-it-exected-so-what-it-means/. Before starting here kindly go through the Link. How the Python interpreter reads and processes a Python script The Python…

Lists and Tuples in Python – List and Tuple Comprehension, Usecases

Feb 5, 2025 | Python

Python Lists: A Comprehensive Guide What is a List? Lists are a fundamental data structure in Python used to store collections of items. They are: Ordered: Elements maintain a defined sequence. Mutable: Elements can be modified after creation. Defined by: Square…

Date and Time Functions- Pyspark Dataframes & Pyspark Sql Queries

PySpark date functions along with detailed explanations and examples on Dataframes

1. `current_date()`

2. `current_timestamp()`

3. `date_format()`

4. `year()`, `month()`, `dayofmonth()`

5. `date_add()`, `date_sub()`

6. `datediff()`

7. `add_months()`

8. `months_between()`

9. `next_day()`

10. `last_day()`

11. `trunc()`

12. `date_trunc()`

13. `from_unixtime()`

14. `unix_timestamp()`

15. `to_date()`

16. `to_timestamp()`

17. `weekofyear()`

18. `quarter()`

19. `dayofweek()`

Example Project: Combining Multiple Date Functions

Written By

Related Posts

PySpark SQL API Programming- How To, Approaches, Optimization

How the Python interpreter reads and processes a Python script and Memory Management in Python

Lists and Tuples in Python – List and Tuple Comprehension, Usecases

Submit a Comment Cancel reply

Explore Our Tutorials

Connect With Us

About HintsToday

Success!

Date and Time Functions- Pyspark Dataframes & Pyspark Sql Queries

PySpark date functions along with detailed explanations and examples on Dataframes

1. current_date()

2. current_timestamp()

3. date_format()

4. year(), month(), dayofmonth()

5. date_add(), date_sub()

6. datediff()

7. add_months()

8. months_between()

9. next_day()

10. last_day()

11. trunc()

12. date_trunc()

13. from_unixtime()

14. unix_timestamp()

15. to_date()

16. to_timestamp()

17. weekofyear()

18. quarter()

19. dayofweek()

Example Project: Combining Multiple Date Functions

Written By

Related Posts

PySpark SQL API Programming- How To, Approaches, Optimization

How the Python interpreter reads and processes a Python script and Memory Management in Python

Lists and Tuples in Python – List and Tuple Comprehension, Usecases

Submit a Comment Cancel reply

Success!

1. `current_date()`

2. `current_timestamp()`

3. `date_format()`

4. `year()`, `month()`, `dayofmonth()`

5. `date_add()`, `date_sub()`

6. `datediff()`

7. `add_months()`

8. `months_between()`

9. `next_day()`

10. `last_day()`

11. `trunc()`

12. `date_trunc()`

13. `from_unixtime()`

14. `unix_timestamp()`

15. `to_date()`

16. `to_timestamp()`

17. `weekofyear()`

18. `quarter()`

19. `dayofweek()`