Python’s built-in date functions, such as those in the datetime module, are not designed for a distributed engine like PySpark. Called directly, they run only on the driver; to apply them to DataFrame rows you have to wrap them in a Python UDF, which serializes every row between the JVM and the Python workers and is opaque to the Catalyst optimizer, so it is much slower than Spark’s native column expressions.
To perform date-related operations in PySpark, use the built-in date functions in pyspark.sql.functions, which are designed to run in a distributed environment (a short sketch contrasting the two approaches follows the list below). These functions include:
pyspark.sql.functions.current_date()
pyspark.sql.functions.current_timestamp()
pyspark.sql.functions.date_format()
pyspark.sql.functions.date_trunc()
pyspark.sql.functions.dayofmonth()
pyspark.sql.functions.dayofyear()
pyspark.sql.functions.from_unixtime()
pyspark.sql.functions.hour()
pyspark.sql.functions.minute()
pyspark.sql.functions.month()
pyspark.sql.functions.quarter()
pyspark.sql.functions.second()
pyspark.sql.functions.to_date()
pyspark.sql.functions.to_timestamp()
pyspark.sql.functions.unix_timestamp()
pyspark.sql.functions.weekofyear()
pyspark.sql.functions.year()
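To make the difference concrete, here is a minimal sketch (the sample data and column names are illustrative, not from the original post) comparing a datetime-based Python UDF with the equivalent native function:
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, year
from pyspark.sql.types import IntegerType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2022-01-01",), ("2022-06-15",)], ["date"])
# Python UDF: each row is shipped to a Python worker, parsed with datetime,
# and the result is shipped back -- correct, but slow and invisible to Catalyst
@udf(returnType=IntegerType())
def year_udf(date_str):
    return datetime.strptime(date_str, "%Y-%m-%d").year
df_udf = df.withColumn("year", year_udf("date"))
# Native function: evaluated inside the JVM as a Catalyst expression,
# with no per-row Python round trip
df_native = df.withColumn("year", year("date"))
df_udf.show()
df_native.show()
Both versions produce the same result; the native year() call is simply the faster and more optimizable way to get it.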
Here is a reference list of the commonly used PySpark date and timestamp functions, with a short description of each:
add_months(date, months): Returns the date that is months months after date.
current_date(): Returns the current date.
current_timestamp(): Returns the current timestamp.
date_format(date, format): Converts a date to a string using the specified format.
date_trunc(format, timestamp): Truncates a timestamp to the specified unit (for example "year", "month", "day").
dayofmonth(date): Returns the day of the month from a date.
dayofyear(date): Returns the day of the year from a date.
from_unixtime(unix_timestamp): Converts a Unix timestamp (in seconds) to a timestamp string.
from_utc_timestamp(timestamp, tz): Converts a timestamp from UTC to the specified timezone.
hour(timestamp): Returns the hour from a timestamp.
last_day(date): Returns the last day of the month for a date.
minute(timestamp): Returns the minute from a timestamp.
month(date): Returns the month from a date.
next_day(date, dayOfWeek): Returns the first date after the given date that falls on the specified day of the week.
quarter(date): Returns the quarter from a date.
second(timestamp): Returns the second from a timestamp.
to_date(timestamp): Converts a timestamp or date string to a date.
to_timestamp(date): Converts a date or string to a timestamp.
to_utc_timestamp(timestamp, tz): Converts a timestamp in the specified timezone to UTC.
trunc(date, format): Truncates a date to the specified unit (for example "year", "month").
unix_timestamp(timestamp): Converts a timestamp to a Unix timestamp (in seconds).
weekofyear(date): Returns the week of the year from a date.
year(date): Returns the year from a date.
And here are some examples of using these functions:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, dayofmonth, month, year, next_day, last_day
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a sample DataFrame of date strings
data = [("2022-01-01",), ("2022-01-02",), ("2022-01-03",)]
df = spark.createDataFrame(data, ["date"])
# Cast the string column to a proper DateType column
df = df.withColumn("date", to_date("date"))
# Use PySpark date functions
df = df.withColumn("day_of_month", dayofmonth("date"))
df = df.withColumn("month", month("date"))
df = df.withColumn("year", year("date"))
df = df.withColumn("next_day", next_day("date", "Sunday"))
df = df.withColumn("last_day", last_day("date"))
# Show the results
df.show()
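The timestamp-oriented functions follow the same pattern. Here is a small illustrative sketch (the sample data and column names are made up for this example), continuing with the SparkSession created above:
from pyspark.sql.functions import to_timestamp, date_format, date_trunc, hour, unix_timestamp
# Sample timestamps as strings
ts_data = [("2022-01-01 08:30:45",), ("2022-01-02 17:05:10",)]
ts_df = spark.createDataFrame(ts_data, ["ts_string"])
# Parse the strings into a TimestampType column
ts_df = ts_df.withColumn("ts", to_timestamp("ts_string", "yyyy-MM-dd HH:mm:ss"))
# Format, truncate, and extract parts of the timestamp
ts_df = ts_df.withColumn("formatted", date_format("ts", "dd/MM/yyyy HH:mm"))
ts_df = ts_df.withColumn("month_start", date_trunc("month", "ts"))
ts_df = ts_df.withColumn("hour", hour("ts"))
ts_df = ts_df.withColumn("unix_seconds", unix_timestamp("ts"))
ts_df.show(truncate=False)
Note that date_trunc takes the unit first and the column second, while trunc takes the column first and the unit second; mixing the two up is a common source of errors.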