To generate a dynamic list of the next and previous 12 months in yyyyMM format

you can generate a dynamic list of the next and previous 12 months in yyyyMM format using PySpark SQL functions like date_format() and date_add(). PySpark provides built-in functions that make it easier to manipulate dates directly in SQL queries without needing to rely on Python’s datetime module or manually handling date calculations.

Here’s how you can do it purely using PySpark SQL functions:

Example Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date, expr

# Initialize Spark session
spark = SparkSession.builder 
    .appName("Generate YearMonth List using SQL with .format()") 
    .getOrCreate()

# Create a DataFrame with a single row and the current date
df = spark.createDataFrame([(1,)], ["id"]).withColumn("current_date", current_date())

# Generate SQL expressions for the next 12 months and the previous 12 months using .format()
# We are using a range from -12 to +12 (last 12 months, current month, and next 12 months)
exprs = []
for i in range(-12, 13):
    # Using .format() to create dynamic SQL expressions
    exprs.append("date_format(date_add(current_date, {}), 'yyyyMM') as month_{}".format(i * 30, i))

# Select these dynamically generated columns in the PySpark DataFrame
months_df = df.selectExpr(*exprs)

# Show the DataFrame with all the generated months
months_df.show(truncate=False)

Explanation:

  1. current_date(): Generates the current date. This is a PySpark function that automatically gets the current system date.
  2. date_add(current_date, {i * 30}): Adds or subtracts 30 days (approximating 1 month) to the current date. We use a range from -12 to 12 to get the past and future months.
  3. date_format(): Converts the resulting date to the yyyyMM format.
  4. selectExpr(): Dynamically selects columns using SQL expressions for each month in the range from -12 to +12, creating column names like month_-12, month_0, month_12.

Output Example:

+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
|month_-12|month_-11|month_-10|month_-9 |month_-8 |month_-7 |month_-6 |month_-5 |month_-4 |month_-3 |month_-2 |month_-1 |month_0 |month_1 |month_2 |month_3 |month_4 |month_5 |month_6 |month_7 |month_8 |month_9 |month_10 |month_11 |month_12 |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
|202309 |202310 |202311 |202312 |202401 |202402 |202403 |202404 |202405 |202406 |202407 |202408 |202409 |202410 |202411 |202412 |202501 |202502 |202503 |202504 |202505 |202506 |202507 |202508 |202509 |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+

Explanation of Output:

  • month_-12 to month_12: These columns represent the past 12 months, the current month, and the next 12 months in yyyyMM format.

How to Access Individual Months:

You can select specific months or access them programmatically by collecting the results:

# Collect results into a Python list
months_list = months_df.collect()[0]

# Example: Accessing the current month and next month
current_month = months_list["month_0"]
next_month = months_list["month_1"]

print("Current Month:", current_month)
print("Next Month:", next_month)

Storing Results in Variables or DataFrame:

If you want to store the results as separate variables or use them for filtering, you can extract the values from the DataFrame as shown above and use them as needed.

Filtering Example:

If you want to filter your data using these month values, you can convert the months into a filter condition:

# Example: Filter another DataFrame by one of the generated months
data_df.filter(data_df["year_month_column"] == months_list["month_0"]).show()

Advantages of Using PySpark SQL Functions:

  • Efficiency: PySpark’s built-in functions like date_add() and date_format() are optimized and can handle large datasets efficiently in a distributed environment.
  • Compatibility: This method ensures compatibility with older and newer versions of PySpark since SQL expressions and functions like date_format() have been available for a long time.

Above Task can also be done by combining python Date Functionality.

from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Initialize Spark session
spark = SparkSession.builder 
    .appName("Generate YearMonth List") 
    .getOrCreate()

# Get the current date
current_date = datetime.now()

# Function to generate year-month (yyyyMM) format for a given date
def generate_year_month(date):
    return date.strftime('%Y%m')

# Create a list to store the year-month values
months_list = []

# Generate the next 12 months (including the current month)
for i in range(0, 13):
    future_date = current_date + timedelta(days=30 * i)  # Approximate to 30 days per month
    months_list.append(generate_year_month(future_date))

# Generate the last 12 months
for i in range(1, 13):
    past_date = current_date - timedelta(days=30 * i)  # Approximate to 30 days per month
    months_list.append(generate_year_month(past_date))

# Convert the list into a DataFrame for further operations in PySpark
months_rdd = spark.sparkContext.parallelize(months_list)
months_df = spark.createDataFrame(months_rdd.map(lambda x: Row(month=x)))

# Show the DataFrame
months_df.show()

# Example of accessing specific months in the list
print("Next month: ", months_list[1])
print("Previous month: ", months_list[-1])

Pages: 1 2 3 4 5

Pages ( 5 of 5 ): « Previous1 34 5

Discover more from AI HitsToday

Subscribe to get the latest posts sent to your email.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

About the HintsToday

AI HintsToday is One Stop Adda to learn All about AI, Data, ML, Stat Learning, SAS, SQL, Python, Pyspark. AHT is Future!

Explore the Posts

Latest Comments

Latest posts

Discover more from AI HitsToday

Subscribe now to keep reading and get access to the full archive.

Continue reading