Python libraries and functions to manipulate dates and times

Python provides various libraries and functions to manipulate dates and times. Here are some common operations:

Contents

DateTime Library

The datetime library is the primary library for date and time manipulation in Python.

  • datetime.date: Represents a date (year, month, day)
  • datetime.time: Represents a time (hour, minute, second, microsecond)
  • datetime.datetime: Represents a date and time

Visual Representation

          +---------------+
          |  datetime    |
          +---------------+
                  |
                  |
                  v
+---------------+---------------+---------------+
|  date        |  time        |  datetime    |
+---------------+---------------+---------------+
|  (year,      |  (hour,     |  (year,      |
|   month, day) |   minute,   |   month, day,|
|               |   second,   |   hour, minute,|
|               |   microsecond) |   second,   |
|               |               |   microsecond) |
+---------------+---------------+---------------+

Date and Time Operations

Here are some common date and time operations:

  • Current Date and Time
datetime.datetime.now()

Returns the current date and time.

  • Create Date Object
datetime.date(year, month, day)

Creates a date object.

  • Create Time Object
datetime.time(hour, minute, second, microsecond)

Creates a time object.

  • Create DateTime Object
datetime.datetime(year, month, day, hour, minute, second, microsecond)

Creates a date and time object.

  • Date and Time Formatting
strftime()

Formats a date and time object as a string.

Example: datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
  • Date and Time Parsing
strptime()

Parses a string into a date and time object.

Example: datetime.datetime.strptime("2022-01-01 12:00:00", "%Y-%m-%d %H:%M:%S")
  • Date Arithmetic

timedelta

Represents a duration, the difference between two dates or times.

Example: datetime.datetime.now() + datetime.timedelta(days=1)
  • Timezone Handling

pytz

Provides timezone handling functionality.

Example: datetime.datetime.now(pytz.timezone("US/Eastern"))

Visual Representation

          +---------------+
          |  datetime    |
          +---------------+
                  |
                  |
                  v
+---------------+---------------+---------------+
|  now()       |  date()      |  time()      |
+---------------+---------------+---------------+
|  (current    |  (year,      |  (hour,     |
|   date and   |   month, day) |   minute,   |
|   time)      |               |   second,   |
|               |               |   microsecond) |
+---------------+---------------+---------------+
                  |
                  |
                  v
+---------------+---------------+---------------+
|  strftime()  |  strptime()  |  timedelta  |
+---------------+---------------+---------------+
|  (format     |  (parse      |  (duration  |
|   date and   |   string)    |   between   |
|   time)      |               |   dates or  |
|               |               |   times)    |
+---------------+---------------+---------------+
                  |
                  |
                  v
+---------------+---------------+
|  pytz        |
+---------------+
|  (timezone   |
|   handling)  |
+---------------+

The datetime library is the most commonly used library for date and time manipulation.

Key Functions
  • datetime.datetime.now(): Get the current date and time.
  • datetime.datetime.strptime(date_string, format): Parse a string into a datetime object.
  • datetime.datetime.strftime(format): Format a datetime object as a string.
  • datetime.timedelta(): Represent a duration, useful for date arithmetic.
  • datetime.datetime.combine(): Combine date and time into a single datetime object.

time Library

The time library provides functions for working with time in seconds since the epoch (Unix time).

Key Functions

  • time.time(): Get the current time in seconds since the epoch.
  • time.sleep(seconds): Pause the program for the specified number of seconds.

Example

import time

# Get the current time in seconds since the epoch
epoch_time = time.time()
print("Current time (epoch):", epoch_time)

# Pause the program for 5 seconds
print("Sleeping for 5 seconds...")
time.sleep(5)
print("Awake!")

calendar Library

The calendar library provides functions related to the calendar.

Key Functions

  • calendar.month(year, month): Get a string representing the month’s calendar.
  • calendar.isleap(year): Check if a year is a leap year.

Example

import calendar

# Print the calendar for July 2024
july_calendar = calendar.month(2024, 7)
print(july_calendar)

# Check if a year is a leap year
is_leap = calendar.isleap(2024)
print("Is 2024 a leap year?", is_leap)

dateutil Library

The dateutil library extends the datetime library with additional features.

Key Functions

  • dateutil.parser.parse(date_string): Parse a date string into a datetime object.
  • dateutil.relativedelta.relativedelta(): Represent relative time differences, such as months or years.

Example

from dateutil import parser, relativedelta

# Parse a date string
date_string = "24 July, 2024"
parsed_date = parser.parse(date_string)
print("Parsed date:", parsed_date)

# Add 1 month and 10 days to a date
future_date = parsed_date + relativedelta.relativedelta(months=1, days=10)
print("Future date:", future_date)

pytz Library

The pytz library allows accurate and cross-platform timezone calculations.

Key Functions

  • pytz.timezone(zone): Get a timezone object.
  • datetime.datetime.now(pytz.timezone): Get the current time in a specific timezone.

Example

from datetime import datetime
import pytz

# Get the current time in UTC
utc_time = datetime.now(pytz.utc)
print("Current UTC time:", utc_time)

# Convert UTC time to Eastern Time
eastern = pytz.timezone('US/Eastern')
eastern_time = utc_time.astimezone(eastern)
print("Eastern Time:", eastern_time)

Date and Time manipulation operations

Here are some common date and time manipulation operations:

Importing the Required Modules

from datetime import datetime, date, time, timedelta
import time as tm
import calendar

1. Getting the Current Date and Time

# Current date and time
now = datetime.now()
print("Current Date and Time:", now)

# Current date
today = date.today()
print("Current Date:", today)

# Current time
current_time = now.time()
print("Current Time:", current_time)

2. Formatting Dates and Times

# Formatting datetime
formatted_now = now.strftime("%Y-%m-%d %H:%M:%S")
print("Formatted Date and Time:", formatted_now)

# Formatting date
formatted_date = today.strftime("%d-%m-%Y")
print("Formatted Date:", formatted_date)

# Parsing date from string
parsed_date = datetime.strptime("2024-07-24", "%Y-%m-%d")
print("Parsed Date:", parsed_date)

3. Date and Time Arithmetic

# Adding and subtracting days
tomorrow = today + timedelta(days=1)
yesterday = today - timedelta(days=1)
print("Tomorrow's Date:", tomorrow)
print("Yesterday's Date:", yesterday)

# Adding hours, minutes, seconds
new_time = now + timedelta(hours=2, minutes=30)
print("New Time after adding 2 hours and 30 minutes:", new_time)

# Difference between two dates
date_diff = tomorrow - yesterday
print("Difference between Tomorrow and Yesterday:", date_diff)

4. Getting Day, Month, Year, Weekday

# Getting day, month, year
day = today.day
month = today.month
year = today.year
print("Day:", day, "Month:", month, "Year:", year)

# Getting weekday (0: Monday, 6: Sunday)
weekday = today.weekday()
print("Weekday:", weekday)

# Getting ISO weekday (1: Monday, 7: Sunday)
iso_weekday = today.isoweekday()
print("ISO Weekday:", iso_weekday)

5. Time Operations

# Current time in seconds since the epoch
epoch_time = tm.time()
print("Current Time in seconds since the epoch:", epoch_time)

# Sleep for a few seconds
print("Sleeping for 2 seconds...")
tm.sleep(2)
print("Awake now!")

6. Working with Calendars

# Printing a month's calendar
print(calendar.month(2024, 7))

# Printing a year's calendar
print(calendar.calendar(2024))

# Checking if a year is a leap year
is_leap = calendar.isleap(2024)
print("Is 2024 a leap year?", is_leap)

7. Creating and Manipulating Time Objects

# Creating a time object
t = time(14, 30, 45)  # 14:30:45
print("Time Object:", t)

# Accessing hour, minute, second
print("Hour:", t.hour)
print("Minute:", t.minute)
print("Second:", t.second)

8. Combining Date and Time

# Combining date and time into a datetime object
combined = datetime.combine(today, t)
print("Combined Date and Time:", combined)

Example: Working with Timezones

To handle time zones, you can use the pytz library.

import pytz

# Get the current time in UTC
utc_now = datetime.now(pytz.utc)
print("Current UTC Time:", utc_now)

# Convert UTC time to a different timezone
eastern = pytz.timezone('US/Eastern')
eastern_time = utc_now.astimezone(eastern)
print("Eastern Time:", eastern_time)

Above we discussed about Python Date and Time specific libraries. But do the most used Pandas have date functions?

Yes, pandas has its own set of date manipulation functions, which are built on top of the Python datetime module. These functions provide additional functionality and convenience for working with dates in pandas DataFrames.

Some key pandas date manipulation functions include:

Date parsing and formatting

  • pd.to_datetime(): Convert strings to datetime objects
  • pd.to_date(): Convert datetime objects to dates
  • dt.strftime(): Format datetime objects as strings

Date arithmetic

  • dt.add(): Add timedeltas to datetime objects
  • dt.subtract(): Subtract timedeltas from datetime objects
  • dt.floor(): Round down to the nearest time unit (e.g., minute, hour, day)
  • dt.ceil(): Round up to the nearest time unit

Date extraction

  • dt.year: Extract the year from a datetime object
  • dt.month: Extract the month from a datetime object
  • dt.day: Extract the day from a datetime object
  • dt.hour: Extract the hour from a datetime object
  • dt.minute: Extract the minute from a datetime object
  • dt.second: Extract the second from a datetime object

Time delta functions

  • pd.Timedelta(): Create a timedelta object
  • dt.total_seconds(): Convert a timedelta to seconds

Resampling and frequency functions

  • resample(): Resample data by time frequency
  • asfreq(): Convert data to a specified frequency

Holiday and business day functions

  • pd.holiday(): Create a holiday calendar
  • pd.BusinessDay(): Create a business day calendar
  • pd.DateOffset(): Create a date offset object

pandas also provides several date-related indexers and selectors, such as:

  • pd.date_range(): Create a date range index
  • pd.period_range(): Create a period range index
  • pd.DatetimeIndex(): Create a datetime index

These functions provide a more convenient and efficient way to work with dates in pandas DataFrames compared to using the Python datetime module alone.

Example:

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'date': ['2022-01-01', '2022-01-15', '2022-02-01']})
df['date'] = pd.to_datetime(df['date'])

# Extract the year and month
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# Add 3 days to the date
df['new_date'] = df['date'] + pd.Timedelta(days=3)

print(df)

Output:

        date  year  month   new_date
0 2022-01-01  2022      1 2022-01-04
1 2022-01-15  2022      1 2022-01-18
2 2022-02-01  2022      2 2022-02-04

here comes another question:- pd.Timedelta(days=3) is pandas inbuilt or do we have to call datetime?

so in above case, pd.Timedelta(days=3) is a pandas built-in function.

pd.Timedelta is a pandas-specific class that represents a duration, which is a difference between two dates or times.

While datetime.timedelta is a Python built-in class from the datetime module.

Both pd.Timedelta and datetime.timedelta can be used to represent time durations, but they have some differences:

pandas pd.Timedelta:

  • Designed specifically for pandas date and time operations.
  • Supports pandas-specific frequency codes (e.g., ‘D’, ‘B’, ‘H’, ‘T’, ‘S’).
  • Integrates well with pandas DataFrames and Series.

Python datetime.timedelta:

  • Part of the Python datetime module.
  • More general-purpose, can be used outside of pandas.
  • Does not support pandas-specific frequency codes.

You can use either pd.Timedelta or datetime.timedelta depending on your specific use case:

import pandas as pd
import datetime

# pandas Timedelta
pd_timedelta = pd.Timedelta(days=3)
print(type(pd_timedelta))  # <class 'pandas._libs.tslibs.timedeltas.Timedelta'>

# datetime timedelta
dt_timedelta = datetime.timedelta(days=3)
print(type(dt_timedelta))  # <class 'datetime.timedelta'>

In the example you provided, pd.Timedelta(days=3) is used to add 3 days to the date column in a pandas DataFrame.

You can achieve the same result using datetime.timedelta:

import pandas as pd
import datetime

df = pd.DataFrame({'date': ['2022-01-01']})
df['date'] = pd.to_datetime(df['date'])
df['new_date'] = df['date'] + datetime.timedelta(days=3)
print(df)

Interesting Use Cases

  1. Automated Reports: Use datetime to schedule daily, weekly, or monthly reports.
  2. Reminder Systems: Use time.sleep() to create a simple reminder system that notifies the user at specified intervals.
  3. Timezone Conversions: Use pytz to handle internationalization of applications by converting times between different timezones.
  4. Scheduling Tasks: Combine datetime and dateutil.relativedelta to schedule tasks to run at specific intervals, like every first Monday of the month.
  5. Leap Year Calculations: Use calendar.isleap() to handle leap year specific logic in date calculations.
Example: Automating a Daily Task
import datetime
import time

def daily_task():
    print("Performing daily task...")

# Schedule the task to run at 8:00 AM every day
target_time = datetime.time(hour=8, minute=0, second=0)

while True:
    now = datetime.datetime.now()
    current_time = now.time()

    if current_time >= target_time:
        daily_task()
        # Sleep until the same time next day
        time_to_sleep = 24*60*60 - (now - now.replace(hour=8, minute=0, second=0)).seconds
    else:
        # Sleep until target time today
        time_to_sleep = (datetime.datetime.combine(now.date(), target_time) - now).seconds

    time.sleep(time_to_sleep)

This example demonstrates how to schedule a daily task to run at 8:00 AM every day using the datetime and time libraries.

A example project:-

  1. Generate Dynamic Variable Names for the Next 24 Months
  2. Store the Variable Names in a Dictionary
  3. Use the Dictionary to Create CSV Files
  4. Create Arrays Based on the Dictionary
  5. Generate Custom Excel Files
  6. Define Functions for Common Data Quality Tests and Reporting

Here is a step-by-step implementation in Python:

Step 1: Generate Dynamic Variable Names for the Next 24 Months

We will generate dynamic variable names based on the current date and store them in a dictionary.

import pandas as pd
from datetime import datetime, timedelta
import os

# Function to generate the next 24 months
def generate_months():
    current_date = datetime.now()
    months = {}
    for i in range(24):
        month_year = current_date.strftime("%Y%m")
        months[f'Month_{month_year}'] = f'{current_date.year}_{current_date.month:02d}'
        current_date += timedelta(days=30)
    return months

# Generate the months
months_dict = generate_months()
print(months_dict)

Step 2: Store the Variable Names in a Dictionary

We store the generated month variable names in a dictionary for further use.

Step 3: Use the Dictionary to Create CSV Files

We will create CSV files based on the dictionary. For simplicity, let’s assume we have a sample DataFrame that we filter by year and month to create CSV files.

# Sample DataFrame
data = {
    'date': pd.date_range(start='2023-01-01', periods=365, freq='D'),
    'value': range(365)
}
df = pd.DataFrame(data)

# Function to create CSV files for each month
def create_csv_files(df, months_dict):
    for month_name, ym in months_dict.items():
        year, month = ym.split('_')
        filtered_df = df[(df['date'].dt.year == int(year)) & (df['date'].dt.month == int(month))]
        filtered_df.to_csv(f'{month_name}.csv', index=False)

# Create CSV files
create_csv_files(df, months_dict)

Step 4: Create Arrays Based on the Dictionary

We can create arrays based on the dictionary. This step is often specific to the use case but here’s a basic example.

# Function to create arrays
def create_arrays(months_dict):
    arrays = {}
    for month_name in months_dict.keys():
        arrays[month_name] = []
    return arrays

# Create arrays
arrays_dict = create_arrays(months_dict)
print(arrays_dict)

Step 5: Generate Custom Excel Files

We will generate custom Excel files with dynamic column names based on the dictionary.

from openpyxl import Workbook

# Function to create Excel files
def create_excel_files(months_dict):
    for month_name in months_dict.keys():
        wb = Workbook()
        ws = wb.active
        for col in range(1, 13):
            col_name = f'xyz_{months_dict[month_name]}_{col:02d}'
            ws.cell(row=1, column=col, value=col_name)
        wb.save(f'{month_name}.xlsx')

# Create Excel files
create_excel_files(months_dict)

Step 6: Define Functions for Common Data Quality Tests and Reporting

We will define a function for common data quality tests on a DataFrame and generate a report.

# Function for data quality tests
def data_quality_tests(df):
    results = {
        'missing_values': df.isnull().sum(),
        'duplicate_rows': df.duplicated().sum(),
        'summary_statistics': df.describe()
    }
    return results

# Function to generate DQ report
def generate_dq_report(df, months_dict):
    for month_name, ym in months_dict.items():
        year, month = ym.split('_')
        filtered_df = df[(df['date'].dt.year == int(year)) & (df['date'].dt.month == int(month))]
        dq_results = data_quality_tests(filtered_df)
        with open(f'{month_name}_dq_report.txt', 'w') as f:
            for key, value in dq_results.items():
                f.write(f'{key}:\n{value}\n\n')

# Generate DQ report
generate_dq_report(df, months_dict)

In this implementation:

  • We generate dynamic variable names for the next 24 months and store them in a dictionary.
  • We create CSV and Excel files based on the dictionary.
  • We define functions for common data quality tests and reporting.

This approach can be expanded and customized based on specific requirements for data analysis and reporting.

Another Intersting Coding example:-


Concept Building Examples:–

To dynamically generate a list of month-year combinations starting from the current month and going either forward or backward for the next/previous 12 months:-

To dynamically generate a list of month-year combinations starting from the current month and going either forward or backward for the next/previous 12 months, you can use Python’s datetime module. Here’s an example where the script dynamically calculates the months and years, stores them in a dictionary, and can be accessed in a similar way to SAS macros.

from datetime import datetime, timedelta
import calendar

# Function to generate dynamic months and years
def generate_month_years(start_date, months_to_generate=12, direction='forward'):
    months_years = {}
    current_date = start_date

    # Generate month-year combinations based on the direction (forward/backward)
    for i in range(1, months_to_generate + 1):
        month_year_key = f"month_year_{i}"
        month_year_value = f"{calendar.month_abbr[current_date.month]}{current_date.year}"
        months_years[month_year_key] = month_year_value
        
        # Move to next/previous month
        if direction == 'forward':
            current_date = (current_date.replace(day=1) + timedelta(days=32)).replace(day=1)  # Move to next month
        elif direction == 'backward':
            current_date = (current_date.replace(day=1) - timedelta(days=1)).replace(day=1)  # Move to previous month

    return months_years

# Get current date
start_date = datetime.now()

# Example: Generate next 12 months starting from the current month
months_years_forward = generate_month_years(start_date, months_to_generate=12, direction='forward')

# Example: Generate last 12 months
months_years_backward = generate_month_years(start_date, months_to_generate=12, direction='backward')

# Print forward months-year combinations
print("Next 12 months:")
for key, value in months_years_forward.items():
    print(f"{key}: {value}")

# Print backward months-year combinations
print("\nLast 12 months:")
for key, value in months_years_backward.items():
    print(f"{key}: {value}")
Next 12 months:
month_year_1: Sep2024
month_year_2: Oct2024
month_year_3: Nov2024
month_year_4: Dec2024
month_year_5: Jan2025
month_year_6: Feb2025
month_year_7: Mar2025
month_year_8: Apr2025
month_year_9: May2025
month_year_10: Jun2025
month_year_11: Jul2025
month_year_12: Aug2025

Last 12 months:
month_year_1: Sep2024
month_year_2: Aug2024
month_year_3: Jul2024
month_year_4: Jun2024
month_year_5: May2024
month_year_6: Apr2024
month_year_7: Mar2024
month_year_8: Feb2024
month_year_9: Jan2024
month_year_10: Dec2023
month_year_11: Nov2023
month_year_12: Oct2023

Project2:-

VarsMonth_periodMon_year
xyplus1202409Sep2024
xyplus2202410Oct2024
xyplus3202411Nov2024
xyplus4202412Dec2024
xyplus5202501Jan2025
xyplus6202502Feb2025
xyplus7202503Mar2025
xyplus8202504Apr2025
xyplus9202505May2025
xyplus10202506Jun2025
xyplus11202507Jul2025
xyplus12202508Aug2025
xyminus1202409Sep2024
xyminus2202408Aug2024
xyminus3202407Jul2024
xyminus4202406Jun2024
xyminus5202405May2024
xyminus6202404Apr2024
xyminus7202403Mar2024
xyminus8202402Feb2024
xyminus9202401Jan2024
xyminus10202312Dec2023
xyminus11202311Nov2023
xyminus12202310Oct2023
Aim is to create a dictionary that will hold all these vars and values for later use in a python script. xyplus1 will be based on the date value that is current or supplied by user and other vars will dynamically follow it. example uses of this dictionary, suppose i am reading a hive table which has mnth var which should be in next 12 months range where mnth is a string of values as ‘202409’ to ‘202308’
from datetime import datetime, timedelta
import calendar

# Function to generate month variables
def generate_date_vars(start_date=None, months_to_generate=12):
    if start_date is None:
        start_date = datetime.now()  # Default to current date
    
    # Prepare the dictionary to store the generated variables
    date_vars = {}

    # Loop to generate month-year values dynamically
    current_date = start_date
    for i in range(1, months_to_generate + 1):
        # Generate the key based on the index
        key = f"xyplus{i}"

        # Get the yyyyMM format
        yyyymm = current_date.strftime("%Y%m")
        
        # Get the readable format like "Sep2024"
        month_abbr = f"{calendar.month_abbr[current_date.month]}{current_date.year}"

        # Add these to the dictionary
        date_vars[key] = {"yyyymm": yyyymm, "readable": month_abbr}
        
        # Move to the next month
        current_date = (current_date.replace(day=1) + timedelta(days=32)).replace(day=1)

    return date_vars

# Example: Get date variables starting from the current date
date_variables = generate_date_vars()

# Display the dictionary
for key, value in date_variables.items():
    print(f"{key}: {value['yyyymm']}, {value['readable']}")
# Example usage of the dictionary in a Hive query
mnth_values = [val['yyyymm'] for val in date_variables.values()]
mnth_str = ', '.join([f"'{mnth}'" for mnth in mnth_values])

hive_query = f"SELECT * FROM my_table WHERE mnth IN ({mnth_str})"
print(hive_query)
SELECT * FROM my_table WHERE mnth IN ('202409', '202410', '202411', '202412', '202501', '202502', ...)

month_abbr = f”{calendar.month_abbr[current_date.month]}{current_date.year}” is only way to Get the readable format like “Sep2024”

In Python, using calendar.month_abbr is one way to get the abbreviated month name (like “Sep”). However, there are other ways to generate a readable format like “Sep2024”. Here are some alternatives:

1. Using strftime

You can use Python’s strftime() method, which gives you more flexibility and is a common way to format dates.

pythonCopy code# Using strftime to generate the format "Sep2024"
month_abbr = current_date.strftime("%b%Y")
  • %b: Abbreviated month name (e.g., “Sep”).
  • %Y: Year in four digits (e.g., “2024”).

This is a cleaner and often more flexible solution than using the calendar module.

2. Using locale for International Month Names

If you want to generate month names in a specific language (locale), you can use the locale module in conjunction with strftime.

import locale

# Set locale for international month abbreviations (example: German)
locale.setlocale(locale.LC_TIME, 'de_DE') # German locale

# Use strftime for localized month abbreviation and year
month_abbr = current_date.strftime("%b%Y") # For German: "Sep2024" will become "Sep2024"

3. Full Example Integrating with the Original Code:

Here’s how you could use strftime to generate both the yyyymm and readable Sep2024 formats:

from datetime import datetime, timedelta

# Function to generate month variables
def generate_date_vars(start_date=None, months_to_generate=12):
    if start_date is None:
        start_date = datetime.now()  # Default to current date
    
    # Prepare the dictionary to store the generated variables
    date_vars = {}

    # Loop to generate month-year values dynamically
    current_date = start_date
    for i in range(1, months_to_generate + 1):
        # Generate the key based on the index
        key = f"xyplus{i}"

        # Get the yyyyMM format
        yyyymm = current_date.strftime("%Y%m")
        
        # Get the readable format like "Sep2024" using strftime
        month_abbr = current_date.strftime("%b%Y")

        # Add these to the dictionary
        date_vars[key] = {"yyyymm": yyyymm, "readable": month_abbr}
        
        # Move to the next month
        current_date = (current_date.replace(day=1) + timedelta(days=32)).replace(day=1)

    return date_vars

# Example: Get date variables starting from the current date
date_variables = generate_date_vars()

# Display the dictionary
for key, value in date_variables.items():
    print(f"{key}: {value['yyyymm']}, {value['readable']}")

Output Example:

xyplus1: 202409, Sep2024
xyplus2: 202410, Oct2024
xyplus3: 202411, Nov2024
xyplus4: 202412, Dec2024
xyplus5: 202501, Jan2025
xyplus6: 202502, Feb2025
...
Let’s break down the line:
current_date = (current_date.replace(day=1) + timedelta(days=32)).replace(day=1)

Purpose:

This line is used to move to the next month while ensuring the day is set to the 1st of the month. It helps in scenarios where we want to jump to the next month, irrespective of the number of days in the current month (some months have 28, 30, or 31 days).

Detailed Breakdown:

  1. current_date.replace(day=1):
    • This part ensures that the day of the current_date is set to the 1st of the month.
    • Example: If current_date is 2024-09-15, this changes it to 2024-09-01.
    • This is important because we want to safely add days without accidentally rolling over into the wrong month.
  2. timedelta(days=32):
    • timedelta(days=32) adds 32 days to the current date.
    • Why 32? Since no month has more than 31 days, adding 32 days ensures we always jump to the next month, regardless of the length of the current month.
      • For example:
        • If the date is 2024-09-01, adding 32 days will move us to 2024-10-03, which is in October.
        • If the date is 2024-02-01 (February), adding 32 days will safely jump us into March.
  3. (...).replace(day=1):
    • After adding 32 days, we use .replace(day=1) again to reset the day to the 1st of the month.
    • This ensures that, regardless of the current date, we always move to the first day of the next month.
    • Example: If adding 32 days results in 2024-10-03, this final step will change it to 2024-10-01.

Why is this done?

  • The goal is to move to the next month, no matter how many days are in the current month (28, 30, or 31 days).
  • Using 32 days ensures we don’t miss the next month, and resetting the day to 1 (.replace(day=1)) ensures we always land on the first day of that month.

Example Walkthrough:

  • Starting Date: 2024-09-15Step-by-step:
    1. current_date.replace(day=1)2024-09-01
    2. + timedelta(days=32)2024-10-03 (moved 32 days forward)
    3. .replace(day=1)2024-10-01 (reset to the 1st day of the month)

So after the line executes, you will have moved to the first day of the next month, which in this example is October 1st, 2024.

Why not just add a month?

Python’s datetime module does not directly support adding months due to the varying number of days in each month. Using this method, however, allows us to reliably move to the next month without worrying about how many days the current month has.

suppose there is sales table we want to quickly find the avg of three months sales values ,8month back we need to get this avg. xyminus9, xyminus10, xyminus12 will be the monthperiod to consider . Now sales table has a field sales_date with values ’13-05-2024′ such -date as string, so how to work this out with help of our dictionary

To achieve this, we can use the dynamic month variables stored in the dictionary and filter the sales data based on those month periods (xyminus9, xyminus10, xyminus12). Here’s a breakdown of how you can accomplish this:

Steps:

  1. Create a dynamic dictionary that holds month and year combinations like xyminus9, xyminus10, etc.
  2. Extract month and year from the sales_date (which is stored as a string in the format 'DD-MM-YYYY').
  3. Filter the sales table based on the month periods you want (xyminus9, xyminus10, xyminus12).
  4. Calculate the average sales for these filtered records.

Implementation:

  1. Generate the dictionary (similar to what we did earlier): We’ll generate variables like xyminus1, xyminus2, and so on, holding the yyyymm format.
  2. Query the sales table using the dictionary values for the months you want (xyminus9, xyminus10, xyminus12).

Here’s a Python example to demonstrate:

from datetime import datetime, timedelta

# Function to generate month variables for the past months
def generate_past_date_vars(start_date=None, months_to_generate=12):
    if start_date is None:
        start_date = datetime.now()  # Default to current date
    
    # Prepare the dictionary to store the generated variables
    date_vars = {}

    # Loop to generate past month-year values dynamically
    current_date = start_date
    for i in range(1, months_to_generate + 1):
        # Generate the key based on the index (negative to represent past)
        key = f"xyminus{i}"

        # Get the yyyyMM format
        yyyymm = current_date.strftime("%Y%m")

        # Add to the dictionary
        date_vars[key] = yyyymm
        
        # Move to the previous month
        current_date = (current_date.replace(day=1) - timedelta(days=1)).replace(day=1)

    return date_vars

# Example: Generate last 12 months' date variables starting from today
past_date_vars = generate_past_date_vars()

# Now the dictionary will contain values like:
# xyminus1: '202409', xyminus2: '202408', ..., xyminus9: '202401'

# Print the months of interest
print("Months of interest:")
print(past_date_vars['xyminus9'], past_date_vars['xyminus10'], past_date_vars['xyminus12'])

Sample Output for the Dictionary:

If the current date is September 2024, you would get:

vbnetCopy codexyminus9: '202401'
xyminus10: '202312'
xyminus12: '202310'

Now, to filter your sales table:

Assuming you have a sales table with a sales_date field formatted as DD-MM-YYYY, you need to:

  1. Convert the sales_date into yyyymm format.
  2. Filter the records for xyminus9, xyminus10, and xyminus12.
  3. Calculate the average of the sales.

Here’s an example using pandas:

import pandas as pd

# Sample sales data (assuming 'sales_date' is a string in 'DD-MM-YYYY' format)
data = {'sales_date': ['13-05-2024', '20-12-2023', '15-01-2024', '01-10-2023'],
        'sales_amount': [1000, 1500, 1200, 1800]}
sales_df = pd.DataFrame(data)

# Convert 'sales_date' to datetime format and then extract 'yyyymm'
sales_df['sales_date'] = pd.to_datetime(sales_df['sales_date'], format='%d-%m-%Y')
sales_df['yyyymm'] = sales_df['sales_date'].dt.strftime('%Y%m')

# Define the months of interest (xyminus9, xyminus10, xyminus12)
months_of_interest = [past_date_vars['xyminus9'], past_date_vars['xyminus10'], past_date_vars['xyminus12']]

# Filter the sales data for these months
filtered_sales = sales_df[sales_df['yyyymm'].isin(months_of_interest)]

# Calculate the average sales for the filtered months
average_sales = filtered_sales['sales_amount'].mean()

# Display the result
print("Average sales for the months of interest:", average_sales)

Explanation:

  1. Convert sales_date to yyyymm format: This allows you to match the sales data with the months in your dictionary.
  2. Filter the sales_df DataFrame: Using the isin() function, filter rows where the yyyymm value matches xyminus9, xyminus10, or xyminus12.
  3. Calculate the average: Use the .mean() function to compute the average sales amount for the selected rows.

Expected Output:

If the sales data has matching records for the months xyminus9, xyminus10, and xyminus12, you will get the average sales value for those months.

Conclusion:

  • You can easily use the dictionary (past_date_vars) to dynamically filter the relevant sales data.
  • The process involves converting sales_date to a comparable format (yyyymm), filtering based on the desired months, and calculating the average.

here are a few additional methods we can consider for dynamically working with dates and filtering sales data.

1. Using SQL Query with a Hive Table

If you are using Hive for querying your sales data, you can create a similar solution directly in SQL. Here’s an example of how you can filter your Hive table for specific months dynamically:

Steps:

  1. Create a dynamic date range using Python or SQL.
  2. Filter Hive table using yyyymm format in SQL.
  3. Aggregate and calculate average sales within the SQL query.

SQL Example:

Assume the sales table has a sales_date stored as a string in the format 'DD-MM-YYYY':

WITH sales_data AS (
    SELECT
        sales_date,
        sales_amount,
        from_unixtime(unix_timestamp(sales_date, 'dd-MM-yyyy'), 'yyyyMM') AS yyyymm
    FROM sales_table
)
SELECT
    AVG(sales_amount) AS avg_sales
FROM
    sales_data
WHERE
    yyyymm IN ('202401', '202312', '202310');  -- These are the months xyminus9, xyminus10, and xyminus12

Explanation:

  • from_unixtime(): Converts your string sales_date to a Unix timestamp, then formats it as yyyymm.
  • WHERE clause: Filters based on the dynamic month values (xyminus9, xyminus10, etc.), which can be passed as parameters.
  • AVG(): Aggregates and calculates the average sales.

This approach offloads the computation to Hive, making it more efficient for large datasets.


2. Using Python’s dateutil.relativedelta for Date Arithmetic

If you prefer using a library that handles date arithmetic in a more readable way, the dateutil library is a great alternative to manage relative date calculations.

from datetime import datetime
from dateutil.relativedelta import relativedelta
import pandas as pd

# Initialize the current date
current_date = datetime.now()

# Function to generate previous months dynamically using relativedelta
def get_previous_months(num_months):
    months = {}
    for i in range(1, num_months + 1):
        month_date = current_date - relativedelta(months=i)
        months[f"xyminus{i}"] = month_date.strftime("%Y%m")
    return months

# Generate the last 12 months' yyyymm
past_date_vars = get_previous_months(12)
print(past_date_vars)

# Example to filter sales data using pandas DataFrame
sales_data = {'sales_date': ['13-05-2024', '15-01-2024', '01-10-2023'],
              'sales_amount': [1000, 1200, 1800]}
sales_df = pd.DataFrame(sales_data)

# Convert 'sales_date' to yyyymm
sales_df['sales_date'] = pd.to_datetime(sales_df['sales_date'], format='%d-%m-%Y')
sales_df['yyyymm'] = sales_df['sales_date'].dt.strftime('%Y%m')

# Filter for xyminus9, xyminus10, xyminus12
months_of_interest = [past_date_vars['xyminus9'], past_date_vars['xyminus10'], past_date_vars['xyminus12']]
filtered_sales = sales_df[sales_df['yyyymm'].isin(months_of_interest)]

# Calculate average sales
average_sales = filtered_sales['sales_amount'].mean()
print("Average Sales:", average_sales)

Key Benefits:

  • relativedelta is more intuitive for manipulating dates by months, years, etc.
  • It allows you to easily handle dynamic month generation and is especially useful for more complex date calculations.

3. Using Hive Window Functions

If you want to compute rolling averages or other aggregations directly in your Hive table, you can use window functions. This is particularly useful when you need to compute averages over a sliding window of months.

Example SQL Query:

WITH sales_data AS (
    SELECT
        sales_date,
        sales_amount,
        from_unixtime(unix_timestamp(sales_date, 'dd-MM-yyyy'), 'yyyyMM') AS yyyymm
    FROM sales_table
)
SELECT
    yyyymm,
    sales_amount,
    AVG(sales_amount) OVER (PARTITION BY sales_table ORDER BY yyyymm ROWS BETWEEN 9 PRECEDING AND 9 FOLLOWING) AS avg_sales
FROM sales_data
WHERE
    yyyymm IN ('202401', '202312', '202310');

Explanation:

  • Window function (OVER): Calculates a rolling average over a window of months.
  • This is useful if you need to calculate averages not only for a fixed set of months but for a rolling period.

4. Using PySpark for Large-Scale Data Processing

For large datasets, PySpark can be a great option for distributed data processing. You can easily handle the dynamic date generation using Python, and then apply transformations on your sales data.

Here’s a simplified PySpark example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format, col

# Create Spark session
spark = SparkSession.builder.appName("sales_avg").getOrCreate()

# Sample sales data
data = [("13-05-2024", 1000), ("20-12-2023", 1500), ("15-01-2024", 1200), ("01-10-2023", 1800)]
columns = ["sales_date", "sales_amount"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Convert sales_date to proper date type and extract yyyymm
df = df.withColumn("sales_date", date_format("sales_date", "dd-MM-yyyy")) \
       .withColumn("yyyymm", date_format("sales_date", "yyyyMM"))

# Define the months of interest
months_of_interest = ['202401', '202312', '202310']

# Filter based on dynamic months
df_filtered = df.filter(col("yyyymm").isin(months_of_interest))

# Calculate average sales
df_filtered.groupBy().avg("sales_amount").show()

Benefits of Using PySpark:

  • Scalable: Can process very large datasets.
  • Distributed: Leverages the power of distributed computing for efficient filtering and aggregations.

5. Using Pandas with Date Ranges

You can leverage Pandas to create a range of dates and filter your sales data using these ranges.

import pandas as pd

# Define the range of dates for the last 12 months
today = pd.to_datetime("today")
date_range = pd.date_range(end=today, periods=12, freq='M').strftime('%Y%m')

# Example sales DataFrame
sales_data = {'sales_date': ['13-05-2024', '20-12-2023', '15-01-2024', '01-10-2023'],
              'sales_amount': [1000, 1500, 1200, 1800]}
sales_df = pd.DataFrame(sales_data)

# Convert sales_date to yyyymm
sales_df['sales_date'] = pd.to_datetime(sales_df['sales_date'], format='%d-%m-%Y')
sales_df['yyyymm'] = sales_df['sales_date'].dt.strftime('%Y%m')

# Filter the sales data for the last 12 months
filtered_sales = sales_df[sales_df['yyyymm'].isin(date_range)]

# Calculate average sales
average_sales = filtered_sales['sales_amount'].mean()
print("Average Sales:", average_sales)

Conclusion:

Depending on your specific environment and dataset size, you can choose:

  • SQL or Hive queries if you want to compute averages directly on the server.
  • Pandas or PySpark if you’re working with Python and handling data in-memory or in a distributed fashion.
  • relativedelta and date arithmetic libraries to simplify dynamic month generation.

Discover more from AI HintsToday

Subscribe to get the latest posts sent to your email.

Leave a Reply

Your email address will not be published. Required fields are marked *

Latest Entries:-

  • Data Engineering Job Interview Questions :- Datawarehouse Terms
  • Oracle Query Execution phases- How query flows?
  • Pyspark -Introduction, Components, Compared With Hadoop
  • PySpark Architecture- (Driver- Executor) , Web Interface
  • Memory Management through Hadoop Traditional map reduce vs Pyspark- explained with example of Complex data pipeline used for Both used
  • Example Spark submit command used in very complex etl Jobs
  • Deploying a PySpark job- Explain Various Methods and Processes Involved
  • What is Hive?
  • In How many ways pyspark script can be executed? Detailed explanation
  • DAG Scheduler in Spark: Detailed Explanation, How it is involved at architecture Level
  • CPU Cores, executors, executor memory in pyspark- Expalin Memory Management in Pyspark
  • Pyspark- Jobs , Stages and Tasks explained
  • A DAG Stage in Pyspark is divided into tasks based on the partitions of the data. How these partitions are decided?
  • Apache Spark- Partitioning and Shuffling
  • Discuss Spark Data Types, Spark Schemas- How Sparks infers Schema?
  • String Data Manipulation and Data Cleaning in Pyspark

Discover more from AI HintsToday

Subscribe now to keep reading and get access to the full archive.

Continue reading