Python provides various libraries and functions to manipulate dates and times. Here are some common operations:
DateTime Library
The datetime
library is the primary library for date and time manipulation in Python.
- datetime.date: Represents a date (year, month, day)
- datetime.time: Represents a time (hour, minute, second, microsecond)
- datetime.datetime: Represents a date and time
Visual Representation
+---------------+
| datetime |
+---------------+
|
|
v
+---------------+---------------+---------------+
| date | time | datetime |
+---------------+---------------+---------------+
| (year, | (hour, | (year, |
| month, day) | minute, | month, day,|
| | second, | hour, minute,|
| | microsecond) | second, |
| | | microsecond) |
+---------------+---------------+---------------+
Date and Time Operations
Here are some common date and time operations:
- Current Date and Time
datetime.datetime.now()
Returns the current date and time.
- Create Date Object
datetime.date(year, month, day)
Creates a date object.
- Create Time Object
datetime.time(hour, minute, second, microsecond)
Creates a time object.
- Create DateTime Object
datetime.datetime(year, month, day, hour, minute, second, microsecond)
Creates a date and time object.
- Date and Time Formatting
strftime()
Formats a date and time object as a string.
Example: datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
- Date and Time Parsing
strptime()
Parses a string into a date and time object.
Example: datetime.datetime.strptime("2022-01-01 12:00:00", "%Y-%m-%d %H:%M:%S")
- Date Arithmetic
timedelta
Represents a duration, the difference between two dates or times.
Example: datetime.datetime.now() + datetime.timedelta(days=1)
- Timezone Handling
pytz
Provides timezone handling functionality.
Example: datetime.datetime.now(pytz.timezone("US/Eastern"))
Visual Representation
+---------------+
| datetime |
+---------------+
|
|
v
+---------------+---------------+---------------+
| now() | date() | time() |
+---------------+---------------+---------------+
| (current | (year, | (hour, |
| date and | month, day) | minute, |
| time) | | second, |
| | | microsecond) |
+---------------+---------------+---------------+
|
|
v
+---------------+---------------+---------------+
| strftime() | strptime() | timedelta |
+---------------+---------------+---------------+
| (format | (parse | (duration |
| date and | string) | between |
| time) | | dates or |
| | | times) |
+---------------+---------------+---------------+
|
|
v
+---------------+---------------+
| pytz |
+---------------+
| (timezone |
| handling) |
+---------------+
The datetime
library is the most commonly used library for date and time manipulation.
Key Functions
datetime.datetime.now()
: Get the current date and time.datetime.datetime.strptime(date_string, format)
: Parse a string into a datetime object.datetime.datetime.strftime(format)
: Format a datetime object as a string.datetime.timedelta()
: Represent a duration, useful for date arithmetic.datetime.datetime.combine()
: Combine date and time into a single datetime object.
time
Library
The time
library provides functions for working with time in seconds since the epoch (Unix time).
Key Functions
time.time()
: Get the current time in seconds since the epoch.time.sleep(seconds)
: Pause the program for the specified number of seconds.
Example
import time
# Get the current time in seconds since the epoch
epoch_time = time.time()
print("Current time (epoch):", epoch_time)
# Pause the program for 5 seconds
print("Sleeping for 5 seconds...")
time.sleep(5)
print("Awake!")
calendar
Library
The calendar
library provides functions related to the calendar.
Key Functions
calendar.month(year, month)
: Get a string representing the month’s calendar.calendar.isleap(year)
: Check if a year is a leap year.
Example
import calendar
# Print the calendar for July 2024
july_calendar = calendar.month(2024, 7)
print(july_calendar)
# Check if a year is a leap year
is_leap = calendar.isleap(2024)
print("Is 2024 a leap year?", is_leap)
dateutil
Library
The dateutil
library extends the datetime
library with additional features.
Key Functions
dateutil.parser.parse(date_string)
: Parse a date string into a datetime object.dateutil.relativedelta.relativedelta()
: Represent relative time differences, such as months or years.
Example
from dateutil import parser, relativedelta
# Parse a date string
date_string = "24 July, 2024"
parsed_date = parser.parse(date_string)
print("Parsed date:", parsed_date)
# Add 1 month and 10 days to a date
future_date = parsed_date + relativedelta.relativedelta(months=1, days=10)
print("Future date:", future_date)
pytz
Library
The pytz
library allows accurate and cross-platform timezone calculations.
Key Functions
pytz.timezone(zone)
: Get a timezone object.datetime.datetime.now(pytz.timezone)
: Get the current time in a specific timezone.
Example
from datetime import datetime
import pytz
# Get the current time in UTC
utc_time = datetime.now(pytz.utc)
print("Current UTC time:", utc_time)
# Convert UTC time to Eastern Time
eastern = pytz.timezone('US/Eastern')
eastern_time = utc_time.astimezone(eastern)
print("Eastern Time:", eastern_time)
Date and Time manipulation operations
Here are some common date and time manipulation operations:
Importing the Required Modules
from datetime import datetime, date, time, timedelta
import time as tm
import calendar
1. Getting the Current Date and Time
# Current date and time
now = datetime.now()
print("Current Date and Time:", now)
# Current date
today = date.today()
print("Current Date:", today)
# Current time
current_time = now.time()
print("Current Time:", current_time)
2. Formatting Dates and Times
# Formatting datetime
formatted_now = now.strftime("%Y-%m-%d %H:%M:%S")
print("Formatted Date and Time:", formatted_now)
# Formatting date
formatted_date = today.strftime("%d-%m-%Y")
print("Formatted Date:", formatted_date)
# Parsing date from string
parsed_date = datetime.strptime("2024-07-24", "%Y-%m-%d")
print("Parsed Date:", parsed_date)
3. Date and Time Arithmetic
# Adding and subtracting days
tomorrow = today + timedelta(days=1)
yesterday = today - timedelta(days=1)
print("Tomorrow's Date:", tomorrow)
print("Yesterday's Date:", yesterday)
# Adding hours, minutes, seconds
new_time = now + timedelta(hours=2, minutes=30)
print("New Time after adding 2 hours and 30 minutes:", new_time)
# Difference between two dates
date_diff = tomorrow - yesterday
print("Difference between Tomorrow and Yesterday:", date_diff)
4. Getting Day, Month, Year, Weekday
# Getting day, month, year
day = today.day
month = today.month
year = today.year
print("Day:", day, "Month:", month, "Year:", year)
# Getting weekday (0: Monday, 6: Sunday)
weekday = today.weekday()
print("Weekday:", weekday)
# Getting ISO weekday (1: Monday, 7: Sunday)
iso_weekday = today.isoweekday()
print("ISO Weekday:", iso_weekday)
5. Time Operations
# Current time in seconds since the epoch
epoch_time = tm.time()
print("Current Time in seconds since the epoch:", epoch_time)
# Sleep for a few seconds
print("Sleeping for 2 seconds...")
tm.sleep(2)
print("Awake now!")
6. Working with Calendars
# Printing a month's calendar
print(calendar.month(2024, 7))
# Printing a year's calendar
print(calendar.calendar(2024))
# Checking if a year is a leap year
is_leap = calendar.isleap(2024)
print("Is 2024 a leap year?", is_leap)
7. Creating and Manipulating Time Objects
# Creating a time object
t = time(14, 30, 45) # 14:30:45
print("Time Object:", t)
# Accessing hour, minute, second
print("Hour:", t.hour)
print("Minute:", t.minute)
print("Second:", t.second)
8. Combining Date and Time
# Combining date and time into a datetime object
combined = datetime.combine(today, t)
print("Combined Date and Time:", combined)
Example: Working with Timezones
To handle time zones, you can use the pytz
library.
import pytz
# Get the current time in UTC
utc_now = datetime.now(pytz.utc)
print("Current UTC Time:", utc_now)
# Convert UTC time to a different timezone
eastern = pytz.timezone('US/Eastern')
eastern_time = utc_now.astimezone(eastern)
print("Eastern Time:", eastern_time)
Above we discussed about Python Date and Time specific libraries. But do the most used Pandas have date functions?
Yes, pandas has its own set of date manipulation functions, which are built on top of the Python datetime
module. These functions provide additional functionality and convenience for working with dates in pandas DataFrames.
Some key pandas date manipulation functions include:
Date parsing and formatting
pd.to_datetime()
: Convert strings to datetime objectspd.to_date()
: Convert datetime objects to datesdt.strftime()
: Format datetime objects as strings
Date arithmetic
dt.add()
: Add timedeltas to datetime objectsdt.subtract()
: Subtract timedeltas from datetime objectsdt.floor()
: Round down to the nearest time unit (e.g., minute, hour, day)dt.ceil()
: Round up to the nearest time unit
Date extraction
dt.year
: Extract the year from a datetime objectdt.month
: Extract the month from a datetime objectdt.day
: Extract the day from a datetime objectdt.hour
: Extract the hour from a datetime objectdt.minute
: Extract the minute from a datetime objectdt.second
: Extract the second from a datetime object
Time delta functions
pd.Timedelta()
: Create a timedelta objectdt.total_seconds()
: Convert a timedelta to seconds
Resampling and frequency functions
resample()
: Resample data by time frequencyasfreq()
: Convert data to a specified frequency
Holiday and business day functions
pd.holiday()
: Create a holiday calendarpd.BusinessDay()
: Create a business day calendarpd.DateOffset()
: Create a date offset object
pandas also provides several date-related indexers and selectors, such as:
pd.date_range()
: Create a date range indexpd.period_range()
: Create a period range indexpd.DatetimeIndex()
: Create a datetime index
These functions provide a more convenient and efficient way to work with dates in pandas DataFrames compared to using the Python datetime
module alone.
Example:
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({'date': ['2022-01-01', '2022-01-15', '2022-02-01']})
df['date'] = pd.to_datetime(df['date'])
# Extract the year and month
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
# Add 3 days to the date
df['new_date'] = df['date'] + pd.Timedelta(days=3)
print(df)
Output:
date year month new_date
0 2022-01-01 2022 1 2022-01-04
1 2022-01-15 2022 1 2022-01-18
2 2022-02-01 2022 2 2022-02-04
here comes another question:- pd.Timedelta(days=3) is pandas inbuilt or do we have to call datetime?
so in above case, pd.Timedelta(days=3)
is a pandas built-in function.
pd.Timedelta
is a pandas-specific class that represents a duration, which is a difference between two dates or times.
While datetime.timedelta
is a Python built-in class from the datetime
module.
Both pd.Timedelta
and datetime.timedelta
can be used to represent time durations, but they have some differences:
pandas pd.Timedelta
:
- Designed specifically for pandas date and time operations.
- Supports pandas-specific frequency codes (e.g., ‘D’, ‘B’, ‘H’, ‘T’, ‘S’).
- Integrates well with pandas DataFrames and Series.
Python datetime.timedelta
:
- Part of the Python
datetime
module. - More general-purpose, can be used outside of pandas.
- Does not support pandas-specific frequency codes.
You can use either pd.Timedelta
or datetime.timedelta
depending on your specific use case:
import pandas as pd
import datetime
# pandas Timedelta
pd_timedelta = pd.Timedelta(days=3)
print(type(pd_timedelta)) # <class 'pandas._libs.tslibs.timedeltas.Timedelta'>
# datetime timedelta
dt_timedelta = datetime.timedelta(days=3)
print(type(dt_timedelta)) # <class 'datetime.timedelta'>
In the example you provided, pd.Timedelta(days=3)
is used to add 3 days to the date column in a pandas DataFrame.
You can achieve the same result using datetime.timedelta
:
import pandas as pd
import datetime
df = pd.DataFrame({'date': ['2022-01-01']})
df['date'] = pd.to_datetime(df['date'])
df['new_date'] = df['date'] + datetime.timedelta(days=3)
print(df)
Interesting Use Cases
- Automated Reports: Use
datetime
to schedule daily, weekly, or monthly reports. - Reminder Systems: Use
time.sleep()
to create a simple reminder system that notifies the user at specified intervals. - Timezone Conversions: Use
pytz
to handle internationalization of applications by converting times between different timezones. - Scheduling Tasks: Combine
datetime
anddateutil.relativedelta
to schedule tasks to run at specific intervals, like every first Monday of the month. - Leap Year Calculations: Use
calendar.isleap()
to handle leap year specific logic in date calculations.
Example: Automating a Daily Task
import datetime
import time
def daily_task():
print("Performing daily task...")
# Schedule the task to run at 8:00 AM every day
target_time = datetime.time(hour=8, minute=0, second=0)
while True:
now = datetime.datetime.now()
current_time = now.time()
if current_time >= target_time:
daily_task()
# Sleep until the same time next day
time_to_sleep = 24*60*60 - (now - now.replace(hour=8, minute=0, second=0)).seconds
else:
# Sleep until target time today
time_to_sleep = (datetime.datetime.combine(now.date(), target_time) - now).seconds
time.sleep(time_to_sleep)
This example demonstrates how to schedule a daily task to run at 8:00 AM every day using the datetime
and time
libraries.
A example project:-
- Generate Dynamic Variable Names for the Next 24 Months
- Store the Variable Names in a Dictionary
- Use the Dictionary to Create CSV Files
- Create Arrays Based on the Dictionary
- Generate Custom Excel Files
- Define Functions for Common Data Quality Tests and Reporting
Here is a step-by-step implementation in Python:
Step 1: Generate Dynamic Variable Names for the Next 24 Months
We will generate dynamic variable names based on the current date and store them in a dictionary.
import pandas as pd
from datetime import datetime, timedelta
import os
# Function to generate the next 24 months
def generate_months():
current_date = datetime.now()
months = {}
for i in range(24):
month_year = current_date.strftime("%Y%m")
months[f'Month_{month_year}'] = f'{current_date.year}_{current_date.month:02d}'
current_date += timedelta(days=30)
return months
# Generate the months
months_dict = generate_months()
print(months_dict)
Step 2: Store the Variable Names in a Dictionary
We store the generated month variable names in a dictionary for further use.
Step 3: Use the Dictionary to Create CSV Files
We will create CSV files based on the dictionary. For simplicity, let’s assume we have a sample DataFrame that we filter by year and month to create CSV files.
# Sample DataFrame
data = {
'date': pd.date_range(start='2023-01-01', periods=365, freq='D'),
'value': range(365)
}
df = pd.DataFrame(data)
# Function to create CSV files for each month
def create_csv_files(df, months_dict):
for month_name, ym in months_dict.items():
year, month = ym.split('_')
filtered_df = df[(df['date'].dt.year == int(year)) & (df['date'].dt.month == int(month))]
filtered_df.to_csv(f'{month_name}.csv', index=False)
# Create CSV files
create_csv_files(df, months_dict)
Step 4: Create Arrays Based on the Dictionary
We can create arrays based on the dictionary. This step is often specific to the use case but here’s a basic example.
# Function to create arrays
def create_arrays(months_dict):
arrays = {}
for month_name in months_dict.keys():
arrays[month_name] = []
return arrays
# Create arrays
arrays_dict = create_arrays(months_dict)
print(arrays_dict)
Step 5: Generate Custom Excel Files
We will generate custom Excel files with dynamic column names based on the dictionary.
from openpyxl import Workbook
# Function to create Excel files
def create_excel_files(months_dict):
for month_name in months_dict.keys():
wb = Workbook()
ws = wb.active
for col in range(1, 13):
col_name = f'xyz_{months_dict[month_name]}_{col:02d}'
ws.cell(row=1, column=col, value=col_name)
wb.save(f'{month_name}.xlsx')
# Create Excel files
create_excel_files(months_dict)
Step 6: Define Functions for Common Data Quality Tests and Reporting
We will define a function for common data quality tests on a DataFrame and generate a report.
# Function for data quality tests
def data_quality_tests(df):
results = {
'missing_values': df.isnull().sum(),
'duplicate_rows': df.duplicated().sum(),
'summary_statistics': df.describe()
}
return results
# Function to generate DQ report
def generate_dq_report(df, months_dict):
for month_name, ym in months_dict.items():
year, month = ym.split('_')
filtered_df = df[(df['date'].dt.year == int(year)) & (df['date'].dt.month == int(month))]
dq_results = data_quality_tests(filtered_df)
with open(f'{month_name}_dq_report.txt', 'w') as f:
for key, value in dq_results.items():
f.write(f'{key}:n{value}nn')
# Generate DQ report
generate_dq_report(df, months_dict)
In this implementation:
- We generate dynamic variable names for the next 24 months and store them in a dictionary.
- We create CSV and Excel files based on the dictionary.
- We define functions for common data quality tests and reporting.
This approach can be expanded and customized based on specific requirements for data analysis and reporting.
Another Intersting Coding example:-
Concept Building Examples:–
To dynamically generate a list of month-year combinations starting from the current month and going either forward or backward for the next/previous 12 months:-
To dynamically generate a list of month-year combinations starting from the current month and going either forward or backward for the next/previous 12 months, you can use Python’s datetime
module. Here’s an example where the script dynamically calculates the months and years, stores them in a dictionary, and can be accessed in a similar way to SAS macros.
from datetime import datetime, timedelta
import calendar
# Function to generate dynamic months and years
def generate_month_years(start_date, months_to_generate=12, direction='forward'):
months_years = {}
current_date = start_date
# Generate month-year combinations based on the direction (forward/backward)
for i in range(1, months_to_generate + 1):
month_year_key = f"month_year_{i}"
month_year_value = f"{calendar.month_abbr[current_date.month]}{current_date.year}"
months_years[month_year_key] = month_year_value
# Move to next/previous month
if direction == 'forward':
current_date = (current_date.replace(day=1) + timedelta(days=32)).replace(day=1) # Move to next month
elif direction == 'backward':
current_date = (current_date.replace(day=1) - timedelta(days=1)).replace(day=1) # Move to previous month
return months_years
# Get current date
start_date = datetime.now()
# Example: Generate next 12 months starting from the current month
months_years_forward = generate_month_years(start_date, months_to_generate=12, direction='forward')
# Example: Generate last 12 months
months_years_backward = generate_month_years(start_date, months_to_generate=12, direction='backward')
# Print forward months-year combinations
print("Next 12 months:")
for key, value in months_years_forward.items():
print(f"{key}: {value}")
# Print backward months-year combinations
print("nLast 12 months:")
for key, value in months_years_backward.items():
print(f"{key}: {value}")
Next 12 months:
month_year_1: Sep2024
month_year_2: Oct2024
month_year_3: Nov2024
month_year_4: Dec2024
month_year_5: Jan2025
month_year_6: Feb2025
month_year_7: Mar2025
month_year_8: Apr2025
month_year_9: May2025
month_year_10: Jun2025
month_year_11: Jul2025
month_year_12: Aug2025
Last 12 months:
month_year_1: Sep2024
month_year_2: Aug2024
month_year_3: Jul2024
month_year_4: Jun2024
month_year_5: May2024
month_year_6: Apr2024
month_year_7: Mar2024
month_year_8: Feb2024
month_year_9: Jan2024
month_year_10: Dec2023
month_year_11: Nov2023
month_year_12: Oct2023
Project2:-
Vars | Month_period | Mon_year | |
xyplus1 | 202409 | Sep2024 | |
xyplus2 | 202410 | Oct2024 | |
xyplus3 | 202411 | Nov2024 | |
xyplus4 | 202412 | Dec2024 | |
xyplus5 | 202501 | Jan2025 | |
xyplus6 | 202502 | Feb2025 | |
xyplus7 | 202503 | Mar2025 | |
xyplus8 | 202504 | Apr2025 | |
xyplus9 | 202505 | May2025 | |
xyplus10 | 202506 | Jun2025 | |
xyplus11 | 202507 | Jul2025 | |
xyplus12 | 202508 | Aug2025 | |
xyminus1 | 202409 | Sep2024 | |
xyminus2 | 202408 | Aug2024 | |
xyminus3 | 202407 | Jul2024 | |
xyminus4 | 202406 | Jun2024 | |
xyminus5 | 202405 | May2024 | |
xyminus6 | 202404 | Apr2024 | |
xyminus7 | 202403 | Mar2024 | |
xyminus8 | 202402 | Feb2024 | |
xyminus9 | 202401 | Jan2024 | |
xyminus10 | 202312 | Dec2023 | |
xyminus11 | 202311 | Nov2023 | |
xyminus12 | 202310 | Oct2023 |
from datetime import datetime, timedelta
import calendar
# Function to generate month variables
def generate_date_vars(start_date=None, months_to_generate=12):
if start_date is None:
start_date = datetime.now() # Default to current date
# Prepare the dictionary to store the generated variables
date_vars = {}
# Loop to generate month-year values dynamically
current_date = start_date
for i in range(1, months_to_generate + 1):
# Generate the key based on the index
key = f"xyplus{i}"
# Get the yyyyMM format
yyyymm = current_date.strftime("%Y%m")
# Get the readable format like "Sep2024"
month_abbr = f"{calendar.month_abbr[current_date.month]}{current_date.year}"
# Add these to the dictionary
date_vars[key] = {"yyyymm": yyyymm, "readable": month_abbr}
# Move to the next month
current_date = (current_date.replace(day=1) + timedelta(days=32)).replace(day=1)
return date_vars
# Example: Get date variables starting from the current date
date_variables = generate_date_vars()
# Display the dictionary
for key, value in date_variables.items():
print(f"{key}: {value['yyyymm']}, {value['readable']}")
# Example usage of the dictionary in a Hive query
mnth_values = [val['yyyymm'] for val in date_variables.values()]
mnth_str = ', '.join([f"'{mnth}'" for mnth in mnth_values])
hive_query = f"SELECT * FROM my_table WHERE mnth IN ({mnth_str})"
print(hive_query)
SELECT * FROM my_table WHERE mnth IN ('202409', '202410', '202411', '202412', '202501', '202502', ...)
month_abbr = f”{calendar.month_abbr[current_date.month]}{current_date.year}” is only way to Get the readable format like “Sep2024”
In Python, using calendar.month_abbr
is one way to get the abbreviated month name (like “Sep”). However, there are other ways to generate a readable format like “Sep2024”. Here are some alternatives:
1. Using strftime
You can use Python’s strftime()
method, which gives you more flexibility and is a common way to format dates.
pythonCopy code# Using strftime to generate the format "Sep2024"
month_abbr = current_date.strftime("%b%Y")
%b
: Abbreviated month name (e.g., “Sep”).%Y
: Year in four digits (e.g., “2024”).
This is a cleaner and often more flexible solution than using the calendar
module.
2. Using locale
for International Month Names
If you want to generate month names in a specific language (locale), you can use the locale
module in conjunction with strftime
.
import locale
# Set locale for international month abbreviations (example: German)
locale.setlocale(locale.LC_TIME, 'de_DE') # German locale
# Use strftime for localized month abbreviation and year
month_abbr = current_date.strftime("%b%Y") # For German: "Sep2024" will become "Sep2024"
3. Full Example Integrating with the Original Code:
Here’s how you could use strftime
to generate both the yyyymm
and readable Sep2024
formats:
from datetime import datetime, timedelta
# Function to generate month variables
def generate_date_vars(start_date=None, months_to_generate=12):
if start_date is None:
start_date = datetime.now() # Default to current date
# Prepare the dictionary to store the generated variables
date_vars = {}
# Loop to generate month-year values dynamically
current_date = start_date
for i in range(1, months_to_generate + 1):
# Generate the key based on the index
key = f"xyplus{i}"
# Get the yyyyMM format
yyyymm = current_date.strftime("%Y%m")
# Get the readable format like "Sep2024" using strftime
month_abbr = current_date.strftime("%b%Y")
# Add these to the dictionary
date_vars[key] = {"yyyymm": yyyymm, "readable": month_abbr}
# Move to the next month
current_date = (current_date.replace(day=1) + timedelta(days=32)).replace(day=1)
return date_vars
# Example: Get date variables starting from the current date
date_variables = generate_date_vars()
# Display the dictionary
for key, value in date_variables.items():
print(f"{key}: {value['yyyymm']}, {value['readable']}")
Output Example:
xyplus1: 202409, Sep2024
xyplus2: 202410, Oct2024
xyplus3: 202411, Nov2024
xyplus4: 202412, Dec2024
xyplus5: 202501, Jan2025
xyplus6: 202502, Feb2025
...
Let’s break down the line:
current_date = (current_date.replace(day=1) + timedelta(days=32)).replace(day=1)
Purpose:
This line is used to move to the next month while ensuring the day is set to the 1st of the month. It helps in scenarios where we want to jump to the next month, irrespective of the number of days in the current month (some months have 28, 30, or 31 days).
Detailed Breakdown:
current_date.replace(day=1)
:- This part ensures that the day of the
current_date
is set to the 1st of the month. - Example: If
current_date
is2024-09-15
, this changes it to2024-09-01
. - This is important because we want to safely add days without accidentally rolling over into the wrong month.
- This part ensures that the day of the
timedelta(days=32)
:timedelta(days=32)
adds 32 days to the current date.- Why 32? Since no month has more than 31 days, adding 32 days ensures we always jump to the next month, regardless of the length of the current month.
- For example:
- If the date is
2024-09-01
, adding 32 days will move us to2024-10-03
, which is in October. - If the date is
2024-02-01
(February), adding 32 days will safely jump us into March.
- If the date is
- For example:
(...).replace(day=1)
:- After adding 32 days, we use
.replace(day=1)
again to reset the day to the 1st of the month. - This ensures that, regardless of the current date, we always move to the first day of the next month.
- Example: If adding 32 days results in
2024-10-03
, this final step will change it to2024-10-01
.
- After adding 32 days, we use
Why is this done?
- The goal is to move to the next month, no matter how many days are in the current month (28, 30, or 31 days).
- Using 32 days ensures we don’t miss the next month, and resetting the day to 1 (
.replace(day=1)
) ensures we always land on the first day of that month.
Example Walkthrough:
- Starting Date:
2024-09-15
Step-by-step:current_date.replace(day=1)
→2024-09-01
+ timedelta(days=32)
→2024-10-03
(moved 32 days forward).replace(day=1)
→2024-10-01
(reset to the 1st day of the month)
So after the line executes, you will have moved to the first day of the next month, which in this example is October 1st, 2024.
Why not just add a month?
Python’s datetime
module does not directly support adding months due to the varying number of days in each month. Using this method, however, allows us to reliably move to the next month without worrying about how many days the current month has.
suppose there is sales table we want to quickly find the avg of three months sales values ,8month back we need to get this avg. xyminus9, xyminus10, xyminus12 will be the monthperiod to consider . Now sales table has a field sales_date with values ’13-05-2024′ such -date as string, so how to work this out with help of our dictionary
To achieve this, we can use the dynamic month variables stored in the dictionary and filter the sales data based on those month periods (xyminus9
, xyminus10
, xyminus12
). Here’s a breakdown of how you can accomplish this:
Steps:
- Create a dynamic dictionary that holds month and year combinations like
xyminus9
,xyminus10
, etc. - Extract month and year from the
sales_date
(which is stored as a string in the format'DD-MM-YYYY'
). - Filter the sales table based on the month periods you want (
xyminus9
,xyminus10
,xyminus12
). - Calculate the average sales for these filtered records.
Implementation:
- Generate the dictionary (similar to what we did earlier): We’ll generate variables like
xyminus1
,xyminus2
, and so on, holding theyyyymm
format. - Query the sales table using the dictionary values for the months you want (
xyminus9
,xyminus10
,xyminus12
).
Here’s a Python example to demonstrate:
from datetime import datetime, timedelta
# Function to generate month variables for the past months
def generate_past_date_vars(start_date=None, months_to_generate=12):
if start_date is None:
start_date = datetime.now() # Default to current date
# Prepare the dictionary to store the generated variables
date_vars = {}
# Loop to generate past month-year values dynamically
current_date = start_date
for i in range(1, months_to_generate + 1):
# Generate the key based on the index (negative to represent past)
key = f"xyminus{i}"
# Get the yyyyMM format
yyyymm = current_date.strftime("%Y%m")
# Add to the dictionary
date_vars[key] = yyyymm
# Move to the previous month
current_date = (current_date.replace(day=1) - timedelta(days=1)).replace(day=1)
return date_vars
# Example: Generate last 12 months' date variables starting from today
past_date_vars = generate_past_date_vars()
# Now the dictionary will contain values like:
# xyminus1: '202409', xyminus2: '202408', ..., xyminus9: '202401'
# Print the months of interest
print("Months of interest:")
print(past_date_vars['xyminus9'], past_date_vars['xyminus10'], past_date_vars['xyminus12'])
Sample Output for the Dictionary:
If the current date is September 2024, you would get:
vbnetCopy codexyminus9: '202401'
xyminus10: '202312'
xyminus12: '202310'
Now, to filter your sales
table:
Assuming you have a sales
table with a sales_date
field formatted as DD-MM-YYYY
, you need to:
- Convert the
sales_date
intoyyyymm
format. - Filter the records for
xyminus9
,xyminus10
, andxyminus12
. - Calculate the average of the sales.
Here’s an example using pandas
:
import pandas as pd
# Sample sales data (assuming 'sales_date' is a string in 'DD-MM-YYYY' format)
data = {'sales_date': ['13-05-2024', '20-12-2023', '15-01-2024', '01-10-2023'],
'sales_amount': [1000, 1500, 1200, 1800]}
sales_df = pd.DataFrame(data)
# Convert 'sales_date' to datetime format and then extract 'yyyymm'
sales_df['sales_date'] = pd.to_datetime(sales_df['sales_date'], format='%d-%m-%Y')
sales_df['yyyymm'] = sales_df['sales_date'].dt.strftime('%Y%m')
# Define the months of interest (xyminus9, xyminus10, xyminus12)
months_of_interest = [past_date_vars['xyminus9'], past_date_vars['xyminus10'], past_date_vars['xyminus12']]
# Filter the sales data for these months
filtered_sales = sales_df[sales_df['yyyymm'].isin(months_of_interest)]
# Calculate the average sales for the filtered months
average_sales = filtered_sales['sales_amount'].mean()
# Display the result
print("Average sales for the months of interest:", average_sales)
Explanation:
- Convert
sales_date
toyyyymm
format: This allows you to match the sales data with the months in your dictionary. - Filter the
sales_df
DataFrame: Using theisin()
function, filter rows where theyyyymm
value matchesxyminus9
,xyminus10
, orxyminus12
. - Calculate the average: Use the
.mean()
function to compute the average sales amount for the selected rows.
Expected Output:
If the sales data has matching records for the months xyminus9
, xyminus10
, and xyminus12
, you will get the average sales value for those months.
Conclusion:
- You can easily use the dictionary (
past_date_vars
) to dynamically filter the relevant sales data. - The process involves converting
sales_date
to a comparable format (yyyymm
), filtering based on the desired months, and calculating the average.
here are a few additional methods we can consider for dynamically working with dates and filtering sales data.
1. Using SQL Query with a Hive Table
If you are using Hive for querying your sales data, you can create a similar solution directly in SQL. Here’s an example of how you can filter your Hive table for specific months dynamically:
Steps:
- Create a dynamic date range using Python or SQL.
- Filter Hive table using
yyyymm
format in SQL. - Aggregate and calculate average sales within the SQL query.
SQL Example:
Assume the sales table has a sales_date
stored as a string in the format 'DD-MM-YYYY'
:
WITH sales_data AS (
SELECT
sales_date,
sales_amount,
from_unixtime(unix_timestamp(sales_date, 'dd-MM-yyyy'), 'yyyyMM') AS yyyymm
FROM sales_table
)
SELECT
AVG(sales_amount) AS avg_sales
FROM
sales_data
WHERE
yyyymm IN ('202401', '202312', '202310'); -- These are the months xyminus9, xyminus10, and xyminus12
Explanation:
from_unixtime()
: Converts your stringsales_date
to a Unix timestamp, then formats it asyyyymm
.WHERE
clause: Filters based on the dynamic month values (xyminus9
,xyminus10
, etc.), which can be passed as parameters.AVG()
: Aggregates and calculates the average sales.
This approach offloads the computation to Hive, making it more efficient for large datasets.
2. Using Python’s dateutil.relativedelta
for Date Arithmetic
If you prefer using a library that handles date arithmetic in a more readable way, the dateutil
library is a great alternative to manage relative date calculations.
from datetime import datetime
from dateutil.relativedelta import relativedelta
import pandas as pd
# Initialize the current date
current_date = datetime.now()
# Function to generate previous months dynamically using relativedelta
def get_previous_months(num_months):
months = {}
for i in range(1, num_months + 1):
month_date = current_date - relativedelta(months=i)
months[f"xyminus{i}"] = month_date.strftime("%Y%m")
return months
# Generate the last 12 months' yyyymm
past_date_vars = get_previous_months(12)
print(past_date_vars)
# Example to filter sales data using pandas DataFrame
sales_data = {'sales_date': ['13-05-2024', '15-01-2024', '01-10-2023'],
'sales_amount': [1000, 1200, 1800]}
sales_df = pd.DataFrame(sales_data)
# Convert 'sales_date' to yyyymm
sales_df['sales_date'] = pd.to_datetime(sales_df['sales_date'], format='%d-%m-%Y')
sales_df['yyyymm'] = sales_df['sales_date'].dt.strftime('%Y%m')
# Filter for xyminus9, xyminus10, xyminus12
months_of_interest = [past_date_vars['xyminus9'], past_date_vars['xyminus10'], past_date_vars['xyminus12']]
filtered_sales = sales_df[sales_df['yyyymm'].isin(months_of_interest)]
# Calculate average sales
average_sales = filtered_sales['sales_amount'].mean()
print("Average Sales:", average_sales)
Key Benefits:
relativedelta
is more intuitive for manipulating dates by months, years, etc.- It allows you to easily handle dynamic month generation and is especially useful for more complex date calculations.
3. Using Hive Window Functions
If you want to compute rolling averages or other aggregations directly in your Hive table, you can use window functions. This is particularly useful when you need to compute averages over a sliding window of months.
Example SQL Query:
WITH sales_data AS (
SELECT
sales_date,
sales_amount,
from_unixtime(unix_timestamp(sales_date, 'dd-MM-yyyy'), 'yyyyMM') AS yyyymm
FROM sales_table
)
SELECT
yyyymm,
sales_amount,
AVG(sales_amount) OVER (PARTITION BY sales_table ORDER BY yyyymm ROWS BETWEEN 9 PRECEDING AND 9 FOLLOWING) AS avg_sales
FROM sales_data
WHERE
yyyymm IN ('202401', '202312', '202310');
Explanation:
- Window function (
OVER
): Calculates a rolling average over a window of months. - This is useful if you need to calculate averages not only for a fixed set of months but for a rolling period.
4. Using PySpark for Large-Scale Data Processing
For large datasets, PySpark can be a great option for distributed data processing. You can easily handle the dynamic date generation using Python, and then apply transformations on your sales data.
Here’s a simplified PySpark example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format, col
# Create Spark session
spark = SparkSession.builder.appName("sales_avg").getOrCreate()
# Sample sales data
data = [("13-05-2024", 1000), ("20-12-2023", 1500), ("15-01-2024", 1200), ("01-10-2023", 1800)]
columns = ["sales_date", "sales_amount"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Convert sales_date to proper date type and extract yyyymm
df = df.withColumn("sales_date", date_format("sales_date", "dd-MM-yyyy"))
.withColumn("yyyymm", date_format("sales_date", "yyyyMM"))
# Define the months of interest
months_of_interest = ['202401', '202312', '202310']
# Filter based on dynamic months
df_filtered = df.filter(col("yyyymm").isin(months_of_interest))
# Calculate average sales
df_filtered.groupBy().avg("sales_amount").show()
Benefits of Using PySpark:
- Scalable: Can process very large datasets.
- Distributed: Leverages the power of distributed computing for efficient filtering and aggregations.
5. Using Pandas with Date Ranges
You can leverage Pandas to create a range of dates and filter your sales data using these ranges.
import pandas as pd
# Define the range of dates for the last 12 months
today = pd.to_datetime("today")
date_range = pd.date_range(end=today, periods=12, freq='M').strftime('%Y%m')
# Example sales DataFrame
sales_data = {'sales_date': ['13-05-2024', '20-12-2023', '15-01-2024', '01-10-2023'],
'sales_amount': [1000, 1500, 1200, 1800]}
sales_df = pd.DataFrame(sales_data)
# Convert sales_date to yyyymm
sales_df['sales_date'] = pd.to_datetime(sales_df['sales_date'], format='%d-%m-%Y')
sales_df['yyyymm'] = sales_df['sales_date'].dt.strftime('%Y%m')
# Filter the sales data for the last 12 months
filtered_sales = sales_df[sales_df['yyyymm'].isin(date_range)]
# Calculate average sales
average_sales = filtered_sales['sales_amount'].mean()
print("Average Sales:", average_sales)
Conclusion:
Depending on your specific environment and dataset size, you can choose:
- SQL or Hive queries if you want to compute averages directly on the server.
- Pandas or PySpark if you’re working with Python and handling data in-memory or in a distributed fashion.
relativedelta
and date arithmetic libraries to simplify dynamic month generation.
Leave a Reply