Regular expressions (regex) are a powerful tool for matching patterns in text. Python’s re module provides functions and tools for working with regular expressions. Here’s a complete tutorial on using regex in Python.
1. Importing the re Module
To use regular expressions in Python, you need to import the re module:
import re
2. Basic Functions
re.search()
Searches for a pattern within a string and returns a match object if found.
pattern = r"\d+" # Matches one or more digits
text = "The year is 2024"
match = re.search(pattern, text)
if match:
    print(f"Match found: {match.group()}") # Output: Match found: 2024
re.match()
Matches a pattern at the beginning of a string.
pattern = r"\d+" # Matches one or more digits
text = "2024 is the year"
match = re.match(pattern, text)
if match:
    print(f"Match found: {match.group()}") # Output: Match found: 2024
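The difference between re.match() and re.search() is easiest to see when the digits are not at the start of the string. The short sketch below reuses the same pattern on the earlier sentence; the variable names at_start and anywhere are only illustrative.

```python
import re

pattern = r"\d+"  # one or more digits
text = "The year is 2024"

# re.match() only tries the pattern at position 0, so it finds nothing here.
at_start = re.match(pattern, text)
# re.search() scans forward and finds the digits later in the string.
anywhere = re.search(pattern, text)

print(at_start)          # None
print(anywhere.group())  # 2024
```

Because both functions return None when nothing matches, it is worth checking the result before calling group() on it.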
re.findall()
Finds all non-overlapping matches of a pattern in a string.
pattern = r"\d+" # Matches one or more digits
text = "The years are 2024, 2025, and 2026"
matches = re.findall(pattern, text)
print(f"Matches found: {matches}") # Output: Matches found: ['2024', '2025', '2026']
re.finditer()
Returns an iterator yielding match objects for all non-overlapping matches.
pattern = r"\d+" # Matches one or more digits
text = "The years are 2024, 2025, and 2026"
matches = re.finditer(pattern, text)
for match in matches:
    print(f"Match found: {match.group()}") # Output: 2024, 2025, 2026
re.sub()
Replaces occurrences of a pattern with a specified string.
pattern = r"\d+" # Matches one or more digits
text = "The years are 2024, 2025, and 2026"
new_text = re.sub(pattern, "YEAR", text)
print(new_text) # Output: The years are YEAR, YEAR, and YEAR
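re.sub() also accepts a count argument, and the replacement string can refer back to captured groups (groups are covered in section 6). The following is a small sketch of both; the names first_only and bracketed are made up for the example.

```python
import re

text = "The years are 2024, 2025, and 2026"

# count=1 limits the substitution to the first match only.
first_only = re.sub(r"\d+", "YEAR", text, count=1)

# \1 in the replacement stands for whatever the first group captured.
bracketed = re.sub(r"(\d+)", r"[\1]", text)

print(first_only)  # The years are YEAR, 2025, and 2026
print(bracketed)   # The years are [2024], [2025], and [2026]
```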
3. Special Characters
. : Matches any character except a newline.
^ : Matches the start of the string.
$ : Matches the end of the string.
* : Matches 0 or more repetitions of the preceding element.
+ : Matches 1 or more repetitions of the preceding element.
? : Matches 0 or 1 repetition of the preceding element.
{m,n} : Matches from m to n repetitions of the preceding element.
[] : Matches any one of the characters inside the brackets.
| : Matches either the pattern before or the pattern after the |.
() : Groups patterns and captures matches.
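To make the list above concrete, here is one small illustration per special character. The sample strings and variable names are invented for the sketch.

```python
import re

# . matches any single character except a newline
dot = re.findall(r"c.t", "cat cot c\nt")
# ^ and $ anchor the pattern to the start and end of the string
anchored = re.search(r"^cat$", "cat") is not None
# *, + and ? set how often the preceding element may repeat
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")
optional = re.findall(r"colou?r", "color colour")
# {m,n} bounds the repetition count (here: two or three "a"s)
bounded = re.findall(r"a{2,3}", "a aa aaa")
# [] matches one character from a set; | chooses between alternatives
char_set = re.findall(r"[bc]at", "bat cat rat")
either = re.findall(r"cat|dog", "cat dog bird")
# () groups elements so a quantifier applies to the whole group
grouped = re.search(r"(ab)+", "ababab").group()

print(dot)       # ['cat', 'cot']
print(anchored)  # True
print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['color', 'colour']
print(bounded)   # ['aa', 'aaa']
print(char_set)  # ['bat', 'cat']
print(either)    # ['cat', 'dog']
print(grouped)   # ababab
```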
4. Escaping Special Characters
To match a literal special character, escape it with a backslash (\).
pattern = r"\$100" # Matches the string "$100"
text = "The price is $100"
match = re.search(pattern, text)
if match:
    print(f"Match found: {match.group()}") # Output: Match found: $100
5. Character Classes
\d : Matches any digit (equivalent to [0-9] for ASCII text).
\D : Matches any non-digit.
\w : Matches any alphanumeric character or underscore (equivalent to [a-zA-Z0-9_] for ASCII text).
\W : Matches any character that is not alphanumeric or an underscore.
\s : Matches any whitespace character (spaces, tabs, newlines).
\S : Matches any non-whitespace character.
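The character classes are usually combined with a quantifier such as +. As a rough sketch on a made-up sentence (the text and variable names are assumptions for illustration):

```python
import re

text = "Order #42 ships on 2024-06-27\tto Bob_Smith"

digits = re.findall(r"\d+", text)     # runs of digits
words = re.findall(r"\w+", text)      # runs of letters, digits, underscores
non_space = re.findall(r"\S+", text)  # runs of non-whitespace characters
# \D matches everything that is not a digit, so removing it keeps digits only
digits_only = re.sub(r"\D", "", "555-555-5555")

print(digits)       # ['42', '2024', '06', '27']
print(words)        # ['Order', '42', 'ships', 'on', '2024', '06', '27', 'to', 'Bob_Smith']
print(non_space)    # ['Order', '#42', 'ships', 'on', '2024-06-27', 'to', 'Bob_Smith']
print(digits_only)  # 5555555555
```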
6. Groups and Capturing
You can use parentheses to create groups and capture parts of the match.
pattern = r"(\d{4})-(\d{2})-(\d{2})" # Matches dates in the format YYYY-MM-DD
text = "Today's date is 2024-06-27"
match = re.search(pattern, text)
if match:
    year, month, day = match.groups()
    print(f"Year: {year}, Month: {month}, Day: {day}") # Output: Year: 2024, Month: 06, Day: 27
7. Named Groups
Named groups allow you to assign a name to a capturing group for easier access.
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})" # Matches dates in the format YYYY-MM-DD
text = "Today's date is 2024-06-27"
match = re.search(pattern, text)
if match:
    print(f"Year: {match.group('year')}, Month: {match.group('month')}, Day: {match.group('day')}") # Output: Year: 2024, Month: 06, Day: 27
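When a pattern has several named groups, the match object's groupdict() method returns them all at once as a dictionary. A brief sketch using the same date pattern (the variable name parts is just for illustration):

```python
import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
match = re.search(pattern, "Today's date is 2024-06-27")

# groupdict() maps each group name to the text it captured.
parts = match.groupdict() if match else {}
print(parts)  # {'year': '2024', 'month': '06', 'day': '27'}

# Named groups are still numbered, so positional access keeps working.
first_group = match.group(1)
print(first_group)  # 2024
```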
8. Lookahead and Lookbehind
Lookaheads and lookbehinds are assertions that allow you to match a pattern only if it is (or isn’t) followed or preceded by another pattern.
Positive Lookahead
pattern = r"\d+(?= dollars)" # Matches digits followed by " dollars"
text = "The price is 100 dollars"
match = re.search(pattern, text)
if match:
    print(f"Match found: {match.group()}") # Output: Match found: 100
Negative Lookahead
# The \b word boundaries stop the engine from backtracking to a partial number such as "10"
pattern = r"\b\d+\b(?! dollars)" # Matches whole numbers not followed by " dollars"
text = "The price is 100 dollars or 200 euros"
matches = re.findall(pattern, text)
print(f"Matches found: {matches}") # Output: Matches found: ['200']
Positive Lookbehind
pattern = r"(?<=\$)\d+" # Matches digits preceded by a dollar sign
text = "The price is $100"
match = re.search(pattern, text)
if match:
    print(f"Match found: {match.group()}") # Output: Match found: 100
Negative Lookbehind
# The \b word boundary stops the match from starting in the middle of "200"
pattern = r"(?<!\$)\b\d+" # Matches whole numbers not preceded by a dollar sign
text = "The price is 100 dollars or $200"
matches = re.findall(pattern, text)
print(f"Matches found: {matches}") # Output: Matches found: ['100']
9. Compiling Regular Expressions
For efficiency, especially when using the same regex multiple times, you can compile a regular expression.
pattern = re.compile(r"\d{4}-\d{2}-\d{2}") # Compile the regex pattern
text = "The dates are 2024-06-27 and 2025-07-28"
# Use the compiled pattern
matches = pattern.findall(text)
print(f"Matches found: {matches}") # Output: Matches found: ['2024-06-27', '2025-07-28']
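A compiled pattern offers the same operations as the module-level functions, as methods on the pattern object. The sketch below reuses one compiled pattern across several lines of text; date_re, lines, found, and redacted are hypothetical names chosen for the example.

```python
import re

date_re = re.compile(r"\d{4}-\d{2}-\d{2}")  # compiled once, reused below
lines = ["start 2024-06-27", "no date here", "end 2025-07-28"]

found = []
for line in lines:
    m = date_re.search(line)  # same behavior as re.search(pattern, line)
    if m:
        found.append(m.group())

# sub() is also available directly on the compiled pattern.
redacted = date_re.sub("<DATE>", lines[0])

print(found)     # ['2024-06-27', '2025-07-28']
print(redacted)  # start <DATE>
```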
10. Flags
You can use flags to modify the behavior of regex functions. Common flags include:
re.IGNORECASE (re.I): Ignore case.
re.MULTILINE (re.M): Multi-line matching, affects ^ and $.
re.DOTALL (re.S): Dot matches all characters, including newline.
pattern = r"^hello"
text = "Hello\nhello"
# Without flags: ^ matches only at the very start, where "Hello" differs in case
matches = re.findall(pattern, text)
print(f"Matches without flag: {matches}") # Output: Matches without flag: []
# With IGNORECASE and MULTILINE flags
matches = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)
print(f"Matches with flags: {matches}") # Output: Matches with flags: ['Hello', 'hello']
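The example above covers re.IGNORECASE and re.MULTILINE. For re.DOTALL, a comparable sketch might look like this (the sample text and variable names are invented for illustration):

```python
import re

text = "<p>first\nsecond</p>"
pattern = r"<p>.*</p>"

# By default . stops at the newline, so the pattern cannot span both lines.
without_flag = re.search(pattern, text)
# With re.DOTALL, . also matches the newline and the whole block is found.
with_dotall = re.search(pattern, text, re.DOTALL)

print(without_flag)               # None
print(repr(with_dotall.group()))  # '<p>first\nsecond</p>'
```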
Example: Searching for Emails in a Database
Suppose you have a database with user data, and you want to extract all email addresses.
import re
# Sample data representing rows in a database
data = [
    "Alice, alice@example.com, 123-456-7890",
    "Bob, bob123@gmail.com, 987-654-3210",
    "Charlie, charlie123@company.org, 555-555-5555",
    "Invalid data, no email here, 000-000-0000"
]
# Regex pattern for matching email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = []
for entry in data:
    found_emails = re.findall(email_pattern, entry)
    emails.extend(found_emails)
print("Extracted Emails:")
for email in emails:
    print(email)
Example: Searching for Code Snippets in a Code Repository
Suppose you have a repository with Python files, and you want to find all instances of function definitions.
import re
import os
# Directory containing Python files
code_directory = 'path_to_code_repository'
# Regex pattern for matching function definitions in Python
function_pattern = r'def\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*\(.*\)\s*:'
functions = []
# Walk through the directory and process each Python file
for root, dirs, files in os.walk(code_directory):
    for file in files:
        if file.endswith('.py'):
            file_path = os.path.join(root, file)
            with open(file_path, 'r') as f:
                file_content = f.read()
            # findall returns only the captured group: the function name
            found_functions = re.findall(function_pattern, file_content)
            for func in found_functions:
                functions.append((file, func))
print("Found Functions:")
for file, func in functions:
    print(f"Function '{func}' found in file '{file}'")