Python Regex: A Complete Tutorial with Use Cases for Email Search Across a Database and Code Search in a Code Repository

Regular expressions (regex) are a powerful tool for matching patterns in text. Python’s re module provides functions and tools for working with regular expressions. Here’s a complete tutorial on using regex in Python.

1. Importing the re Module

To use regular expressions in Python, you need to import the re module:

import re

2. Basic Functions

re.search()

Scans a string for the first location where the pattern matches and returns a match object, or None if there is no match.

pattern = r"\d+"  # Matches one or more digits
text = "The year is 2024"
match = re.search(pattern, text)
if match:
    print(f"Match found: {match.group()}")  # Output: Match found: 2024

re.match()

Matches a pattern only at the beginning of a string and returns None if the string does not start with a match.

pattern = r"\d+"  # Matches one or more digits
text = "2024 is the year"
match = re.match(pattern, text)
if match:
    print(f"Match found: {match.group()}")  # Output: Match found: 2024
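
The difference between re.match() and re.search() is easiest to see on the same input. The sketch below reuses the digit pattern from above. re.fullmatch() is not covered elsewhere in this tutorial; it is included only for comparison and requires the entire string to match.

```python
import re

pattern = r"\d+"
text = "The year is 2024"

# re.match only tries position 0, so it fails here.
at_start = re.match(pattern, text)
# re.search scans the whole string and finds the digits.
anywhere = re.search(pattern, text)
# re.fullmatch succeeds only if the entire string matches the pattern.
whole = re.fullmatch(pattern, "2024")
partial = re.fullmatch(pattern, "2024 AD")

print(at_start)           # None
print(anywhere.group())   # 2024
print(whole.group())      # 2024
print(partial)            # None
```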

re.findall()

Finds all non-overlapping matches of a pattern in a string.

pattern = r"\d+"  # Matches one or more digits
text = "The years are 2024, 2025, and 2026"
matches = re.findall(pattern, text)
print(f"Matches found: {matches}")  # Output: Matches found: ['2024', '2025', '2026']
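
One detail that often surprises people: when the pattern contains capturing groups, re.findall() returns the groups instead of the full match. A small sketch with made-up input:

```python
import re

text = "alice=30, bob=25"

# No groups: each item is the full matched string.
full = re.findall(r"\w+=\d+", text)
# Two groups: each item is a tuple holding the captured groups.
pairs = re.findall(r"(\w+)=(\d+)", text)

print(full)   # ['alice=30', 'bob=25']
print(pairs)  # [('alice', '30'), ('bob', '25')]
```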

re.finditer()

Returns an iterator yielding match objects for all non-overlapping matches.

pattern = r"\d+"  # Matches one or more digits
text = "The years are 2024, 2025, and 2026"
matches = re.finditer(pattern, text)
for match in matches:
    print(f"Match found: {match.group()}")  # Output: 2024, 2025, 2026
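
Because re.finditer() yields match objects rather than strings, each result also carries its position in the text. A minimal sketch, reusing the text above, that collects the (start, end) span of every match:

```python
import re

text = "The years are 2024, 2025, and 2026"

spans = []
for match in re.finditer(r"\d+", text):
    # span() returns (start, end); text[start:end] is the matched substring.
    start, end = match.span()
    spans.append((start, end))
    print(f"{match.group()} found at {start}-{end}")
```

This is useful when you need to highlight or slice the original text, not just list the matched values.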

re.sub()

Replaces occurrences of a pattern with a specified string.

pattern = r"\d+"  # Matches one or more digits
text = "The years are 2024, 2025, and 2026"
new_text = re.sub(pattern, "YEAR", text)
print(new_text)  # Output: The years are YEAR, YEAR, and YEAR
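
The replacement does not have to be a fixed string. As a sketch, the example below shows two other forms re.sub() accepts: a template with backreferences (\1, \2, ...) and a function that builds the replacement from each match. The helper name next_year is made up for illustration.

```python
import re

text = "Dates: 2024-06-27 and 2025-07-28"

# \1, \2 and \3 refer to the captured groups; reorder YYYY-MM-DD as DD/MM/YYYY.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

def next_year(match):
    # Called once per match; must return the replacement string.
    return str(int(match.group()) + 1)

# \b...\b limits the match to standalone four-digit numbers (the years).
bumped = re.sub(r"\b\d{4}\b", next_year, text)

print(reordered)  # Dates: 27/06/2024 and 28/07/2025
print(bumped)     # Dates: 2025-06-27 and 2026-07-28
```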

3. Special Characters

  • .: Matches any character except a newline.
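  • ^: Matches the start of the string (and the start of each line when re.MULTILINE is set).
  • $: Matches the end of the string or just before a trailing newline (and the end of each line when re.MULTILINE is set).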
  • *: Matches 0 or more repetitions of the preceding element.
  • +: Matches 1 or more repetitions of the preceding element.
  • ?: Matches 0 or 1 repetition of the preceding element.
  • {m,n}: Matches from m to n repetitions of the preceding element.
  • []: Matches any one of the characters inside the brackets.
  • |: Matches either the pattern before or the pattern after the |.
  • (): Groups patterns and captures matches.
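
To make the list above concrete, here is a short sketch that exercises each special character on small, made-up strings:

```python
import re

dot = re.findall(r"c.t", "cat cot c\nt")               # . does not match the newline
anchored = bool(re.search(r"^abc$", "abc"))            # the whole string is "abc"
star = re.findall(r"ab*", "a ab abbb")                 # "a" followed by zero or more "b"
plus = re.findall(r"ab+", "a ab abbb")                 # "a" followed by one or more "b"
optional = re.findall(r"colou?r", "color colour")      # the "u" is optional
bounded = re.findall(r"\b\d{2,3}\b", "1 22 333 4444")  # numbers with two or three digits
char_set = re.findall(r"[aeiou]", "regex")             # any one character from the set
either = re.findall(r"cat|dog", "cat, dog, bird")      # alternation
grouped = re.search(r"(ha)+", "hahaha").group()        # the group is repeated as a unit

print(dot)       # ['cat', 'cot']
print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['color', 'colour']
print(bounded)   # ['22', '333']
print(char_set)  # ['e', 'e']
print(either)    # ['cat', 'dog']
print(grouped)   # hahaha
```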

4. Escaping Special Characters

To match a literal special character, escape it with a backslash (\).

pattern = r"\$100"  # Matches the string "$100"
text = "The price is $100"
match = re.search(pattern, text)
if match:
    print(f"Match found: {match.group()}")  # Output: Match found: $100

5. Character Classes

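  • \d: Matches any decimal digit. For ASCII text this is equivalent to [0-9]; by default it also matches digits from other scripts unless re.ASCII is set.
  • \D: Matches any non-digit.
  • \w: Matches any word character: letters, digits, and the underscore (equivalent to [a-zA-Z0-9_] when re.ASCII is set).
  • \W: Matches any non-word character.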
  • \s: Matches any whitespace character (spaces, tabs, newlines).
  • \S: Matches any non-whitespace character.
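
In Python 3, these classes are Unicode-aware for str patterns, which matters when the text is not plain ASCII. The sketch below compares the default behaviour with the re.ASCII flag; the sample strings are made up, and escape sequences are used so the code does not depend on file encoding.

```python
import re

# "\u096a\u0968" is the number 42 written with Devanagari digits.
text = "room 42, \u096a\u0968"

unicode_digits = re.findall(r"\d+", text)            # default: any Unicode decimal digit
ascii_digits = re.findall(r"\d+", text, re.ASCII)    # re.ASCII: only 0-9

# "caf\u00e9" is "cafe" with an accented final letter.
unicode_words = re.findall(r"\w+", "caf\u00e9 menu")
ascii_words = re.findall(r"\w+", "caf\u00e9 menu", re.ASCII)

print(len(unicode_digits), len(ascii_digits))  # 2 1
print(ascii_words)                             # ['caf', 'menu']
```

If you are validating input that must contain only 0-9, writing [0-9] explicitly or passing re.ASCII avoids surprises.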

6. Groups and Capturing

You can use parentheses to create groups and capture parts of the match.

pattern = r"(\d{4})-(\d{2})-(\d{2})"  # Matches dates in the format YYYY-MM-DD
text = "Today's date is 2024-06-27"
match = re.search(pattern, text)
if match:
    year, month, day = match.groups()
    print(f"Year: {year}, Month: {month}, Day: {day}")  # Output: Year: 2024, Month: 06, Day: 27

7. Named Groups

Named groups allow you to assign a name to a capturing group for easier access.

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"  # Matches dates in the format YYYY-MM-DD
text = "Today's date is 2024-06-27"
match = re.search(pattern, text)
if match:
    print(f"Year: {match.group('year')}, Month: {match.group('month')}, Day: {match.group('day')}")  # Output: Year: 2024, Month: 06, Day: 27
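
Named groups can also be read all at once with groupdict() and referenced in a replacement string with \g<name>. A minimal sketch using the same date pattern (the variable name date_re is just for illustration):

```python
import re

date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
text = "Today's date is 2024-06-27"

match = date_re.search(text)
# groupdict() maps each group name to the text it captured.
parts = match.groupdict() if match else {}

# \g<name> inserts the text of a named group into the replacement.
swapped = date_re.sub(r"\g<day>/\g<month>/\g<year>", text)

print(parts)    # {'year': '2024', 'month': '06', 'day': '27'}
print(swapped)  # Today's date is 27/06/2024
```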

8. Lookahead and Lookbehind

Lookaheads and lookbehinds are assertions that allow you to match a pattern only if it is (or isn’t) followed or preceded by another pattern.

Positive Lookahead

pattern = r"\d+(?= dollars)"  # Matches digits followed by "dollars"
text = "The price is 100 dollars"
match = re.search(pattern, text)
if match:
    print(f"Match found: {match.group()}")  # Output: Match found: 100

Negative Lookahead

pattern = r"\b\d+\b(?! dollars)"  # Matches whole numbers not followed by " dollars"; \b prevents a partial match such as "10"
text = "The price is 100 dollars or 200 euros"
matches = re.findall(pattern, text)
print(f"Matches found: {matches}")  # Output: Matches found: ['200']

Positive Lookbehind

pattern = r"(?<=\$)\d+"  # Matches digits preceded by a dollar sign
text = "The price is $100"
match = re.search(pattern, text)
if match:
    print(f"Match found: {match.group()}")  # Output: Match found: 100

Negative Lookbehind

pattern = r"(?<![$\d])\d+"  # Matches digit runs not preceded by a dollar sign; excluding a preceding digit prevents a partial match such as "00"
text = "The price is 100 dollars or $200"
matches = re.findall(pattern, text)
print(f"Matches found: {matches}")  # Output: Matches found: ['100']
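
Since lookarounds consume no text, several of them can be stacked at the same position to check independent conditions. The sketch below uses this for a simple password rule; the rule itself (at least 8 characters with a lowercase letter, an uppercase letter, and a digit) and the names STRONG_RE and is_strong are assumptions made up for this example.

```python
import re

# Each (?=...) is checked from the start of the string and consumes nothing,
# so all three conditions must hold before .{8,} matches the actual text.
STRONG_RE = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$")

def is_strong(password):
    """Return True if the password satisfies the illustrative rule above."""
    return STRONG_RE.match(password) is not None

print(is_strong("Passw0rdX"))  # True
print(is_strong("password1"))  # False (no uppercase letter)
print(is_strong("Sh0rt"))      # False (too short)
```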

9. Compiling Regular Expressions

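When you use the same regex many times, you can compile it once with re.compile() and reuse the resulting pattern object. The re module already caches recently used patterns, so the speed gain is usually modest; the main benefits are readability and skipping the cache lookup on every call.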

pattern = re.compile(r"\d{4}-\d{2}-\d{2}")  # Compile the regex pattern
text = "The dates are 2024-06-27 and 2025-07-28"

# Use the compiled pattern
matches = pattern.findall(text)
print(f"Matches found: {matches}")  # Output: Matches found: ['2024-06-27', '2025-07-28']
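
A compiled pattern is typically defined once and then reused across many inputs. A minimal sketch, assuming a list of log lines invented for illustration:

```python
import re

# Compile once at module level, then reuse the pattern object's methods.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

# Hypothetical log lines, used only for illustration.
log_lines = [
    "2024-06-27 service started",
    "heartbeat ok",
    "2025-07-28 service stopped",
]

# Keep only the lines that contain a date, then pull out the first date in each.
dated_lines = [line for line in log_lines if DATE_RE.search(line)]
first_dates = [DATE_RE.search(line).group() for line in dated_lines]

print(first_dates)  # ['2024-06-27', '2025-07-28']
```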

10. Flags

You can use flags to modify the behavior of regex functions. Common flags include:

  • re.IGNORECASE (re.I): Ignore case.
  • re.MULTILINE (re.M): Multi-line matching, affects ^ and $.
  • re.DOTALL (re.S): Dot matches all characters, including newline.

pattern = r"^hello"
text = "Hello\nhello"

# Without flags: ^ matches only at the start of the string, and "Hello" differs in case
matches = re.findall(pattern, text)
print(f"Matches without flag: {matches}")  # Output: Matches without flag: []

# With IGNORECASE and MULTILINE flags
matches = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)
print(f"Matches with flags: {matches}")  # Output: Matches with flags: ['Hello', 'hello']
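
The re.DOTALL flag from the list above is easiest to understand on text that spans several lines. A short sketch with a made-up HTML-like string:

```python
import re

text = "<p>first line\nsecond line</p>"
pattern = r"<p>(.*)</p>"

# By default "." stops at the newline, so the pattern cannot span both lines.
without_dotall = re.search(pattern, text)
# With re.DOTALL, "." also matches "\n" and the whole paragraph is captured.
with_dotall = re.search(pattern, text, re.DOTALL)

print(without_dotall)                # None
print(repr(with_dotall.group(1)))    # 'first line\nsecond line'
```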

Example: Searching for Emails in a Database

Suppose you have a database with user data, and you want to extract all email addresses.

import re

# Sample data representing rows in a database
data = [
    "Alice, alice@example.com, 123-456-7890",
    "Bob, bob123@gmail.com, 987-654-3210",
    "Charlie, charlie123@company.org, 555-555-5555",
    "Invalid data, no email here, 000-000-0000"
]

# Regex pattern for matching email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

emails = []

for entry in data:
    found_emails = re.findall(email_pattern, entry)
    emails.extend(found_emails)

print("Extracted Emails:")
for email in emails:
    print(email)
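
The list above stands in for database rows. To search an actual database, one possible approach is to walk every table and scan each text value. The sketch below does this with Python's built-in sqlite3 module and an in-memory database; the table names, column names, and the find_emails helper are all made up for illustration, and other database systems would need their own way of listing tables.

```python
import re
import sqlite3

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def find_emails(conn):
    """Scan every column of every table; return (table, column, email) tuples."""
    hits = []
    # sqlite_master lists the tables defined in a SQLite database.
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        columns = [desc[0] for desc in cursor.description]
        for row in cursor:
            for column, value in zip(columns, row):
                if isinstance(value, str):  # skip numbers, NULLs and blobs
                    for email in EMAIL_RE.findall(value):
                        hits.append((table, column, email))
    return hits

# Build a small in-memory database with hypothetical tables for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, contact TEXT)")
conn.execute("CREATE TABLE notes (body TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("Alice", "alice@example.com"), ("Bob", "no email here")])
conn.execute("INSERT INTO notes VALUES (?)",
             ("Ping bob123@gmail.com or charlie123@company.org",))

hits = find_emails(conn)
for table, column, email in hits:
    print(f"{table}.{column}: {email}")
conn.close()
```

For large tables, filtering in SQL first (for example with LIKE '%@%') and applying the regex only to the candidate rows would reduce the amount of data pulled into Python.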

Example: Searching for Code Snippets in a Code Repository

Suppose you have a repository with Python files, and you want to find all instances of function definitions.

import re
import os

# Directory containing Python files
code_directory = 'path_to_code_repository'

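# Regex pattern for matching function definitions in Python.
# It stops at the opening parenthesis so that annotated (-> type) and multi-line signatures are also matched.
function_pattern = r'\bdef\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*\('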

functions = []

# Walk through the directory and process each Python file
for root, dirs, files in os.walk(code_directory):
    for file in files:
        if file.endswith('.py'):
            file_path = os.path.join(root, file)
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                file_content = f.read()
                found_functions = re.findall(function_pattern, file_content)
                for func in found_functions:
                    functions.append((file, func))

print("Found Functions:")
for file, func in functions:
    print(f"Function '{func}' found in file '{file}'")
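
When searching code, it usually helps to know where each hit is. The variant below is a sketch rather than a full parser: it anchors the pattern at the start of a line with re.MULTILINE, also accepts async def, and reports line numbers. The names DEF_RE and find_functions, and the temporary sample file, are made up for illustration.

```python
import os
import re
import tempfile

# ^[ \t]* anchors the match at the start of a line (re.MULTILINE), so a "def"
# that appears in the middle of a line, for example in a comment, is skipped.
DEF_RE = re.compile(
    r"^[ \t]*(?:async[ \t]+)?def[ \t]+([A-Za-z_]\w*)[ \t]*\(", re.MULTILINE)

def find_functions(root):
    """Return (path, line_number, function_name) for every def under root."""
    results = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                content = f.read()
            for match in DEF_RE.finditer(content):
                # Line number = newlines before the match start, plus one.
                line_no = content.count("\n", 0, match.start()) + 1
                results.append((path, line_no, match.group(1)))
    return results

# Demo on a throwaway directory containing one small sample file.
sample = (
    "def top(a, b) -> int:\n"
    "    return a + b\n"
    "\n"
    "class C:\n"
    "    async def method(self):\n"
    "        pass  # this def is mid-line and is ignored\n"
)
with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, "sample.py"), "w", encoding="utf-8") as f:
        f.write(sample)
    found = [(os.path.basename(p), n, fn) for p, n, fn in find_functions(repo)]

for filename, line_no, func in found:
    print(f"{filename}:{line_no}: {func}")
```

A regex cannot tell real definitions from text inside multi-line strings; when that matters, Python's ast module is the more reliable tool.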
