This post covers widely used Python string techniques in PySpark: ETL automation, DataFrame column handling, and dynamic query generation.
1. Convert All Column Names to Lowercase
When working with PySpark DataFrames, column names may be in mixed case, which can cause issues in queries. We can normalize them using Python string methods.
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("String_Usecases").getOrCreate()
# Sample DataFrame
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["ID", "Name_New"])
# Convert column names to lowercase
df = df.toDF(*[col.lower() for col in df.columns])
df.show()
2. Remove _new from Column Names
If column names contain a suffix like _new, we can remove it dynamically.
df = df.toDF(*[col.replace("_new", "") for col in df.columns])
df.show()
3. Using f-strings and .format() for Dynamic Table Names
Dynamic table names are often used in PySpark ETL pipelines.
Using f-strings:
table_name = "customer_data"
query = f"SELECT * FROM {table_name} WHERE active_flag = 1"
df = spark.sql(query)
df.show()
Using .format():
query = "SELECT * FROM {} WHERE active_flag = 1".format(table_name)
df = spark.sql(query)
df.show()
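The same pattern extends to config-driven pipelines. A minimal sketch, assuming the listed table names (hypothetical here) exist in the catalog:
source_tables = ["customer_data", "orders_data"]  # hypothetical table names
for table_name in source_tables:
    query = f"SELECT * FROM {table_name} WHERE active_flag = 1"
    df_active = spark.sql(query)
    print(f"{table_name}: {df_active.count()} active rows")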
4. Using Strings for Dynamic Filtering
If you need to filter a DataFrame dynamically based on a condition:
filter_column = "status"
filter_value = "active"
df_filtered = df.filter(f"{filter_column} = '{filter_value}'")
df_filtered.show()
Using .format():
df_filtered = df.filter("{} = '{}'".format(filter_column, filter_value))
df_filtered.show()
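String conditions work, but when the filter value might contain quotes it is generally safer to build a Column expression instead; an equivalent sketch:
from pyspark.sql.functions import col
# Same filter expressed as a Column comparison rather than a SQL string
df_filtered = df.filter(col(filter_column) == filter_value)
df_filtered.show()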
5. String Handling in PySpark ETL Automation
In ETL automation, strings are often used to generate queries dynamically, process column names, or define configuration-based transformations.
Example: Dynamically generating an INSERT query
table_name = "sales_data"
columns = ["id", "amount", "date"]
values = [("1", "100", "2024-02-01"), ("2", "200", "2024-02-02")]
query = f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES "
# Quote each value so the generated SQL is valid for string/date literals
value_str = ", ".join([f"({', '.join(repr(v) for v in row)})" for row in values])
query += value_str
print(query) # INSERT INTO sales_data (id, amount, date) VALUES ('1', '100', '2024-02-01'), ('2', '200', '2024-02-02')
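Assuming the sales_data table already exists in the Spark catalog, the generated statement can be run with spark.sql(query). An alternative sketch that avoids string assembly entirely is to append a DataFrame:
# Assumes the target table already exists with matching column order
new_rows = spark.createDataFrame(values, columns)  # reuse the sample values above
new_rows.write.insertInto(table_name)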
6. Using Strings in UDFs for Text Processing
PySpark UDFs can be used to manipulate strings within DataFrames.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def clean_text(text):
    # Return nulls untouched so the UDF does not fail on missing values
    if text is None:
        return None
    return text.lower().replace("_new", "")
clean_text_udf = udf(clean_text, StringType())
df = df.withColumn("cleaned_column", clean_text_udf(df["Name"]))
df.show()
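For simple transformations like this, built-in column functions usually outperform Python UDFs because rows never have to be serialized to Python; an equivalent sketch without a UDF:
from pyspark.sql.functions import lower, regexp_replace
# Same cleaning logic expressed with native column functions
df = df.withColumn("cleaned_column", regexp_replace(lower(df["Name"]), "_new", ""))
df.show()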
7. String Normalization (Removing Special Characters)
Sometimes, column values contain unwanted special characters, and we need to clean them.
from pyspark.sql.functions import regexp_replace
df = df.withColumn("cleaned_name", regexp_replace(df["Name"], "[^a-zA-Z0-9 ]", ""))
df.show()
8. String Padding (Adding Leading Zeros)
For standardizing IDs, we may need to pad them with leading zeros.
from pyspark.sql.functions import lpad
df = df.withColumn("padded_id", lpad(df["ID"].cast("string"), 5, "0"))
df.show()
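An alternative sketch using format_string, which formats the numeric ID directly without an explicit cast:
from pyspark.sql.functions import format_string
# Zero-pad the numeric ID to 5 digits via printf-style formatting
df = df.withColumn("padded_id", format_string("%05d", df["ID"]))
df.show()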
9. Extracting Substrings from Columns
We can extract specific portions of strings using substring functions.
from pyspark.sql.functions import substring
df = df.withColumn("first_two_chars", substring(df["Name"], 1, 2))
df.show()
10. Splitting Strings into Multiple Columns
For example, splitting full names into first and last names.
from pyspark.sql.functions import split
df = df.withColumn("Name_Split", split(df["Name"], " "))
df = df.withColumn("First_Name", df["Name_Split"].getItem(0)) \
.withColumn("Last_Name", df["Name_Split"].getItem(1))
df.show()
11. Concatenating Multiple String Columns
For example, combining first and last names.
from pyspark.sql.functions import concat_ws
df = df.withColumn("Full_Name", concat_ws(" ", df["First_Name"], df["Last_Name"]))
df.show()
12. Checking for Substring Presence
To filter rows containing a specific word.
from pyspark.sql.functions import col
df_filtered = df.filter(col("Name").contains("Alice"))
df_filtered.show()
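Related checks can be written with SQL-style patterns or regular expressions; a short sketch:
df.filter(col("Name").like("%Ali%")).show()   # SQL LIKE pattern
df.filter(col("Name").rlike("^Ali")).show()   # regular expression match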
13. Replacing Values in a Column
Replacing specific values dynamically.
df = df.withColumn("Name", regexp_replace(df["Name"], "Alice", "Alicia"))
df.show()
14. Converting Column Values to Uppercase/Lowercase
from pyspark.sql.functions import upper, lower
df = df.withColumn("upper_name", upper(df["Name"]))
df = df.withColumn("lower_name", lower(df["Name"]))
df.show()
15. Using Python Strings in PySpark Logging
log_message = f"Processing table {table_name} at timestamp {spark.sql('SELECT current_timestamp()').collect()[0][0]}"
print(log_message)
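For scheduled jobs, Python's standard logging module is usually preferable to print; a minimal sketch (the logger name and level are arbitrary choices here):
import logging
from pyspark.sql.functions import current_timestamp
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl_pipeline")
# Fetch the current timestamp from Spark rather than the driver clock
ts = spark.range(1).select(current_timestamp()).collect()[0][0]
logger.info("Processing table %s at timestamp %s", table_name, ts)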
16. Handling Null or Empty Strings in Columns
Replacing nulls or empty strings with default values.
from pyspark.sql.functions import when
df = df.withColumn("Name", when(df["Name"].isNull() | (df["Name"] == ""), "Unknown").otherwise(df["Name"]))
df.show()
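If only nulls need a default (not empty strings), DataFrame.fillna offers a shorter form:
# Replace nulls in the Name column with a default value
df = df.fillna({"Name": "Unknown"})
df.show()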
17. Removing Whitespace (Trim)
from pyspark.sql.functions import trim
df = df.withColumn("trimmed_name", trim(df["Name"]))
df.show()
Summary of Use Cases
| Use Case | Example |
|---|---|
| Convert column names to lowercase | [col.lower() for col in df.columns] |
| Remove _new from column names | [col.replace("_new", "") for col in df.columns] |
| Dynamic query generation | f"SELECT * FROM {table_name}" |
| Dynamic filtering | df.filter(f"{filter_column} = '{filter_value}'") |
| ETL automation using string formatting | f"INSERT INTO {table} ({cols}) VALUES {values}" |
| String UDF for transformation | udf(lambda x: x.lower(), StringType()) |
| Removing special characters | regexp_replace(df["col"], "[^a-zA-Z0-9 ]", "") |
| String padding | lpad(df["ID"], 5, "0") |
| Extract substring | substring(df["col"], 1, 2) |
| Splitting strings | split(df["Name"], " ") |
| Concatenating columns | concat_ws(" ", df["First_Name"], df["Last_Name"]) |
| Checking substring presence | df.filter(col("col").contains("text")) |
| Replacing values dynamically | regexp_replace(df["col"], "old", "new") |
| Uppercase/lowercase conversion | upper(df["col"]) |
| Handling null/empty strings | when(df["col"] == "", "Unknown").otherwise(df["col"]) |
| Removing whitespace | trim(df["col"]) |
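To tie several of these techniques together, here is a small end-to-end sketch; the raw column names and sample values are made up for illustration:
from pyspark.sql.functions import trim, lower, regexp_replace
# Hypothetical raw input
raw = spark.createDataFrame([(1, "  Alice "), (2, "Bob!!")], ["ID", "Name_New"])
# Normalize column names: lowercase and strip the _new suffix
raw = raw.toDF(*[c.lower().replace("_new", "") for c in raw.columns])
# Clean values: lowercase, trim whitespace, drop special characters
clean = raw.withColumn("name", regexp_replace(trim(lower(raw["name"])), "[^a-z0-9 ]", ""))
# Build and run a dynamic query against a temporary view
view_name = "customers_clean"
clean.createOrReplaceTempView(view_name)
spark.sql(f"SELECT id, name FROM {view_name} WHERE name != ''").show()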