String manipulation is a common task in data processing. PySpark provides a variety of built-in functions for manipulating string columns in DataFrames. Below, we explore some of the most useful string manipulation functions and demonstrate how to use them with examples.
Common String Manipulation Functions
- concat: Concatenates multiple string columns into one.
- substring: Extracts a substring from a string column.
- length: Computes the length of a string column.
- trim, ltrim, rtrim: Trim whitespace from strings.
- upper, lower: Convert strings to upper or lower case.
- regexp_replace: Replaces substrings matching a regex pattern.
- regexp_extract: Extracts substrings matching a regex pattern.
- split: Splits a string column based on a delimiter.
- replace: Replaces all occurrences of a substring with another substring.
- translate: Replaces characters in a string based on a mapping.
Example Usage
1. Concatenation
Syntax:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, lit
# Initialize Spark session
spark = SparkSession.builder.appName("string_manipulation").getOrCreate()
# Sample data
data = [("John", "Doe"), ("Jane", "Smith")]
columns = ["first_name", "last_name"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Concatenate first and last names
df = df.withColumn("full_name", concat(df.first_name, lit(" "), df.last_name))
# Show result
df.show()
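As an alternative, concat_ws builds the same string with the separator passed as its first argument, so no explicit lit(" ") is needed. A minimal sketch continuing from the DataFrame above (the column name full_name_ws is just illustrative):
from pyspark.sql.functions import concat_ws
# concat_ws takes the separator first, then the columns to join
df = df.withColumn("full_name_ws", concat_ws(" ", df.first_name, df.last_name))
df.show()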
2. Substring Extraction
Syntax:
from pyspark.sql.functions import substring
# Extract first three characters of first_name
df = df.withColumn("initials", substring(df.first_name, 1, 3))
# Show result
df.show()
3. Length Calculation
Syntax:
from pyspark.sql.functions import length
# Calculate length of first_name
df = df.withColumn("name_length", length(df.first_name))
# Show result
df.show()
4. Trimming
Syntax:
from pyspark.sql.functions import trim, ltrim, rtrim
# Trim leading and trailing whitespace from first_name (no visible effect here, since the sample names have none)
df = df.withColumn("trimmed_name", trim(df.first_name))
# Show result
df.show()
5. Case Conversion
Syntax:
from pyspark.sql.functions import upper, lower
# Convert first_name to upper case
df = df.withColumn("upper_name", upper(df.first_name))
# Convert last_name to lower case
df = df.withColumn("lower_name", lower(df.last_name))
# Show result
df.show()
6. Regex Replace
Syntax:
from pyspark.sql.functions import regexp_replace
# Replace all digits with asterisks (no effect on this sample data, since the names contain no digits)
df = df.withColumn("masked_name", regexp_replace(df.first_name, "\\d", "*"))
# Show result
df.show()
7. Regex Extract
Syntax:
from pyspark.sql.functions import regexp_extract
# Extract digits from first_name (returns an empty string here, since the sample names contain no digits)
df = df.withColumn("extracted_digits", regexp_extract(df.first_name, "(\\d+)", 1))
# Show result
df.show()
8. Split
Syntax:
from pyspark.sql.functions import split
# Split full_name into first and last name
df = df.withColumn("split_name", split(df.full_name, " "))
# Show result
df.select("split_name").show(truncate=False)
9. Replace
Syntax:
from pyspark.sql.functions import expr
# Replace 'Doe' with 'Smith' in last_name using the SQL replace() function via expr
df = df.withColumn("updated_last_name", expr("replace(last_name, 'Doe', 'Smith')"))
# Show result
df.show()
10. Translate
Syntax:
from pyspark.sql.functions import translate
# Translate 'o' to '0' in last_name
df = df.withColumn("translated_last_name", translate(df.last_name, "o", "0"))
# Show result
df.show()
Comprehensive Example: Combining Multiple String Manipulations
Let’s create a project that combines multiple string manipulation operations on a DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, lit, substring, length, trim, upper, lower, regexp_replace, regexp_extract, split, expr, translate
# Initialize Spark session
spark = SparkSession.builder.appName("string_manipulation").getOrCreate()
# Sample data
data = [("John123", "Doe456"), ("Jane789", "Smith012")]
columns = ["first_name", "last_name"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Perform various string manipulations
df = df.withColumn("full_name", concat(df.first_name, lit(" "), df.last_name))
df = df.withColumn("initials", substring(df.first_name, 1, 3))
df = df.withColumn("name_length", length(df.first_name))
df = df.withColumn("trimmed_name", trim(df.first_name))
df = df.withColumn("upper_name", upper(df.first_name))
df = df.withColumn("lower_name", lower(df.last_name))
df = df.withColumn("masked_name", regexp_replace(df.first_name, "d", "*"))
df = df.withColumn("extracted_digits", regexp_extract(df.first_name, "(d+)", 1))
df = df.withColumn("split_name", split(df.full_name, " "))
df = df.withColumn("updated_last_name", expr("replace(last_name, 'Doe456', 'Smith')"))
df = df.withColumn("translated_last_name", translate(df.last_name, "456", "789"))
# Show result
df.show(truncate=False)
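For the first input row ("John123", "Doe456"), the derived columns work out roughly as follows:
# full_name            -> "John123 Doe456"
# initials             -> "Joh"
# name_length          -> 7
# masked_name          -> "John***"
# extracted_digits     -> "123"
# split_name           -> ["John123", "Doe456"]
# updated_last_name    -> "Smith"
# translated_last_name -> "Doe789"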
Explanation
- Concatenation: Combines first_name and last_name with a space.
- Substring Extraction: Extracts the first three characters of first_name.
- Length Calculation: Computes the length of first_name.
- Trimming: Trims any whitespace around first_name.
- Case Conversion: Converts first_name to upper case and last_name to lower case.
- Regex Replace: Replaces all digits in first_name with asterisks.
- Regex Extract: Extracts digits from first_name.
- Split: Splits full_name into an array holding the first and last name.
- Replace: Replaces 'Doe456' with 'Smith' in last_name.
- Translate: Translates '456' to '789' in last_name.
These examples demonstrate the versatility of string manipulation functions in PySpark, allowing for complex transformations and processing of text data.
Here’s a table summarizing common string manipulation functions in PySpark SQL, including regular expressions (regex):
| Function | Description | Example (Input: "Hello World") |
|---|---|---|
| initcap(str) | Converts the first letter of each word to uppercase | "Hello World" -> "Hello World" (already-capitalized words remain unchanged) |
| lower(str) | Converts all characters to lowercase | "Hello World" -> "hello world" |
| upper(str) | Converts all characters to uppercase | "Hello World" -> "HELLO WORLD" |
| lpad(str, length, pad_string) | Pads a string on the left to the given length with a pad string | "Hello World", 15, "-" -> "----Hello World" |
| rpad(str, length, pad_string) | Pads a string on the right to the given length with a pad string | "Hello World", 15, "-" -> "Hello World----" |
| ltrim(str) | Removes leading whitespace characters | " Hello World" -> "Hello World" |
| rtrim(str) | Removes trailing whitespace characters | "Hello World " -> "Hello World" |
| trim(str) | Removes leading and trailing whitespace characters | " Hello World " -> "Hello World" |
| length(str) | Returns the number of characters in a string | "Hello World" -> 11 |
| substr(str, pos, len) | Extracts a substring from a string | "Hello World", 7, 5 -> "World" |
| instr(str, substr) | Returns the 1-based position of a substring within a string (0 if not found) | "Hello World", "World" -> 7 |
| concat(str1, str2, ...) | Concatenates (joins) multiple strings | "Hello", " ", "World" -> "Hello World" |
| concat_ws(sep, str1, str2, ...) | Concatenates strings with a specified separator | ",", "Hello", "World" -> "Hello,World" |
| regexp_extract(str, pattern, idx) | Extracts a substring matching a regular expression | "Hello_World", "_[^_]+", 0 -> "_World" |
| regexp_replace(str, pattern, replacement) | Replaces substrings matching a regular expression with a replacement string | "Hello_World", "_", " " -> "Hello World" |
| split(str, delimiter) | Splits a string into an array based on a delimiter | "Hello,World,How", "," -> ["Hello", "World", "How"] |
| array_join(arr, sep) | Joins the elements of an array into a string with a specified separator | ["Hello", "World", "How"], "," -> "Hello,World,How" |
| soundex(str) | Returns the Soundex code (phonetic encoding) of a string | "Hello" -> "H400" |
| translate(str, matching, replace) | Replaces characters based on a character mapping (characters with no replacement are removed) | "Hello World", "e", "" -> "Hllo World" |
| overlay(str, replace_str, pos, len) | Replaces a substring within a string with another string | "Hello World", "there", 7, 5 -> "Hello there" |
| reverse(str) | Reverses the order of characters in a string | "Hello World" -> "dlroW olleH" |
| locate(substr, str) | Returns the 1-based position of the first occurrence of a substring within a string (0 if not found) | "World", "Hello World" -> 7 |
| levenshtein(str1, str2) | Calculates the Levenshtein distance (minimum number of single-character edits) between two strings | "Hello", "Jello" -> 1 |
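Most of these functions can also be used directly inside SQL expressions. A minimal sketch exercising a few of them with selectExpr (the sample value and column aliases are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_string_functions").getOrCreate()
df = spark.createDataFrame([("hello world",)], ["text"])

df.selectExpr(
    "initcap(text) AS capitalized",                      # 'Hello World'
    "lpad(text, 15, '-') AS left_padded",                # '----hello world'
    "instr(text, 'world') AS world_pos",                 # 7
    "levenshtein(text, 'hello word') AS edit_distance"   # 1
).show(truncate=False)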
Summary in Detail
A guide to performing common string manipulation tasks in PySpark:
concat: Concatenates two or more strings.
Syntax: concat(col1, col2, ..., colN)
Example:
from pyspark.sql.functions import concat, lit, col
df = df.withColumn("Full Name", concat(col("First Name"), lit(" "), col("Last Name")))
substring: Extracts a substring from a string (the Column method .substr() is an equivalent alternative).
Syntax: substring(col, pos, len)
Example:
from pyspark.sql.functions import substring, col
df = df.withColumn("First Name", substring(col("Name"), 1, 4))
split: Splits a string into an array of substrings.
Syntax: split(col, pattern)
Example:
from pyspark.sql.functions import split, col
df = df.withColumn("Address Parts", split(col("Address"), " "))
regexp_extract: Extracts a substring using a regular expression.
Syntax: regexp_extract(col, pattern, idx)
Example:
from pyspark.sql.functions import regexp_extract, col
df = df.withColumn("Phone Number", regexp_extract(col("Contact Info"), r"\d{3}-\d{3}-\d{4}", 0))
translate: Replaces specified characters in a string.
Syntax: translate(col, matching, replace)
Example:
from pyspark.sql.functions import translate, col
df = df.withColumn("Clean Name", translate(col("Name"), "aeiou", "AEIOU"))
trim: Removes leading and trailing whitespace from a string.
Syntax: trim(col)
Example:
from pyspark.sql.functions import trim, col
df = df.withColumn("Clean Address", trim(col("Address")))
lower: Converts a string to lowercase.
Syntax: lower(col)
Example:
from pyspark.sql.functions import lower, col
df = df.withColumn("Lower Name", lower(col("Name")))
upper: Converts a string to uppercase.
Syntax: upper(col)
Example:
from pyspark.sql.functions import upper, col
df = df.withColumn("Upper Name", upper(col("Name")))
String Data Cleaning in PySpark
Here are some common string data cleaning functions in PySpark, along with their syntax and examples:
trim: Removes leading and trailing whitespace from a string.
Syntax: trim(col)
Example:
from pyspark.sql.functions import trim, col
df = df.withColumn("Clean Address", trim(col("Address")))
regexp_replace: Replaces substrings matching a regular expression.
Syntax: regexp_replace(col, pattern, replacement)
Example:
from pyspark.sql.functions import regexp_replace, col
df = df.withColumn("Clean Name", regexp_replace(col("Name"), "[^a-zA-Z]", ""))
replace: Replaces all occurrences of a substring with another substring (added to pyspark.sql.functions in Spark 3.5; on earlier versions use regexp_replace or expr("replace(...)") instead).
Syntax: replace(src, search, replace)
Example:
from pyspark.sql.functions import replace, col, lit
df = df.withColumn("Clean Address", replace(col("Address"), lit(" "), lit("")))
Removing accents: PySpark has no built-in remove_accents function. Accents can be stripped with translate for a known set of accented characters, or with a UDF based on Python's unicodedata module.
Example (a sketch using translate with an explicit character mapping; extend the mapping to cover the characters in your data):
from pyspark.sql.functions import translate, col
df = df.withColumn("Clean Name", translate(col("Name"), "áéíóúàèìòùâêîôûãõç", "aeiouaeiouaeiouaoc"))
Standardizing text: PySpark has no built-in standardize function. A common approach is to combine lower with regexp_replace to drop punctuation and normalize case.
Example:
from pyspark.sql.functions import lower, regexp_replace, col
df = df.withColumn("Standardized Name", lower(regexp_replace(col("Name"), "[^a-zA-Z0-9 ]", "")))