String manipulation is a common task in data processing. PySpark provides a variety of built-in functions for manipulating string columns in DataFrames. Below, we explore some of the most useful string manipulation functions and demonstrate how to use them with examples.
Common String Manipulation Functions
concat
: Concatenates multiple string columns into one.substring
: Extracts a substring from a string column.length
: Computes the length of a string column.trim
,ltrim
,rtrim
: Trims whitespace from strings.upper
,lower
: Converts strings to upper or lower case.regexp_replace
: Replaces substrings matching a regex pattern.regexp_extract
: Extracts substrings matching a regex pattern.split
: Splits a string column based on a delimiter.replace
: Replaces all occurrences of a substring with another substring.translate
: Replaces characters in a string based on a mapping.
Example Usage
1. Concatenation
Syntax:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, lit
# Initialize Spark session
spark = SparkSession.builder.appName("string_manipulation").getOrCreate()
# Sample data
data = [("John", "Doe"), ("Jane", "Smith")]
columns = ["first_name", "last_name"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Concatenate first and last names
df = df.withColumn("full_name", concat(df.first_name, lit(" "), df.last_name))
# Show result
df.show()
2. Substring Extraction
Syntax:
from pyspark.sql.functions import substring
# Extract first three characters of first_name
df = df.withColumn("initials", substring(df.first_name, 1, 3))
# Show result
df.show()
3. Length Calculation
Syntax:
from pyspark.sql.functions import length
# Calculate length of first_name
df = df.withColumn("name_length", length(df.first_name))
# Show result
df.show()
4. Trimming
Syntax:
from pyspark.sql.functions import trim, ltrim, rtrim
# Trim whitespace from first_name
df = df.withColumn("trimmed_name", trim(df.first_name))
# Show result
df.show()
5. Case Conversion
Syntax:
from pyspark.sql.functions import upper, lower
# Convert first_name to upper case
df = df.withColumn("upper_name", upper(df.first_name))
# Convert last_name to lower case
df = df.withColumn("lower_name", lower(df.last_name))
# Show result
df.show()
6. Regex Replace
Syntax:
from pyspark.sql.functions import regexp_replace
# Replace all digits with asterisks
df = df.withColumn("masked_name", regexp_replace(df.first_name, "d", "*"))
# Show result
df.show()
7. Regex Extract
Syntax:
from pyspark.sql.functions import regexp_extract
# Extract digits from first_name
df = df.withColumn("extracted_digits", regexp_extract(df.first_name, "(d+)", 1))
# Show result
df.show()
8. Split
from pyspark.sql.functions import split
# Split full_name into first and last name
df = df.withColumn("split_name", split(df.full_name, " "))
# Show result
df.select("split_name").show(truncate=False)
9. Replace
from pyspark.sql.functions import expr
# Replace 'Doe' with 'Smith' in last_name
df = df.withColumn("updated_last_name", expr("replace(last_name, 'Doe', 'Smith')"))
# Show result
df.show()
10. Translate
Syntax:
from pyspark.sql.functions import translate
# Translate 'o' to '0' in last_name
df = df.withColumn("translated_last_name", translate(df.last_name, "o", "0"))
# Show result
df.show()
Comprehensive Example: Combining Multiple String Manipulations
Let’s create a project that combines multiple string manipulation operations on a DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, lit, substring, length, trim, upper, lower, regexp_replace, regexp_extract, split, expr, translate
# Initialize Spark session
spark = SparkSession.builder.appName("string_manipulation").getOrCreate()
# Sample data
data = [("John123", "Doe456"), ("Jane789", "Smith012")]
columns = ["first_name", "last_name"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Perform various string manipulations
df = df.withColumn("full_name", concat(df.first_name, lit(" "), df.last_name))
df = df.withColumn("initials", substring(df.first_name, 1, 3))
df = df.withColumn("name_length", length(df.first_name))
df = df.withColumn("trimmed_name", trim(df.first_name))
df = df.withColumn("upper_name", upper(df.first_name))
df = df.withColumn("lower_name", lower(df.last_name))
df = df.withColumn("masked_name", regexp_replace(df.first_name, "d", "*"))
df = df.withColumn("extracted_digits", regexp_extract(df.first_name, "(d+)", 1))
df = df.withColumn("split_name", split(df.full_name, " "))
df = df.withColumn("updated_last_name", expr("replace(last_name, 'Doe456', 'Smith')"))
df = df.withColumn("translated_last_name", translate(df.last_name, "456", "789"))
# Show result
df.show(truncate=False)
Explanation
- Concatenation: Combines the
first_name
andlast_name
with a space. - Substring Extraction: Extracts the first three characters of
first_name
. - Length Calculation: Computes the length of
first_name
. - Trimming: Trims any whitespace around
first_name
. - Case Conversion: Converts
first_name
to upper case andlast_name
to lower case. - Regex Replace: Replaces all digits in
first_name
with asterisks. - Regex Extract: Extracts digits from
first_name
. - Split: Splits the
full_name
into a list offirst_name
andlast_name
. - Replace: Replaces ‘Doe456’ with ‘Smith’ in
last_name
. - Translate: Translates ‘456’ to ‘789’ in
last_name
.
These examples demonstrate the versatility of string manipulation functions in PySpark, allowing for complex transformations and processing of text data.
Here’s a table summarizing common string manipulation functions in PySpark SQL, including regular expressions (regex):
Function | Description | Example (Input: “Hello World”) |
---|---|---|
initcap(str) | Converts first letter of each word to uppercase | “Hello World” -> “Hello World” (already capitalized words remain unchanged) |
lower(str) | Converts all characters to lowercase | “Hello World” -> “hello world” |
upper(str) | Converts all characters to uppercase | “Hello World” -> “HELLO WORLD” |
lpad(str, length, pad_string) | Pads a string to the left with a specified string | “Hello World”, 15, “-” -> “—Hello World” |
rpad(str, length, pad_string) | Pads a string to the right with a specified string | “Hello World”, 15, “-” -> “Hello World—“ |
ltrim(str) | Removes leading whitespace characters | ” Hello World” -> “Hello World” |
rtrim(str) | Removes trailing whitespace characters | “Hello World ” -> “Hello World” |
trim(str) | Removes leading and trailing whitespace characters | ” Hello World ” -> “Hello World” |
length(str) | Returns the length (number of characters) of a string | “Hello World” -> 11 |
substr(str, pos, len) | Extracts a substring from a string | “Hello World”, 7, 5 -> “World” |
instr(str, substr) | Returns the starting position of a substring within a string (0 if not found) | “Hello World”, “World” -> 7 |
concat(str1, str2…) | Concatenates (joins) multiple strings | “Hello”, ” “, “World” -> “Hello World” |
concat_ws(sep, str1, str2…) | Concatenates strings with a specified separator | “Hello”, “,”, “World” -> “Hello,World” |
regexp_extract(str, pattern) | Extracts a pattern (regular expression) from a string | “Hello_World”, r”[^]+” -> “_World” |
regexp_replace(str, pattern, replacement) | Replaces a pattern (regular expression) in a string with a replacement string | “Hello_World”, r”_”, ” ” -> “Hello World” |
split(str, delimiter) | Splits a string into an array based on a delimiter | “Hello,World,How”, “,” -> [“Hello”, “World”, “How”] |
array_join(arr, sep) | Joins the elements of an array into a string with a specified separator | [“Hello”, “World”, “How”], “,” -> “Hello,World,How” |
soundex(str) | Returns the Soundex code of a string (phonetic equivalent) | “Hello” -> “H400” |
translate(str, remove, replace) | Removes characters from a string and replaces them with | “Hello World”, “e”, “” -> “Hllo World” |
overlay(str, overlay_str, pos, len) | Replaces a substring within a string with another | “Hello World”, “there”, 6, 5 -> “Hello thered” |
reverse(str) | Reverses the order of characters in a string | “Hello World” -> “dlroW olleH” |
instrc(str, substr) | Returns the starting position of a substring within a string (case-insensitive, 0 if not found) | “Hello World”, “world” -> 7 |
levenshtein(str1, str2) | Calculates the Levenshtein distance (minimum number of edits) to transform one string to another | “Hello”, “Jello” -> 1 |
Summary in Detail:-
a guide on how to perform common string manipulation tasks in PySpark:-
concat: Concatenates two or more strings.
Syntax: concat(col1, col2, ..., colN)
Example:
from pyspark.sql.functions import concat, col
df = df.withColumn("Full Name", concat(col("First Name"), lit(" "), col("Last Name")))
substr: Extracts a substring from a string.
Syntax: substr(col, start, length)
Example:
from pyspark.sql.functions import substr, col
df = df.withColumn("First Name", substr(col("Name"), 1, 4))
split: Splits a string into an array of substrings.
Syntax: split(col, pattern)
Example:
from pyspark.sql.functions import split, col
df = df.withColumn("Address Parts", split(col("Address"), " "))
regex_extract: Extracts a substring using a regular expression.
Syntax: regex_extract(col, pattern, group)
Example:
from pyspark.sql.functions import regex_extract, col
df = df.withColumn("Phone Number", regex_extract(col("Contact Info"), "d{3}-d{3}-d{4}", 0))
translate: Replaces specified characters in a string.
Syntax: translate(col, matching, replace)
Example:
from pyspark.sql.functions import translate, col
df = df.withColumn("Clean Name", translate(col("Name"), "aeiou", "AEIOU"))
trim: Removes leading and trailing whitespace from a string.
Syntax: trim(col)
Example:
from pyspark.sql.functions import trim, col
df = df.withColumn("Clean Address", trim(col("Address")))
lower: Converts a string to lowercase.
Syntax: lower(col)
Example:
from pyspark.sql.functions import lower, col
df = df.withColumn("Lower Name", lower(col("Name")))
upper: Converts a string to uppercase.
Syntax: upper(col)
Example:
from pyspark.sql.functions import upper, col
df = df.withColumn("Upper Name", upper(col("Name")))
String Data Cleaning in PySpark
Here are some common string data cleaning functions in PySpark, along with their syntax and examples:
trim: Removes leading and trailing whitespace from a string.
Syntax: trim(col)
Example:
from pyspark.sql.functions import trim, col
df = df.withColumn("Clean Address", trim(col("Address")))
regexp_replace: Replaces substrings matching a regular expression.
Syntax: regexp_replace(col, pattern, replacement)
Example:
from pyspark.sql.functions import regexp_replace, col
df = df.withColumn("Clean Name", regexp_replace(col("Name"), "[^a-zA-Z]", ""))
replace: Replaces specified characters or substrings in a string.
Syntax: replace(col, matching, replace)
Example:
from pyspark.sql.functions import replace, col
df = df.withColumn("Clean Address", replace(col("Address"), " ", ""))
remove_accents: Removes accents from a string.
Syntax: remove_accents(col)
Example:
from pyspark.sql.functions import remove_accents, col
df = df.withColumn("Clean Name", remove_accents(col("Name")))
standardize: Standardizes a string by removing punctuation and converting to lowercase.
Syntax: standardize(col)
Example:
from pyspark.sql.functions import standardize, col
df = df.withColumn("Standardized Name", standardize(col("Name")))
Good Post but diagrams or chart wise explaining would have been better.