# Apply Dynamic Logic to All Columns: Trim All String Columns
```python
from pyspark.sql.functions import trim, col

df = df.select([trim(col(c)).alias(c) if dtype == "string" else col(c) for c, dtype in df.dtypes])
```
Doesn't Python have a built-in trim function? Then why do we need to import one from pyspark.sql?
You're right to ask about this! Python does have a built-in `strip()` method for trimming strings. However, when working with PySpark DataFrames, we need to use PySpark functions instead of native Python functions. Here's why:
### 1. PySpark Works on Distributed Data
PySpark operates on data distributed across multiple machines using the Spark engine. Standard Python functions like `strip()` only work on local Python objects, not on distributed DataFrame columns.
### 2. Need for PySpark SQL Functions
PySpark provides its own set of SQL functions (`pyspark.sql.functions`), which are optimized for distributed processing. The `trim()` function from `pyspark.sql.functions` works on entire DataFrame columns efficiently.
### 3. Why Not Use `strip()` Instead?
If you tried something like:
```python
df = df.withColumn("col1", df["col1"].strip())  # ❌ This won't work!
```
it would throw an error, because `strip()` is a method of Python strings, not of a PySpark Column.
Instead, you must use:
```python
from pyspark.sql.functions import trim

df = df.withColumn("col1", trim(df["col1"]))  # ✅ This works in PySpark
```
### 4. Applying to All String Columns
The logic I suggested:
```python
from pyspark.sql.functions import trim, col

df = df.select([trim(col(c)).alias(c) if dtype == "string" else col(c) for c, dtype in df.dtypes])
```
- Uses `trim(col(c))` to apply trimming only to string-type columns.
- Keeps other column types unchanged.
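To make the pattern concrete, here is a minimal, self-contained sketch on a made-up DataFrame (the column names and sample rows are assumptions for illustration only):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, col

spark = SparkSession.builder.appName("TrimAllStrings").getOrCreate()

# Hypothetical sample data: two string columns and one integer column
df = spark.createDataFrame(
    [("  John ", " NY ", 30), (" Alice", "LA  ", 25)],
    ["name", "city", "age"],
)

# df.dtypes returns [(column_name, type_name), ...], e.g. [("name", "string"), ("age", "bigint")]
trimmed = df.select(
    [trim(col(c)).alias(c) if dtype == "string" else col(c) for c, dtype in df.dtypes]
)
trimmed.show()  # string columns are trimmed, "age" passes through unchanged
```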
### Final Thought
If you're working outside PySpark, use Python's `strip()`. But inside PySpark, always use `pyspark.sql.functions.trim()` for better performance and distributed execution.
Yes! PySpark provides a rich set of string functions in the `pyspark.sql.functions` module. Here's a list of commonly used string functions along with their descriptions:
### 🔹 1. String Manipulation Functions
Function | Description | Example |
---|---|---|
trim(col) | Removes leading and trailing spaces from a string. | trim(col("name")) |
ltrim(col) | Removes leading spaces from a string. | ltrim(col("name")) |
rtrim(col) | Removes trailing spaces from a string. | rtrim(col("name")) |
lower(col) | Converts a string to lowercase. | lower(col("name")) → "hello" |
upper(col) | Converts a string to uppercase. | upper(col("name")) → "HELLO" |
initcap(col) | Capitalizes the first letter of each word. | initcap(col("name")) → "Hello World" |
reverse(col) | Reverses the string. | reverse(col("name")) → "olleH" |
concat(col1, col2, …) | Concatenates multiple string columns. | concat(col("first"), col("last")) |
concat_ws(sep, col1, col2, …) | Concatenates strings with a separator. | concat_ws("-", col("first"), col("last")) → "John-Doe" |
format_string(fmt, col1, col2, …) | Formats strings using C-style format. | format_string("Name: %s", col("name")) |
repeat(col, n) | Repeats a string n times. | repeat(col("name"), 3) → "HelloHelloHello" |
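As a quick illustration of a few of these, here is a hedged sketch; the `first`/`last` columns and sample rows are invented for the demo:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, initcap, concat_ws, format_string, repeat

spark = SparkSession.builder.appName("StringManipulation").getOrCreate()

df = spark.createDataFrame([("john", "doe"), ("jane", "smith")], ["first", "last"])

df.select(
    initcap(col("first")).alias("first_cap"),                   # "john"  -> "John"
    concat_ws("-", col("first"), col("last")).alias("joined"),  # "john-doe"
    format_string("Name: %s %s", col("first"), col("last")).alias("label"),
    repeat(col("first"), 2).alias("doubled"),                   # "johnjohn"
).show(truncate=False)
```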
### 🔹 2. Substring & Splitting Functions
Function | Description | Example |
---|---|---|
substring(col, pos, len) | Extracts a substring starting at pos (1-based index) with length len. | substring(col("name"), 2, 3) → "ell" |
split(col, pattern) | Splits a string using a regex pattern and returns an array. | split(col("email"), "@") |
regexp_extract(col, pattern, group) | Extracts regex pattern from a string. | regexp_extract(col("email"), "(\\w+)@(\\w+)", 1) |
regexp_replace(col, pattern, replacement) | Replaces occurrences of a regex pattern in a string. | regexp_replace(col("name"), "a", "o") |
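A minimal sketch of these on a hypothetical `email` column (the sample address and regex are assumptions for the example):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring, split, regexp_extract, regexp_replace

spark = SparkSession.builder.appName("SubstringSplit").getOrCreate()

df = spark.createDataFrame([("john.doe@example.com",)], ["email"])

df.select(
    substring(col("email"), 1, 8).alias("first_8_chars"),                   # "john.doe"
    split(col("email"), "@").alias("parts"),                                # ["john.doe", "example.com"]
    regexp_extract(col("email"), r"([\w.]+)@([\w.]+)", 2).alias("domain"),  # "example.com"
    regexp_replace(col("email"), r"\.", "_").alias("dots_replaced"),        # "john_doe@example_com"
).show(truncate=False)
```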
### 🔹 3. Padding Functions
Function | Description | Example |
---|---|---|
lpad(col, len, pad) | Left-pads a string with pad to a total length of len. | lpad(col("name"), 10, "*") → "*****Hello" |
rpad(col, len, pad) | Right-pads a string with pad to a total length of len. | rpad(col("name"), 10, "*") → "Hello*****" |
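A minimal sketch, assuming a hypothetical `invoice_no` column, showing left- and right-padding to a fixed width:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lpad, rpad

spark = SparkSession.builder.appName("Padding").getOrCreate()

df = spark.createDataFrame([("42",), ("1234",)], ["invoice_no"])

df.select(
    lpad(col("invoice_no"), 6, "0").alias("left_padded"),   # "42" -> "000042"
    rpad(col("invoice_no"), 6, "*").alias("right_padded"),  # "42" -> "42****"
).show()
```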
### 🔹 4. Finding & Replacing Functions
Function | Description | Example |
---|---|---|
locate(substr, col, pos=1) | Finds position of substring substr (1-based index). Returns 0 if not found. | locate("l", col("name")) → 3 |
instr(col, substr) | Finds the position of a substring (like locate , but with the argument order reversed). | instr(col("name"), "l") |
translate(col, from_str, to_str) | Replaces characters in a string (like tr in Unix). | translate(col("name"), "aeiou", "12345") |
overlay(col, replace, pos, len) | Replaces part of a string with another string. | overlay(col("name"), "XYZ", 2, 3) |
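A small sketch of these on the word "Hello" (the `name` column is assumed for the example); note that `overlay()` expects the replacement as a column, hence `lit()`:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, locate, instr, translate, overlay

spark = SparkSession.builder.appName("FindReplace").getOrCreate()

df = spark.createDataFrame([("Hello",)], ["name"])

df.select(
    locate("l", col("name")).alias("first_l_pos"),             # 3 (1-based; 0 if not found)
    instr(col("name"), "l").alias("instr_pos"),                # 3
    translate(col("name"), "lo", "10").alias("translated"),    # "He110"
    overlay(col("name"), lit("XYZ"), 2, 3).alias("overlaid"),  # "HXYZo"
).show()
```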
### 🔹 5. Length Functions
Function | Description | Example |
---|---|---|
length(col) | Returns the length of a string. | length(col("name")) |
char_length(col) | Alias for length() . | char_length(col("name")) |
octet_length(col) | Returns the number of bytes in a string. | octet_length(col("name")) |
bit_length(col) | Returns the bit length of a string. | bit_length(col("name")) |
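A small sketch showing where these length functions differ on multi-byte characters (the sample word is just for illustration; `octet_length`/`bit_length` are called through `expr()` here so the snippet does not depend on which PySpark release exposes them as Python helpers):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, length, expr

spark = SparkSession.builder.appName("LengthFunctions").getOrCreate()

# "café" has 4 characters but 5 bytes in UTF-8, which is where these functions differ
df = spark.createDataFrame([("café",)], ["word"])

df.select(
    length(col("word")).alias("chars"),          # 4
    expr("octet_length(word)").alias("bytes"),   # 5
    expr("bit_length(word)").alias("bits"),      # 40
).show()
```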
### 🔹 6. Checking String Patterns
These pattern checks are methods on a `Column` rather than standalone functions, so they are called as `col("name").like(...)` and so on.
Function | Description | Example |
---|---|---|
col.like(pattern) | Checks if the column matches a SQL LIKE pattern (% = any characters, _ = single character). | col("name").like("J%") |
col.rlike(pattern) | Checks if the column matches a regex pattern. | col("name").rlike("^J.*") |
col.contains(substr) | Checks if the column contains a substring. | col("name").contains("John") |
col.startswith(substr) | Checks if the column starts with a substring. | col("name").startswith("J") |
col.endswith(substr) | Checks if the column ends with a substring. | col("name").endswith("n") |
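Since these are Column methods, they usually appear inside `filter()`/`where()` or `select()`. A small sketch on invented names:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("PatternChecks").getOrCreate()

df = spark.createDataFrame([("John",), ("Alice",), ("Jordan",)], ["name"])

df.filter(col("name").like("J%")).show()       # names starting with "J" (SQL LIKE)
df.filter(col("name").rlike("^J.*n$")).show()  # regex: starts with J, ends with n
df.select(
    col("name"),
    col("name").contains("o").alias("has_o"),
    col("name").startswith("A").alias("starts_A"),
    col("name").endswith("n").alias("ends_n"),
).show()
```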
### Example Usage
Here's how you can use these functions in a PySpark DataFrame:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim, upper, length, split, concat_ws

# Initialize Spark
spark = SparkSession.builder.appName("StringFunctions").getOrCreate()

# Sample DataFrame
data = [(" John ", "john.doe@example.com"),
        ("Alice", "alice@example.com"),
        ("Bob ", "bob@example.com")]
df = spark.createDataFrame(data, ["name", "email"])

# Apply string functions
df = df.select(
    trim(col("name")).alias("trimmed_name"),        # Remove spaces
    upper(col("name")).alias("uppercase_name"),     # Convert to uppercase
    length(col("name")).alias("name_length"),       # Get string length
    split(col("email"), "@").alias("email_parts"),  # Split email
    concat_ws("-", col("name"), col("email")).alias("full_info")  # Concatenate columns
)

df.show(truncate=False)
```
### Conclusion
PySpark has many built-in string functions that help with cleaning, formatting, and processing text data efficiently. Using native PySpark functions ensures performance optimization in a distributed environment.
So for string functions we'll use the built-in PySpark functions. But what about list and dictionary objects, which are used so frequently in PySpark scripts? Which functions should we use for those, and where do they get executed, given Spark's distributed nature?
Great question! In PySpark, handling lists and dictionaries depends on whether they are used inside:
- PySpark DataFrame operations (distributed execution)
- Python driver code (local execution)
### 1. List & Dictionary in PySpark (Distributed Execution)
When working with DataFrame columns, PySpark has built-in functions to manipulate lists and dictionaries in a distributed manner. These functions run on Spark workers across the cluster.
#### Handling Lists (ArrayType in PySpark)
PySpark provides SQL functions for handling list (array) columns inside DataFrames; a combined, runnable sketch follows these snippets.
- **Create lists in a column:**
```python
from pyspark.sql.functions import array, col

df = df.withColumn("new_col", array(col("col1"), col("col2")))  # Creates a list [col1, col2]
```
- **Extract elements from lists:**
```python
from pyspark.sql.functions import col

df = df.withColumn("first_element", col("list_col")[0])  # Get first element of array column
```
- **Check if a value exists in a list column:**
```python
from pyspark.sql.functions import array_contains, col

df = df.withColumn("has_value", array_contains(col("list_col"), "target_value"))
```
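Putting those array snippets together, here is a runnable sketch on a tiny made-up DataFrame (column names like `col1`/`col2` and the color values are assumptions for the demo):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, array, array_contains, size, explode

spark = SparkSession.builder.appName("ArrayColumns").getOrCreate()

# Hypothetical DataFrame with two plain columns that we combine into an array column
df = spark.createDataFrame([("red", "blue"), ("green", "blue")], ["col1", "col2"])

df = df.withColumn("colors", array(col("col1"), col("col2")))

df.select(
    col("colors"),
    col("colors")[0].alias("first_color"),                    # element access
    array_contains(col("colors"), "blue").alias("has_blue"),  # membership test
    size(col("colors")).alias("n_colors"),                    # array length
).show(truncate=False)

# explode() turns each array element into its own row (runs on the workers)
df.select(explode(col("colors")).alias("color")).show()
```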
#### Handling Dictionaries (MapType in PySpark)
PySpark provides a `MapType` column type for handling dictionary-like data; again, a combined sketch follows the snippets below.
- **Create a dictionary (map) column:**
```python
from pyspark.sql.functions import create_map, lit, col

df = df.withColumn(
    "dict_col",
    create_map(lit("key1"), col("value1"), lit("key2"), col("value2")),
)
```
- **Extract a value from the dictionary:**
```python
from pyspark.sql.functions import col

df = df.withColumn("value1", col("dict_col")["key1"])  # Accessing a map key in PySpark
```
- **Check if a key exists in the dictionary column** (note: `map_keys()` returns an array, so use `array_contains()` on it rather than the string method `contains()`):
```python
from pyspark.sql.functions import map_keys, array_contains, col

df = df.withColumn("has_key", array_contains(map_keys(col("dict_col")), "key1"))
```
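And a combined, runnable sketch for map columns (the key/value column names and sample row are invented for illustration):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, create_map, lit, map_keys, map_values, array_contains

spark = SparkSession.builder.appName("MapColumns").getOrCreate()

# Hypothetical key/value columns combined into a MapType column
df = spark.createDataFrame([("color", "red", "size", "M")], ["k1", "v1", "k2", "v2"])

df = df.withColumn(
    "attrs",
    create_map(col("k1"), col("v1"), col("k2"), col("v2")),
)

df.select(
    col("attrs"),
    col("attrs")["color"].alias("color_value"),                        # value lookup by key
    map_keys(col("attrs")).alias("keys"),                              # array of keys
    map_values(col("attrs")).alias("values"),                          # array of values
    array_contains(map_keys(col("attrs")), "size").alias("has_size"),  # key-existence check
).show(truncate=False)
```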
### 2. List & Dictionary in Python Driver Code (Local Execution)
If you’re working outside the DataFrame (in normal Python code), you use Python’s built-in list and dictionary functions. These operations run on the driver node, not on Spark workers.
Example:
```python
# Local execution (not distributed)
my_list = [1, 2, 3]
my_dict = {"key1": "value1", "key2": "value2"}

print(my_list[0])       # Output: 1
print(my_dict["key1"])  # Output: value1
```
### 3. Where Do These Executions Happen?
Operation | Execution Location | Example |
---|---|---|
Python lists/dicts | Local (Driver Node) | my_list[0] , my_dict["key"] |
PySpark array (list) | Distributed (Workers) | col("array_col")[0] |
PySpark map (dict) | Distributed (Workers) | col("map_col")["key"] |
### 4. Converting Between PySpark & Python Objects
If you need to process data using Python's list/dict functions, you must collect the data to the driver (use with caution for large data!).
```python
list_from_spark = df.select("array_col").rdd.map(lambda row: row[0]).collect()  # Collect an array column as Python lists
dict_from_spark = df.select("map_col").rdd.map(lambda row: row[0]).collect()    # Collect a map column as Python dicts
```
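For illustration, here is a small sketch of that round trip on a tiny made-up DataFrame; note that `take(n)` or `first()` is often a safer habit than `collect()` because it caps how many rows reach the driver:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, array

spark = SparkSession.builder.appName("CollectExample").getOrCreate()

df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"]).withColumn("pair", array(col("a"), col("b")))

# collect() pulls every row to the driver as Row objects; fine for small results only
rows = df.select("pair").collect()
pairs = [row["pair"] for row in rows]  # now plain Python lists: [[1, 2], [3, 4]]

# take(n) / first() limit how much data reaches the driver
sample = df.select("pair").take(1)
print(pairs, sample)
```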
### Final Takeaway
- Use PySpark functions (`array`, `create_map`, etc.) for distributed list/dict processing.
- Use Python's built-in list/dict methods only in driver code.
- Avoid `collect()` on large DataFrames: it pulls data into the driver, defeating Spark's distributed nature.