
Useful Code Snippets in Python and Pyspark

by lochan2014 | Jan 7, 2025 | Pyspark, Python

PySpark DataFrames and native Python data structures (dictionaries, lists, sets, and tuples) serve different purposes. Let’s explore their use cases and the best scenarios for each.


1️⃣ PySpark DataFrame vs. Python Dictionary/List

Feature | PySpark DataFrame | Python Dictionary/List
Best for | Big data processing | Small data processing
Scalability | Handles large-scale distributed data | Works in-memory, limited by RAM
Parallel processing | Distributed across clusters (Spark executors) | Runs on a single machine
Operations | SQL-like queries, transformations, aggregations | Fast lookup (dict), sequential iteration (list)
Integration | Works with Spark SQL, MLlib, DataFrame API | Native Python, NumPy, Pandas, JSON

2️⃣ When to Use PySpark?

🔹 For Big Data Processing

  • Large datasets that cannot fit in memory
  • Distributed processing across clusters
  • ETL (Extract, Transform, Load) workflows
  • SQL-like queries on structured/unstructured data

Example: Processing Millions of Rows

from pyspark.sql import functions as F

# Read a large CSV with header and schema inference, then aggregate per category
df = spark.read.csv("large_data.csv", header=True, inferSchema=True)
df.groupBy("category").agg(F.avg("price").alias("avg_price")).show()
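
ETL workflows usually end with a distributed write; a one-line sketch (the Parquet output path here is a hypothetical example):

# Persist the aggregated result in a compressed, splittable format
df.groupBy("category").agg(F.avg("price").alias("avg_price")) \
    .write.mode("overwrite").parquet("output/category_averages")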

3️⃣ When to Use Python Dictionaries?

🔹 For Small, Fast Lookups

  • Storing key-value pairs in-memory
  • Caching precomputed results
  • Quick lookup for transformations

Example: Mapping a dictionary to a column in PySpark

from pyspark.sql import functions as F

mapping_dict = {"A": "Category 1", "B": "Category 2", "C": "Category 3"}

# Flatten the dict into alternating key/value literals for create_map()
mapping_expr = F.create_map([F.lit(x) for x in sum(mapping_dict.items(), ())])

df = df.withColumn("category_name", mapping_expr[F.col("category")])
df.show()

💡 Why? Keeping the lookup as an in-memory map literal avoids an expensive join (and the shuffle it can trigger).


4️⃣ When to Use Python Lists?

🔹 For Ordered Sequences

  • Storing a list of items (e.g., column names)
  • Iterating through small data collections

Example: Selecting Multiple Columns

columns = ["name", "age", "salary"]
df.select(*columns).show()
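
Lists also work well for looping over columns to apply the same transformation to each; a minimal sketch (the column names and the cast to double are illustrative assumptions):

from pyspark.sql import functions as F

numeric_columns = ["age", "salary"]

# Cast each listed column to double, one withColumn call per entry
for c in numeric_columns:
    df = df.withColumn(c, F.col(c).cast("double"))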

5️⃣ When to Use Python Sets?

🔹 For Fast Membership Checks

  • Checking if a value exists
  • Removing duplicates

Example: Filtering a PySpark DataFrame with a Set

valid_categories = {"Electronics", "Clothing", "Books"}
df = df.filter(F.col("category").isin(valid_categories))

💡 Why? In plain Python, membership checks against a set are O(1) versus O(n) for a list; the set also guarantees the allowed values are unique before they are passed to isin().
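
Sets are also the natural target when pulling a column’s distinct values back to the driver, e.g. for the deduplication use case above; a small sketch:

# Collect the distinct values of a column into a Python set
distinct_categories = {row["category"] for row in df.select("category").distinct().collect()}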


6️⃣ Converting Between PySpark & Python Data Structures

PySpark → Python Dictionary

# collectAsMap() expects (key, value) pairs, so select exactly two columns first
result_dict = df.select("name", "age").limit(5).rdd.collectAsMap()
print(result_dict)
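
If the DataFrame has more than two columns, a dictionary comprehension over the collected Rows is a safer variant (a sketch, assuming name and age columns exist):

# Build {name: age} from Row objects; works regardless of how many columns df has
result_dict = {row["name"]: row["age"] for row in df.limit(5).collect()}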

Python Dictionary → PySpark DataFrame

data = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]
df = spark.createDataFrame(data)
df.show()
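
Schema inference from a list of dicts works for simple cases; for anything production-grade, an explicit schema avoids surprises (a sketch):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.createDataFrame(data, schema)
df.show()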

🔥 Summary: When to Use What?

Use Case | Best Choice
Large-scale data processing | PySpark DataFrame
Small key-value lookups | Python Dictionary
Ordered sequence operations | Python List
Fast membership checks | Python Set
Small structured data | Pandas DataFrame
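
The last row mentions Pandas: a small Spark DataFrame can be converted with toPandas(), which collects every row to the driver, so it is only safe for data that fits in memory (the limit(1000) below is an illustrative guard):

# Convert a small Spark DataFrame to a Pandas DataFrame (collects to the driver)
pandas_df = df.limit(1000).toPandas()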

Let’s break the dictionary-mapping example from section 3️⃣ down step by step:


🔹 Problem Statement

You have a PySpark DataFrame with a column (e.g., "category") containing values like "A", "B", and "C". You want to map these values to human-readable names using a Python dictionary.

Example Input DataFrame (df):

category
A
B
C
A

Mapping Dictionary:

mapping_dict = {"A": "Category 1", "B": "Category 2", "C": "Category 3"}

Goal: Add a new column "category_name" where "A" is replaced with "Category 1", "B" with "Category 2", etc.


🔹 Step-by-Step Explanation

from pyspark.sql import functions as F

# Step 1: Define a mapping dictionary
mapping_dict = {"A": "Category 1", "B": "Category 2", "C": "Category 3"}

This dictionary contains key-value pairs where:

  • "A" maps to "Category 1"
  • "B" maps to "Category 2"
  • "C" maps to "Category 3"

🔹 Step 2: Convert Dictionary into PySpark Expression

mapping_expr = F.create_map([F.lit(x) for x in sum(mapping_dict.items(), ())])

Breaking It Down:

  1. mapping_dict.items()
    • mapping_dict.items() returns a view of the key-value pairs: dict_items([('A', 'Category 1'), ('B', 'Category 2'), ('C', 'Category 3')])
  2. sum(mapping_dict.items(), ())
    • sum(iterable, start) is a trick to flatten a sequence of tuples into a single tuple (see the itertools.chain variant after this list for the more idiomatic spelling).
    • It converts [('A', 'Category 1'), ('B', 'Category 2'), ('C', 'Category 3')] into ('A', 'Category 1', 'B', 'Category 2', 'C', 'Category 3')
  3. [F.lit(x) for x in ...]
    • F.lit(x) converts each element into a PySpark literal (a constant value).
    • So, F.create_map() gets key-value pairs as PySpark expressions.
  4. Final mapping_expr Output:
    • Creates a PySpark map column expression like: create_map(lit("A"), lit("Category 1"), lit("B"), lit("Category 2"), lit("C"), lit("Category 3"))
    • This is a PySpark SQL map literal, which behaves like a dictionary.
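
The sum(..., ()) trick re-concatenates tuples at every step, so it degrades quadratically as the dictionary grows; itertools.chain is the more idiomatic way to flatten the pairs:

from itertools import chain
from pyspark.sql import functions as F

# chain(*items) yields 'A', 'Category 1', 'B', 'Category 2', ... with no intermediate tuples
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping_dict.items())])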

🔹 Step 3: Apply Mapping to DataFrame

df = df.withColumn("category_name", mapping_expr[F.col("category")])

  • F.col("category") selects the "category" column.
  • mapping_expr[F.col("category")] retrieves the mapped value for each row.
  • .withColumn("category_name", ...) adds the mapped values as a new column; any key missing from the map comes back as NULL (see the guard below).
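
A minimal guard against unmapped keys, using F.coalesce() with "Unknown" as an assumed fallback label:

# Replace NULLs produced by unmapped keys with a default label
df = df.withColumn(
    "category_name",
    F.coalesce(mapping_expr[F.col("category")], F.lit("Unknown")),
)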

🔹 Final Output

category | category_name
A | Category 1
B | Category 2
C | Category 3
A | Category 1

Now, your DataFrame has a "category_name" column based on the dictionary mappings.


🔥 Why Use This Approach Instead of join?

  1. Faster for Small Dictionaries
    • If the mapping dictionary is small, F.create_map() embeds the lookup directly in each task’s column expression, whereas a DataFrame .join() introduces a second DataFrame and, unless it is broadcast, a shuffle (see the join equivalent sketched below).
  2. Avoids Broadcast Joins
    • Even a broadcast join must ship the lookup table to every executor and run join logic; a map literal is just a lightweight column expression.
  3. Works Dynamically
    • You can modify mapping_dict at any time and rebuild the expression; no lookup DataFrame has to be maintained.
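
For comparison, the .join() equivalent would look roughly like this (a sketch; F.broadcast() hints to Spark that the lookup table is small enough to replicate to every executor):

from pyspark.sql import functions as F

# Build a tiny lookup DataFrame from the dictionary and broadcast-join it
lookup_df = spark.createDataFrame(list(mapping_dict.items()), ["category", "category_name"])
df = df.join(F.broadcast(lookup_df), on="category", how="left")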

🚀 Alternative Approach: Using F.when()

If your mapping dictionary is small, you can also use F.when():

df = df.withColumn(
    "category_name",
    F.when(F.col("category") == "A", "Category 1")
     .when(F.col("category") == "B", "Category 2")
     .when(F.col("category") == "C", "Category 3")
     .otherwise("Unknown")
)

🔴 Disadvantage: every mapping requires a hand-written branch, so this does not scale to large dictionaries (though the chain can be generated from the dictionary, as sketched below).
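
If you prefer when() semantics without hand-writing each branch, the chain can be built from the dictionary in a loop (a sketch):

from pyspark.sql import functions as F

# Fold the dictionary into one chained when() expression
expr = None
for key, value in mapping_dict.items():
    cond = F.col("category") == key
    expr = F.when(cond, value) if expr is None else expr.when(cond, value)

df = df.withColumn("category_name", expr.otherwise("Unknown"))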


✅ Summary

Approach | Best Use Case
F.create_map() | Mapping a small dictionary onto a PySpark DataFrame
.join() with another DataFrame | Mapping against a large lookup dataset
F.when() | A handful of conditions (roughly 3-5 values)
