
PySpark DataFrame Programming – Operations, Functions, Statements, and Syntax with Examples

by lochan2014 | Jul 2, 2024 | Pyspark

Checking Schema, Data Types & Displaying Records with Conditions in PySpark 🚀


1. Checking Schema & Data Types

Method 1: Using .printSchema()

Prints the schema of the DataFrame in a tree format.

df.printSchema()

Example Output:

root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Department: string (nullable = true)
 |-- Salary: double (nullable = true)
You can also inspect a schema programmatically. A StructType is built from StructField objects, one per column, and the same loop works on df.schema.fields:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Loop over the fields to print each column's name and type
for field in schema.fields:
    print(f"Column: {field.name}, DataType: {field.dataType.simpleString()}")

# df.schema.fields: a list of StructField objects, one per DataFrame column.
# field.dataType: the data type object (e.g., StringType(), IntegerType()) for that column.
# simpleString(): converts the data type object to a short, readable string (e.g., "string", "int").

Method 2: Using .dtypes

Returns a list of tuples containing column names and their data types.

df.dtypes

df.dtypes is an attribute of a PySpark DataFrame that returns a list of tuples, where each tuple contains a column name and its data type as a string.
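Because the result is a plain Python list, it is easy to filter. For example, to pick out only the string-typed columns (a small illustrative snippet using the sample df from this post):

# Collect the names of all columns whose declared type is "string"
string_cols = [name for name, dtype in df.dtypes if dtype == "string"]
print(string_cols)  # e.g. ['Name', 'Department']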

Alternative: Using df.schema.fields

You can produce output similar to df.dtypes with:
[(field.name, field.dataType.simpleString()) for field in df.schema.fields]

Example Output:

[('ID', 'int'), ('Name', 'string'), ('Age', 'int'), ('Department', 'string'), ('Salary', 'double')]

Key Differences Between df.dtypes and df.schema.fields

Feature          | df.dtypes                                     | df.schema.fields
-----------------|-----------------------------------------------|----------------------------------------------------------
Type of Output   | List of tuples ([(col_name, dtype)])          | List of StructField objects
Data Type Format | String representation (e.g., "int", "string") | Full DataType object (e.g., IntegerType(), StringType())
Use Case         | Quick lookup of column names & types          | More detailed schema operations
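To see the difference concretely (illustrative output, based on the sample df above):

print(df.dtypes[0])          # ('ID', 'int') – a plain (name, type) tuple of strings
print(df.schema.fields[0])   # StructField('ID', IntegerType(), True) – a full StructField object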

Method 3: Using .schema

Returns a StructType object describing the schema.

df.schema

Example Output:

StructType([StructField('ID', IntegerType(), True),
            StructField('Name', StringType(), True),
            StructField('Age', IntegerType(), True),
            StructField('Department', StringType(), True),
            StructField('Salary', DoubleType(), True)])
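
One practical use of the StructType returned by df.schema is reusing it elsewhere, for example to create an empty DataFrame with an identical structure (a minimal sketch, assuming the spark session and df from this post):

# Build an empty DataFrame that shares df's exact schema
empty_df = spark.createDataFrame([], schema=df.schema)
empty_df.printSchema()  # prints the same tree as df.printSchema()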

2. Displaying Records

Method 1: .show()

Displays the first 20 rows by default.

df.show()

Method 2: Show Specific Number of Rows

df.show(5)

Method 3: Show Full Column Content (No Truncation)

df.show(truncate=False)

3. Filtering & Displaying Records with Multiple Conditions

Using .filter() with Multiple Conditions

df_filtered = df.filter((df["Age"] > 30) & (df["Salary"] > 50000))
df_filtered.show()

Using .where() (Same as .filter())

df_filtered = df.where((df["Age"] > 30) & (df["Department"] == "IT"))
df_filtered.show()

Using SQL Expressions (expr())

from pyspark.sql.functions import expr

df_filtered = df.filter(expr("Age > 30 AND Salary > 50000"))
df_filtered.show()
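
The same conditions can also be composed with column helpers such as isin(), which reads more cleanly than chained equality checks (a related sketch using the sample columns above):

from pyspark.sql.functions import col

# isin() checks set membership; useful when a column must match one of several values
df_filtered = df.filter((col("Age") > 30) & col("Department").isin("IT", "Finance"))
df_filtered.show()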

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, sum, avg, count, explode, row_number, array
from pyspark.sql.window import Window

# Initialize Spark Session
spark = SparkSession.builder.appName("PySpark_Useful_Functions").getOrCreate()

# Sample Data
data = [(1, "Alice", 25, "HR", 50000),
        (2, "Bob", 30, "IT", 60000),
        (3, "Charlie", 35, "IT", 70000),
        (4, "David", 40, "Finance", 80000),
        (5, "Eve", 45, "Finance", 90000)]
columns = ["ID", "Name", "Age", "Department", "Salary"]

df = spark.createDataFrame(data, columns)

# 1. Alias (Renaming Columns Temporarily)
df_alias = df.select(col("Name").alias("Full_Name"), col("Age"))
df_alias.show()

# 2. Distinct (Remove Duplicates)
df_distinct = df.select("Department").distinct()
df_distinct.show()

# 3. Filtering Data
df_filtered = df.filter((df["Age"] > 30) & (df["Department"] == "IT"))
df_filtered.show()

# 4. Adding & Modifying Columns
df_new = df.withColumn("New_Column", lit("DefaultValue"))
df_casted = df.withColumn("Salary", df["Salary"].cast("double"))
df_new.show()

df_casted.printSchema()

# 5. Aggregations (Sum, Average, Count)
df_grouped = df.groupBy("Department").agg(
    sum("Salary").alias("Total_Salary"),
    avg("Age").alias("Average_Age")
)
df_grouped.show()

# 6. Sorting
df_sorted = df.orderBy("Age", ascending=False)
df_sorted.show()

# 7. Joining DataFrames
extra_data = [(1, "US"), (2, "Canada"), (3, "UK"), (4, "Germany"), (5, "India")]
columns_extra = ["ID", "Country"]
df_extra = spark.createDataFrame(extra_data, columns_extra)
df_joined = df.join(df_extra, "ID", "inner")
df_joined.show()
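# Note: the third argument selects the join type; besides "inner", PySpark
# also supports "left", "right", "outer", "left_semi", and "left_anti".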

# 8. Exploding Nested (Array) Data
df_nested = df.withColumn("Hobbies", array(lit("Reading"), lit("Sports")))
df_exploded = df_nested.withColumn("Hobby", explode(col("Hobbies")))  # one row per array element
df_exploded.show()

# 9. Collecting Rows
rows = df.collect()
for row in rows:
    print(row)

# 10. Row Numbering & Ranking
windowSpec = Window.partitionBy("Department").orderBy("Salary")
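# row_number() assigns a unique sequential number within each partition;
# rank() and dense_rank() (also in pyspark.sql.functions) would allow ties instead.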
df_ranked = df.withColumn("Rank", row_number().over(windowSpec))
df_ranked.show()

# 11. Checking Schema & Data Types
df.printSchema()
print(df.dtypes)
print(df.schema)

# 12. Displaying Data with .show() Options
# Default show (20 rows, truncated at 20 characters)
df.show()

# Show a specific number of rows
df.show(5)

# Prevent truncation (full column display)
df.show(truncate=False)

# Adjust truncation length
df.show(truncate=50)

# Show data in vertical format
df.show(n=5, vertical=True)

# Combine options
df.show(n=10, truncate=40, vertical=True)

# Stop Spark Session
spark.stop()

The .show() method in PySpark has multiple options to customize how the DataFrame is displayed. Here’s a breakdown:


1. Basic Usage

df.show()
  • Displays the first 20 rows by default.
  • Truncates long strings to 20 characters.

2. Show a Specific Number of Rows

df.show(5)  # Show first 5 rows
  • Limits the output to 5 rows.

3. Prevent Truncation (Full Column Display)

df.show(truncate=False)
  • By default, columns are truncated to 20 characters.
  • Setting truncate=False displays the full content of each column.

✅ Example:

data = [(1, "A very long text that should not be truncated"),
        (2, "Short text")]
df = spark.createDataFrame(data, ["ID", "Description"])

df.show(truncate=False)

Output:

+---+---------------------------------------------+
|ID |Description                                  |
+---+---------------------------------------------+
|1  |A very long text that should not be truncated|
|2  |Short text                                   |
+---+---------------------------------------------+

4. Adjust Column Truncation Length

df.show(truncate=50)
  • Instead of truncating at 20 characters, this truncates at 50 characters.

5. Display in Vertical Format (vertical=True)

df.show(n=5, vertical=True)
  • Displays each row vertically, which is useful when there are too many columns.

✅ Example Output:

-RECORD 0------------------------
 ID          | 1
 Name        | Alice
 Age         | 25
 Department  | HR
 Salary      | 50000.0
-RECORD 1------------------------
 ID          | 2
 Name        | Bob
 Age         | 30
 Department  | IT
 Salary      | 60000.0

6. Combine Options

df.show(n=10, truncate=40, vertical=True)
  • Shows 10 rows
  • Truncates column data at 40 characters
  • Displays data in vertical format

Summary of .show() Options

Option                  | Description                                            | Example
------------------------|--------------------------------------------------------|------------------------
df.show()               | Default (20 rows, truncates long values at 20 chars)   | df.show()
df.show(n)              | Displays the first n rows                              | df.show(5)
df.show(truncate=False) | Displays full column values without truncation         | df.show(truncate=False)
df.show(truncate=n)     | Truncates column values at n characters                | df.show(truncate=50)
df.show(vertical=True)  | Displays rows vertically instead of in tabular format  | df.show(vertical=True)
