Checking Schema, Data Types & Displaying Records with Conditions in PySpark
1. Checking Schema & Data Types
Method 1: Using .printSchema()
Prints the schema of the DataFrame in a tree format.
df.printSchema()
Example Output:
root
|-- ID: integer (nullable = true)
|-- Name: string (nullable = true)
|-- Age: integer (nullable = true)
|-- Department: string (nullable = true)
|-- Salary: double (nullable = true)
You can also inspect schema fields programmatically. For example, with a manually defined schema:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

for field in schema.fields:
    print(f"Column: {field.name}, DataType: {field.dataType.simpleString()}")
# schema.fields (or df.schema.fields): a list of StructField objects, each representing a column.
# field.dataType: the data type object (e.g., StringType(), IntegerType()) for that column.
# field.dataType.simpleString(): converts the data type object to a short, readable string (e.g., "string", "int").
Method 2: Using .dtypes
Returns a list of tuples containing column names and their data types.
df.dtypes
df.dtypes is an attribute of a PySpark DataFrame that returns a list of tuples, where each tuple contains the column name and its data type as a string.
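A quick sketch of how this is useful in practice, e.g. selecting only the string-typed columns (the variable name string_cols and the column types are assumptions based on the sample data used later on this page):
string_cols = [name for name, dtype in df.dtypes if dtype == "string"]
df.select(string_cols).show()  # e.g. Name and Department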
Alternative: Using df.schema.fields
You can get a similar output to df.dtypes with:
[(field.name, field.dataType.simpleString()) for field in df.schema.fields]
Example Output:
[('ID', 'int'), ('Name', 'string'), ('Age', 'int'), ('Department', 'string'), ('Salary', 'double')]
Key Differences Between df.dtypes and df.schema.fields

| Feature | df.dtypes | df.schema.fields |
|---|---|---|
| Type of Output | List of tuples ([(col_name, dtype)]) | List of StructField objects |
| Data Type Format | String representation (e.g., "int", "string") | Full DataType object (e.g., IntegerType(), StringType()) |
| Use Case | Quick lookup of column names & types | More detailed schema operations |
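As a minimal sketch of a "detailed schema operation" that df.dtypes cannot express on its own: each StructField also carries nullability, which you can inspect directly (nullable_cols is an illustrative name):
nullable_cols = [f.name for f in df.schema.fields if f.nullable]
print(nullable_cols)  # names of all columns that allow nulls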
Method 3: Using .schema
Returns a StructType object describing the schema.
df.schema
Example Output:
StructType([StructField('ID', IntegerType(), True),
StructField('Name', StringType(), True),
StructField('Age', IntegerType(), True),
StructField('Department', StringType(), True),
StructField('Salary', DoubleType(), True)])
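One practical use, shown here as a sketch: the StructType returned by .schema can be passed straight back to createDataFrame, so new data picks up an identical schema (the row values below are illustrative):
new_rows = [(6, "Frank", 29, "HR", 52000.0)]
df_same_schema = spark.createDataFrame(new_rows, schema=df.schema)
df_same_schema.printSchema()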
2. Displaying Records
Method 1: .show()
Displays the first 20 rows by default.
df.show()
Method 2: Show Specific Number of Rows
df.show(5)
Method 3: Show Full Column Content (No Truncation)
df.show(truncate=False)
3. Filtering & Displaying Records with Multiple Conditions
Using .filter()
with Multiple Conditions
df_filtered = df.filter((df["Age"] > 30) & (df["Salary"] > 50000))
df_filtered.show()
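The same pattern extends to the other boolean operators; a short sketch using | (OR) and isin() with illustrative values:
from pyspark.sql.functions import col
df_or = df.filter((col("Age") < 30) | (col("Department").isin("HR", "IT")))
df_or.show()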
Using .where() (Same as .filter())
df_filtered = df.where((df["Age"] > 30) & (df["Department"] == "IT"))
df_filtered.show()
Using SQL Expressions (expr())
from pyspark.sql.functions import expr
df_filtered = df.filter(expr("Age > 30 AND Salary > 50000"))
df_filtered.show()
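filter() and where() also accept a raw SQL string directly, so the expr() wrapper is optional here:
df_filtered = df.filter("Age > 30 AND Salary > 50000")
df_filtered.show()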
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, sum, avg, count, explode, row_number, array
from pyspark.sql.window import Window
# Initialize Spark Session
spark = SparkSession.builder.appName("PySpark_Useful_Functions").getOrCreate()
# Sample Data
data = [(1, "Alice", 25, "HR", 50000),
        (2, "Bob", 30, "IT", 60000),
        (3, "Charlie", 35, "IT", 70000),
        (4, "David", 40, "Finance", 80000),
        (5, "Eve", 45, "Finance", 90000)]
columns = ["ID", "Name", "Age", "Department", "Salary"]
df = spark.createDataFrame(data, columns)
# 1. Alias (Renaming Columns Temporarily)
df_alias = df.select(col("Name").alias("Full_Name"), col("Age"))
df_alias.show()
# 2. Distinct (Remove Duplicates)
df_distinct = df.select("Department").distinct()
df_distinct.show()
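# Sketch (not in the original script): dropDuplicates() removes duplicate
# rows, optionally considering only a subset of columns.
df_dedup = df.dropDuplicates(["Department"])
df_dedup.show()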
# 3. Filtering Data
df_filtered = df.filter((df["Age"] > 30) & (df["Department"] == "IT"))
df_filtered.show()
# 4. Adding & Modifying Columns
df_new = df.withColumn("New_Column", lit("DefaultValue"))
df_casted = df.withColumn("Salary", df["Salary"].cast("double"))
df_new.show()
df_casted.printSchema()
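# Sketch (not in the original script): withColumnRenamed permanently
# renames a column, unlike the temporary alias() used in step 1.
df_renamed = df.withColumnRenamed("Name", "Employee_Name")
df_renamed.printSchema()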
# 5. Aggregations (Sum, Average, Count)
df_grouped = df.groupBy("Department").agg(
    sum("Salary").alias("Total_Salary"),
    avg("Age").alias("Average_Age"),
    count("ID").alias("Employee_Count")
)
df_grouped.show()
# 6. Sorting
df_sorted = df.orderBy("Age", ascending=False)
df_sorted.show()
# 7. Joining DataFrames
extra_data = [(1, "US"), (2, "Canada"), (3, "UK"), (4, "Germany"), (5, "India")]
columns_extra = ["ID", "Country"]
df_extra = spark.createDataFrame(extra_data, columns_extra)
df_joined = df.join(df_extra, "ID", "inner")
df_joined.show()
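# Sketch (not in the original script): a left join keeps every row of df,
# filling Country with null where no matching ID exists in df_extra.
df_left = df.join(df_extra, "ID", "left")
df_left.show()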
# 8. Exploding Nested Data
df_nested = df.withColumn("Hobbies", array(lit("Reading"), lit("Sports")))
df_exploded = df_nested.withColumn("Hobby", explode(col("Hobbies")))
df_exploded.show()
# 9. Collecting Rows
rows = df.collect()
for row in rows:
    print(row)
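# Sketch (not in the original script): for large DataFrames, take(n)
# retrieves only the first n rows instead of materializing every row
# on the driver the way collect() does.
print(df.take(2))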
# 10. Row Numbering & Ranking
windowSpec = Window.partitionBy("Department").orderBy("Salary")
df_ranked = df.withColumn("Rank", row_number().over(windowSpec))
df_ranked.show()
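# Sketch (not in the original script): dense_rank() is a related window
# function that leaves no gaps in the ranking when salaries tie.
from pyspark.sql.functions import dense_rank
df_dense = df.withColumn("Dense_Rank", dense_rank().over(windowSpec))
df_dense.show()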
# 11. Checking Schema & Data Types
df.printSchema()
print(df.dtypes)
print(df.schema)
# 12. Displaying Data with .show() Options
# Default show (20 rows, truncated at 20 characters)
df.show()
# Show a specific number of rows
df.show(5)
# Prevent truncation (full column display)
df.show(truncate=False)
# Adjust truncation length
df.show(truncate=50)
# Show data in vertical format
df.show(n=5, vertical=True)
# Combine options
df.show(n=10, truncate=40, vertical=True)
# Stop Spark Session
spark.stop()
The .show() method in PySpark has multiple options to customize how the DataFrame is displayed. Here's a breakdown:
1. Basic Usage
df.show()
- Displays the first 20 rows by default.
- Truncates long strings to 20 characters.
2. Show a Specific Number of Rows
df.show(5) # Show first 5 rows
- Limits the output to 5 rows.
3. Prevent Truncation (Full Column Display)
df.show(truncate=False)
- By default, columns are truncated to 20 characters.
- Setting truncate=False displays the full content of each column.
Example:
data = [(1, "A very long text that should not be truncated"),
        (2, "Short text")]
df = spark.createDataFrame(data, ["ID", "Description"])
df.show(truncate=False)
Output:
+---+---------------------------------------------+
|ID |Description                                  |
+---+---------------------------------------------+
|1  |A very long text that should not be truncated|
|2  |Short text                                   |
+---+---------------------------------------------+
4. Adjust Column Truncation Length
df.show(truncate=50)
- Instead of truncating at 20 characters, this truncates at 50 characters.
5. Display Rows Vertically (vertical=True)
df.show(n=5, vertical=True)
- Displays each row vertically, which is useful when there are too many columns.
Example Output:
-RECORD 0------------------------
ID | 1
Name | Alice
Age | 25
Department | HR
Salary | 50000.0
-RECORD 1------------------------
ID | 2
Name | Bob
Age | 30
Department | IT
Salary | 60000.0
6. Combine Options
df.show(n=10, truncate=40, vertical=True)
- Shows 10 rows
- Truncates column data at 40 characters
- Displays data in vertical format
Summary of .show() Options
| Option | Description | Example |
|---|---|---|
| df.show() | Default (20 rows, truncates long values at 20 chars) | df.show() |
| df.show(n) | Displays first n rows | df.show(5) |
| df.show(truncate=False) | Displays full column values without truncation | df.show(truncate=False) |
| df.show(truncate=n) | Truncates column values at n characters | df.show(truncate=50) |
| df.show(vertical=True) | Displays rows vertically instead of in tabular format | df.show(vertical=True) |
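Note that .show() only prints rows and returns None. If you need the first n rows back as a DataFrame for further processing, a minimal sketch using .limit() (not covered in the table above) is:
df_top5 = df.limit(5)  # returns a new DataFrame; evaluation stays lazy
df_top5.show()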