Viewing Data in PySpark

When working with PySpark DataFrames, it’s essential to explore and understand the data before transforming it. Below are the most common methods for viewing data in PySpark, with examples and use cases.


1. show(): View the First Few Rows

Basic Usage

The show() method displays the first 20 rows by default.

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

data = [('Alice', 25), ('Bob', 30), ('Charlie', 35)]
columns = ['Name', 'Age']

sdf = spark.createDataFrame(data, columns)

# View the first 20 rows (default)
sdf.show()

# View up to the first 10 rows (sdf has only 3)
sdf.show(10)

Use Case

  • Quickly inspect the structure and values of the dataset.
  • Useful when dealing with large datasets to get a glimpse of the data.

2. show() with truncate: Control Column Width

Control String Truncation

By default, show() truncates strings to 20 characters. You can control this behavior with the truncate parameter.

# Truncate long strings to 20 characters (default)
sdf.show(truncate=True)

# Do not truncate strings
sdf.show(truncate=False)

# Truncate strings to a specific length (e.g., 5 characters)
sdf.show(truncate=5)
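The sample sdf above contains only short values, so truncation has no visible effect. A minimal sketch with longer, made-up strings (long_data and long_sdf are illustrative names, not part of the original example) makes the difference clear:

# Hypothetical DataFrame with long strings to make truncation visible
long_data = [
    ('Alice', 'Maintains the nightly ETL pipelines for the sales warehouse'),
    ('Bob', 'Builds dashboards for quarterly revenue reporting'),
]
long_sdf = spark.createDataFrame(long_data, ['Name', 'Description'])

long_sdf.show()                # Description is cut off after 20 characters
long_sdf.show(truncate=False)  # full strings, wider output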

Use Case

  • Helpful when columns contain long strings, such as descriptions or logs.
  • Avoids cluttering the output while inspecting data.

3. show() with vertical: Vertical Display of Rows

Display Rows Vertically

For better readability of rows with many columns or long values, use the vertical parameter.

# Vertical display of rows
sdf.show(vertical=True)
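For the three-row sdf above, the vertical output looks roughly like this (exact spacing varies by Spark version):

-RECORD 0--------
 Name | Alice
 Age  | 25
-RECORD 1--------
 Name | Bob
 Age  | 30
-RECORD 2--------
 Name | Charlie
 Age  | 35

The vertical parameter combines with the others, e.g. sdf.show(n=2, truncate=10, vertical=True).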

Use Case

  • Inspect datasets with many columns where horizontal display becomes impractical.

4. printSchema(): Print the Schema of the DataFrame

Inspect Schema

Displays the schema of the DataFrame, including column names, data types, and nullability.

# Print the schema
sdf.printSchema()

Example Output

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
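When a script needs the schema as data rather than printed text, the standard columns, dtypes, and schema attributes of a DataFrame expose the same information:

# Programmatic alternatives to printSchema()
print(sdf.columns)  # ['Name', 'Age']
print(sdf.dtypes)   # [('Name', 'string'), ('Age', 'bigint')]
print(sdf.schema)   # full StructType, including nullability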

Use Case

  • Understand the structure of the dataset.
  • Verify data types and nullability before performing transformations.

5. describe(): Summary Statistics

Get Summary Statistics

Generates basic statistical summaries such as count, mean, stddev, min, and max.

# Summary statistics of DataFrame
sdf.describe().show()
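describe() also accepts column names to limit the summary. For the three-row sdf above, the Age summary looks roughly like this (string columns, if included, get null mean and stddev):

# Summary statistics for a single column
sdf.describe('Age').show()

# +-------+----+
# |summary| Age|
# +-------+----+
# |  count|   3|
# |   mean|30.0|
# | stddev| 5.0|
# |    min|  25|
# |    max|  35|
# +-------+----+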

Use Case

  • Analyze numerical columns.
  • Quickly assess the range and distribution of data.

6. head(): Retrieve the First Row or n Rows

Retrieve Rows

Returns the first row as a Row object, or a list of the first n Row objects when n is given.

# Retrieve the first row
print(sdf.head())

# Retrieve the first 5 rows
print(sdf.head(5))
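The returned Row objects support both attribute-style and key-style access, and head() returns None on an empty DataFrame. A quick sketch using the sdf defined earlier:

first = sdf.head()                   # Row(Name='Alice', Age=25)
if first is not None:                # head() yields None when the DataFrame is empty
    print(first.Name, first['Age'])  # Alice 25
    print(first.asDict())            # {'Name': 'Alice', 'Age': 25}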

Use Case

  • Useful for debugging and testing logic on a smaller sample.

7. take(): Retrieve the First n Rows

Retrieve Rows as a List

Returns the first n rows as a list of Row objects.

# Retrieve the first 5 rows as a list
rows = sdf.take(5)
for row in rows:
    print(row)
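In PySpark, take(n) is effectively limit(n).collect(), so only n rows reach the driver. If you want the sample as a DataFrame rather than a list, limit() alone does that:

# Equivalent to sdf.take(2), but keeps the result as a DataFrame
sample_df = sdf.limit(2)
sample_df.show()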

Use Case

  • Retrieve rows without collecting the entire dataset into memory.

8. collect(): Retrieve All Rows

Retrieve All Data

Fetches all rows as a list of Row objects.

# Retrieve all rows
all_rows = sdf.collect()
for row in all_rows:
    print(row)

Use Case

  • Fetch the entire dataset when it is small enough to fit in the driver’s memory.
  • Be cautious with large datasets: collect() can exhaust driver memory.
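One way to act on that caution is to gate collect() behind a row-count check; a minimal sketch (the 10,000-row threshold is arbitrary):

# Collect only when the dataset is known to be small
MAX_ROWS = 10000  # illustrative limit, tune for your driver memory
if sdf.count() <= MAX_ROWS:
    all_rows = sdf.collect()
else:
    all_rows = sdf.take(MAX_ROWS)  # fall back to a bounded sample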

Comparison of Methods

| Method        | Description                                                    | Best Use Case                                                 |
|---------------|----------------------------------------------------------------|---------------------------------------------------------------|
| show()        | Display rows with options for truncation and vertical format.  | Quickly inspect data visually.                                |
| printSchema() | Print the DataFrame schema.                                    | Verify column names, data types, and nullability.            |
| describe()    | Get summary statistics.                                        | Analyze numerical columns’ distribution.                      |
| head()        | Retrieve the first n rows.                                     | Debug or test logic on a small sample.                        |
| take()        | Fetch n rows as a list.                                        | Retrieve rows for iterative operations without a full collect. |
| collect()     | Fetch all rows as a list.                                      | Perform operations on datasets that fit in memory.           |

Practical Use Cases

  1. Exploratory Data Analysis (EDA):
    • Use show() and describe() to understand the data distribution.
  2. Schema Validation:
    • Use printSchema() to validate column data types before transformations.
  3. Debugging:
    • Fetch specific rows with head() or take() to test logic.
  4. Data Export:
    • Use collect() to gather the entire dataset for export or further processing.
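Put together, a typical first look at a new DataFrame chains several of these calls; a minimal sketch using the sdf from earlier:

# A typical first-look workflow
sdf.printSchema()            # 1. verify column names and types
sdf.show(5, truncate=False)  # 2. eyeball a handful of rows
sdf.describe().show()        # 3. check ranges of numeric columns
sample = sdf.take(3)         # 4. pull a few Row objects for tests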

These methods provide flexible options for exploring and debugging PySpark DataFrames effectively.


