PySpark DataFrame API and PySpark SQL: A Comprehensive Guide
PySpark provides two powerful APIs for working with structured data: the DataFrame API and the Spark SQL API. Both APIs are essential for data analysis and processing in PySpark, offering flexibility and performance for handling large-scale datasets. Let’s explore these APIs, their use cases, and how they complement each other.
PySpark DataFrame API
The DataFrame API is a programmatic and Pythonic way to work with structured data in PySpark. It enables various transformations, actions, and aggregations directly on DataFrames.
Key Features of DataFrame API
- High-level abstraction: DataFrames represent distributed collections of rows with named columns.
- Rich operations: Support for selection, filtering, grouping, aggregation, joining, and transformations.
- Integration: Seamlessly works with Spark SQL and other Spark libraries.
Creating DataFrames
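The examples below assume an active SparkSession bound to the name spark, as is created automatically in the pyspark shell and most notebooks; for a standalone script, a minimal setup (the application name here is an arbitrary label) might look like:
from pyspark.sql import SparkSession
# Create or reuse a local SparkSession
spark = SparkSession.builder.appName("DataFrame Examples").getOrCreate()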
- From Lists or Rows:
from pyspark.sql import Row
from datetime import datetime, date
df = spark.createDataFrame([
Row(a=1, b=2.0, c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
Row(a=2, b=3.0, c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0))
])
df.show()
- With Explicit Schema:
df = spark.createDataFrame([
(1, 2.0, 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
(2, 3.0, 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0))
], schema="a long, b double, c string, d date, e timestamp")
df.show()
- From Pandas DataFrame:
import pandas as pd
pandas_df = pd.DataFrame({
'a': [1, 2],
'b': [2.0, 3.0],
'c': ['string1', 'string2'],
'd': [date(2000, 1, 1), date(2000, 2, 1)],
'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0)]
})
df = spark.createDataFrame(pandas_df)
df.show()
- From an RDD:
rdd = spark.sparkContext.parallelize([
(1, 2.0, 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
(2, 3.0, 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0))
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df.show()
DataFrame Operations
- Selection:
df.select("a", "c").show()
- Filtering:
df.filter(df["b"] > 2.5).show()
- Aggregation:
from pyspark.sql import functions as F
df.groupBy("c").agg(F.sum("b").alias("sum_b")).show()
- Sorting:
df.sort(df["b"].desc()).show()
- Adding Columns:
df.withColumn("new_col", df["b"] * 2).show()
PySpark SQL API
The Spark SQL API lets you run SQL queries against DataFrames that have been registered as temporary views. It is particularly useful for users with SQL expertise.
Key Features of Spark SQL API
- SQL syntax: Familiar SQL-like interface for querying structured data.
- Interoperability: Easily switch between DataFrame API and SQL queries.
- Catalog integration: Query external sources such as Hive tables or JDBC databases (see the sketch after this list).
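As a sketch of catalog integration, the snippet below loads a JDBC source, registers it as a temporary view, and queries it with SQL; the connection URL, table name, and credentials are placeholders, not working values:
jdbc_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/dbname")  # placeholder connection string
    .option("dbtable", "public.orders")                   # placeholder table
    .option("user", "username")                           # placeholder credentials
    .option("password", "password")
    .load())
jdbc_df.createOrReplaceTempView("orders")
spark.sql("SELECT COUNT(*) AS order_count FROM orders").show()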
Running SQL Queries
- SparkSession for SQL:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL Example").getOrCreate()
- Registering DataFrame as Table:
df.createOrReplaceTempView("my_table")
- Executing SQL Queries:
spark.sql("SELECT a, b FROM my_table WHERE b > 2.5").show()
- Complex Queries:
query = """ SELECT c, AVG(b) as avg_b FROM my_table GROUP BY c ORDER BY avg_b DESC """ spark.sql(query).show()
Combining DataFrame API and Spark SQL API
You can use both APIs together to leverage their strengths. For example:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("Combined Example").getOrCreate()
# Create a DataFrame
data = [("Alice", 3000), ("Bob", 4000), ("Cathy", 5000)]
columns = ["Name", "Salary"]
df = spark.createDataFrame(data, schema=columns)
# DataFrame API
df = df.withColumn("Bonus", df["Salary"] * 0.1)
# Register as a SQL table
df.createOrReplaceTempView("employees")
# Spark SQL API
result_df = spark.sql("SELECT Name, Salary, Bonus, Salary + Bonus AS TotalCompensation FROM employees")
result_df.show()
# Stop Spark session
spark.stop()
When to Use Each API
- DataFrame API: Ideal for programmatic data manipulation and chaining transformations, as in the chained example after this list.
- Spark SQL API: Best for users familiar with SQL or for expressing complex queries in a concise manner.
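As an illustration of chaining, the individual operations shown earlier compose into a single expression; this sketch assumes the DataFrame df with columns a, b, c from the Creating DataFrames section and an active SparkSession:
(df.filter(df["b"] > 2.0)                  # keep rows where b exceeds 2.0
   .withColumn("b_doubled", df["b"] * 2)   # derive a new column
   .select("a", "c", "b_doubled")          # project only the needed columns
   .sort("b_doubled", ascending=False)     # order by the derived column
   .show())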
This flexibility ensures you can choose the right tool for each task, making PySpark a robust framework for large-scale data processing.