Pandas vs. PySpark: Basic Operations
Loading Data
Creating DataFrames from various data sources is a common task in both Pandas and PySpark. Below are examples of how to create DataFrames from CSV files, JSON files, SQL databases, and directly from in-memory Python objects (dictionaries and lists of tuples) in both libraries.
1. From CSV File
Pandas:
import pandas as pd
# Create DataFrame from CSV file
pdf = pd.read_csv('path/to/your/file.csv')
print(pdf)
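Pandas infers column types while reading. For large files you can limit the columns read and pin their dtypes up front; a minimal sketch, where the column names id and amount are hypothetical placeholders for your file's actual columns:
import pandas as pd
# Read only the columns you need and fix their dtypes (id and amount are placeholder names)
pdf = pd.read_csv('path/to/your/file.csv', usecols=['id', 'amount'], dtype={'id': 'int64', 'amount': 'float64'})
print(pdf)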
PySpark:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Create DataFrame from CSV file
sdf = spark.read.csv('path/to/your/file.csv', header=True, inferSchema=True)
sdf.show()
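With inferSchema=True, Spark makes an extra pass over the file to guess column types. For repeatable jobs you can supply an explicit schema instead; below is a minimal sketch in which the columns id, name, and amount are hypothetical placeholders for your file's actual columns.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
# Hypothetical schema -- replace the fields with the columns in your file
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True)
])
# An explicit schema avoids the extra pass that inferSchema=True requires
sdf = spark.read.csv('path/to/your/file.csv', header=True, schema=schema)
sdf.show()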
2. From JSON File
Pandas:
import pandas as pd
# Create DataFrame from JSON file
pdf = pd.read_json('path/to/your/file.json')
print(pdf)
PySpark:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Create DataFrame from JSON file
sdf = spark.read.json('path/to/your/file.json')
sdf.show()
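One caveat: spark.read.json expects JSON Lines (one object per line) by default, while pd.read_json expects a single JSON document. Both readers can be switched to the other format; a minimal sketch, assuming the same file path as above:
# Pandas: read a JSON Lines file (one JSON object per line)
pdf = pd.read_json('path/to/your/file.json', lines=True)
# PySpark: read a single multi-line JSON document or array
sdf = spark.read.option("multiLine", True).json('path/to/your/file.json')
sdf.show()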
3. From SQL Database
Reading from a SQL database (Oracle, in this example) requires the appropriate driver: a JDBC driver for PySpark, and the cx_Oracle driver with SQLAlchemy for Pandas. Below are the steps and code examples for both.
1. PySpark: Reading from Oracle Database
Pre-requisites:
- Install the Oracle JDBC driver and make sure it is accessible to your PySpark session.
- Place the driver .jar file in a location accessible by the Spark cluster (e.g., the spark/jars directory).
Code Example:
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("ReadFromOracle") \
    .config("spark.jars", "/path/to/ojdbc8.jar") \
    .getOrCreate()
# Oracle database credentials
oracle_url = "jdbc:oracle:thin:@<host>:<port>/<service>"
oracle_properties = {
    "user": "your_username",
    "password": "your_password",
    "driver": "oracle.jdbc.OracleDriver"
}
# Query or table to load
query = "(SELECT * FROM your_table WHERE rownum <= 1000) AS subquery"
# Read data from Oracle
df = spark.read \
    .format("jdbc") \
    .option("url", oracle_url) \
    .option("dbtable", query) \
    .option("user", oracle_properties["user"]) \
    .option("password", oracle_properties["password"]) \
    .option("driver", oracle_properties["driver"]) \
    .load()
df.show()
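By default this JDBC read runs as a single task on one executor. Spark can parallelize it by splitting the read on a numeric column; the sketch below assumes your_table has a numeric ID column whose values fall roughly between 1 and 1,000,000.
# Partitioned read: Spark issues one query per partition over ranges of ID
df_parallel = spark.read \
    .format("jdbc") \
    .option("url", oracle_url) \
    .option("dbtable", "your_table") \
    .option("user", oracle_properties["user"]) \
    .option("password", oracle_properties["password"]) \
    .option("driver", oracle_properties["driver"]) \
    .option("partitionColumn", "ID") \
    .option("lowerBound", 1) \
    .option("upperBound", 1000000) \
    .option("numPartitions", 8) \
    .load()
df_parallel.show()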
2. Pandas: Reading from Oracle Database
Pre-requisites:
- Install the required libraries:
pip install cx_Oracle sqlalchemy pandas
Code Example:
import pandas as pd
from sqlalchemy import create_engine
# Oracle database credentials
oracle_host = "<host>"
oracle_port = "<port>"
oracle_service = "<service>"
oracle_user = "your_username"
oracle_password = "your_password"
# Create the connection string
oracle_connection_string = f"oracle+cx_oracle://{oracle_user}:{oracle_password}@{oracle_host}:{oracle_port}/?service_name={oracle_service}"
engine = create_engine(oracle_connection_string)
# Query to fetch data
query = "SELECT * FROM your_table WHERE ROWNUM <= 1000"
# Load data into a Pandas DataFrame
df = pd.read_sql(query, engine)
# Display data
print(df.head())
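Because Pandas materializes the whole result set in memory, a large table can exhaust RAM. read_sql can stream the result in chunks instead; a minimal sketch reusing the engine and query defined above:
# Stream the result in chunks of 10,000 rows instead of loading everything at once
for chunk in pd.read_sql(query, engine, chunksize=10000):
    # Each chunk is an ordinary DataFrame; process or aggregate it here
    print(len(chunk))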
3. Key Differences: PySpark vs. Pandas
| Feature | PySpark | Pandas |
|---|---|---|
| Use Case | Large-scale, distributed data handling | Smaller datasets that fit in memory |
| Performance | Optimized for big data, distributed | Single-threaded, memory-bound |
| Setup Complexity | Requires Spark setup and JDBC drivers | Simpler setup with cx_Oracle |
| Integration | Scalable ETL workflows | Lightweight, ad-hoc queries |
4. From Python Dictionary
You can also build DataFrames directly from in-memory Python structures, such as dictionaries or lists of tuples, in both PySpark and Pandas. Here's how:
1. Using PySpark
a) Reading a Dictionary
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder.appName("ReadFromDictionary").getOrCreate()
# Example dictionary
data = [
    {"name": "Alice", "age": 25, "city": "New York"},
    {"name": "Bob", "age": 30, "city": "Los Angeles"},
    {"name": "Charlie", "age": 35, "city": "Chicago"}
]
# Create DataFrame
df = spark.createDataFrame(data)
df.show()
Output:
+-------+---+-----------+
|   name|age|       city|
+-------+---+-----------+
|  Alice| 25|   New York|
|    Bob| 30|Los Angeles|
|Charlie| 35|    Chicago|
+-------+---+-----------+
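Note that recent PySpark versions may warn that inferring a schema from dict objects is deprecated and suggest Row objects instead. A minimal sketch of the same data using Row:
from pyspark.sql import Row
# Build Row objects instead of plain dicts
rows = [
    Row(name="Alice", age=25, city="New York"),
    Row(name="Bob", age=30, city="Los Angeles"),
    Row(name="Charlie", age=35, city="Chicago")
]
df = spark.createDataFrame(rows)
df.show()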
b) Reading a List of Tuples
# Example list of tuples
data = [("Alice", 25, "New York"), ("Bob", 30, "Los Angeles"), ("Charlie", 35, "Chicago")]
# Define schema
columns = ["name", "age", "city"]
# Create DataFrame
df = spark.createDataFrame(data, schema=columns)
df.show()
Output:
+-------+---+-----------+
|   name|age|       city|
+-------+---+-----------+
|  Alice| 25|   New York|
|    Bob| 30|Los Angeles|
|Charlie| 35|    Chicago|
+-------+---+-----------+
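Passing just a list of column names lets Spark infer the types from the data. For explicit control over the types, the schema can also be given as a type string; a minimal sketch:
# Schema as a type string: names plus types, no inference
df = spark.createDataFrame(data, schema="name: string, age: int, city: string")
df.printSchema()
df.show()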
2. Using Python (Pandas)
a) Reading a Dictionary
import pandas as pd
# Example dictionary
data = [
    {"name": "Alice", "age": 25, "city": "New York"},
    {"name": "Bob", "age": 30, "city": "Los Angeles"},
    {"name": "Charlie", "age": 35, "city": "Chicago"}
]
# Create DataFrame
df = pd.DataFrame(data)
print(df)
Output:
      name  age         city
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
b) Reading a List of Tuples
# Example list of tuples
data = [("Alice", 25, "New York"), ("Bob", 30, "Los Angeles"), ("Charlie", 35, "Chicago")]
# Define column names
columns = ["name", "age", "city"]
# Create DataFrame
df = pd.DataFrame(data, columns=columns)
print(df)
Output:
      name  age         city
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
- Pandas: Directly creates DataFrames from in-memory structures, files, and databases using a straightforward, intuitive syntax. Ideal for smaller datasets and interactive data analysis.
- PySpark: Handles large datasets and distributed computing efficiently. Requires a Spark session and uses a different set of functions for loading data from various sources.
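The two APIs also interoperate: a small Spark result can be pulled down into Pandas for local analysis, and a Pandas DataFrame can be distributed back out as a Spark DataFrame. A minimal sketch (toPandas() collects every row to the driver, so it is only safe for data that fits in memory):
# Spark -> Pandas: collects the full result to the driver
pdf = sdf.toPandas()
# Pandas -> Spark: distributes the local DataFrame across the cluster
sdf2 = spark.createDataFrame(pdf)
sdf2.show()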