PySpark DataFrame groupBy Operation
The groupBy operation in PySpark groups the rows of a DataFrame by the values of one or more columns. It returns a GroupedData object, on which you then apply aggregation functions such as count, sum, avg, max, and min to produce a new DataFrame.
Syntax
DataFrame.groupBy(*cols)
*cols: One or more column names or Column expressions to group by.
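All of the following forms are equivalent ways to specify the grouping column. A minimal sketch (using the sample df built in the Examples section below):

from pyspark.sql.functions import col

grouped = df.groupBy("Department")        # by column name
grouped = df.groupBy(col("Department"))   # by Column expression
grouped = df.groupBy(df["Department"])    # by DataFrame indexing

# grouped is a GroupedData object; an aggregation turns it back into a DataFrame
grouped.count().show()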
Common Aggregation Functions
count(): Counts the number of rows in each group.
sum(column): Computes the sum of a column for each group.
avg(column): Calculates the average of a column for each group.
max(column): Finds the maximum value in a column for each group.
min(column): Finds the minimum value in a column for each group.
agg(*exprs): Applies multiple aggregation functions at once, as shown below.
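Note that count, sum, avg, max, and min in the list above are shorthand methods on GroupedData, while inside agg you use the functions of the same names from pyspark.sql.functions. A quick sketch of the two equivalent styles (again assuming the sample df from the Examples section):

# Shorthand method on GroupedData:
df.groupBy("Department").sum("Salary").show()

# Equivalent, via agg() and pyspark.sql.functions
# (aliased to avoid shadowing Python's builtin sum):
from pyspark.sql.functions import sum as sum_
df.groupBy("Department").agg(sum_("Salary")).show()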
Examples
Sample DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("GroupByExamples").getOrCreate()

data = [
    ("Alice", "Sales", 3000),
    ("Bob", "Sales", 4000),
    ("Alice", "Sales", 4000),
    ("Catherine", "HR", 4000),
    ("David", "HR", 5000),
    ("Eve", "IT", 3000)
]
columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)
df.show()
Output:
+---------+----------+------+
|     Name|Department|Salary|
+---------+----------+------+
|    Alice|     Sales|  3000|
|      Bob|     Sales|  4000|
|    Alice|     Sales|  4000|
|Catherine|        HR|  4000|
|    David|        HR|  5000|
|      Eve|        IT|  3000|
+---------+----------+------+
1. Group By Single Column and Count
Count the number of rows in each department:
df.groupBy("Department").count().show()
Output:
+----------+-----+
|Department|count|
+----------+-----+
|        HR|    2|
|        IT|    1|
|     Sales|    3|
+----------+-----+
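The order of grouped rows is not guaranteed; Spark may return the groups in any order. If you need deterministic output, sort explicitly, for example:

df.groupBy("Department").count().orderBy("Department").show()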
2. Group By Multiple Columns and Count
Count rows grouped by both Department and Name:
df.groupBy("Department", "Name").count().show()
Output:
+----------+---------+-----+
|Department|     Name|count|
+----------+---------+-----+
|     Sales|    Alice|    2|
|        HR|    David|    1|
|        HR|Catherine|    1|
|        IT|      Eve|    1|
|     Sales|      Bob|    1|
+----------+---------+-----+
3. Group By with Sum Aggregation
Calculate the total salary for each department:
df.groupBy("Department").sum("Salary").show()
Output:
+----------+-----------+
|Department|sum(Salary)|
+----------+-----------+
|        HR|       9000|
|        IT|       3000|
|     Sales|      11000|
+----------+-----------+
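The shorthand sum produces a column literally named sum(Salary). If you want a friendlier name without switching to agg, you can rename it afterwards, for example:

df.groupBy("Department").sum("Salary") \
    .withColumnRenamed("sum(Salary)", "Total_Salary") \
    .show()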
4. Group By with Multiple Aggregations
Perform multiple aggregations such as count, sum, average, max, and min:
# Note: these imports shadow Python's built-in sum, max, and min
from pyspark.sql.functions import count, sum, avg, max, min

df.groupBy("Department").agg(
    count("*").alias("Total_Count"),
    sum("Salary").alias("Total_Salary"),
    avg("Salary").alias("Average_Salary"),
    max("Salary").alias("Max_Salary"),
    min("Salary").alias("Min_Salary")
).show()
Output:
+----------+-----------+------------+------------------+----------+----------+
|Department|Total_Count|Total_Salary|    Average_Salary|Max_Salary|Min_Salary|
+----------+-----------+------------+------------------+----------+----------+
|        HR|          2|        9000|            4500.0|      5000|      4000|
|        IT|          1|        3000|            3000.0|      3000|      3000|
|     Sales|          3|       11000|3666.6666666666665|      4000|      3000|
+----------+-----------+------------+------------------+----------+----------+
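As the output shows, avg returns a full-precision double. To display a rounded average such as 3666.67, wrap it in round from pyspark.sql.functions, for example:

from pyspark.sql.functions import avg, round as round_

df.groupBy("Department").agg(
    round_(avg("Salary"), 2).alias("Average_Salary")
).show()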
5. Using the agg Method with Expressions
The agg method also supports SQL-like expression strings for aggregation:
from pyspark.sql.functions import expr
df.groupBy("Department").agg(
    expr("count(*) as Total_Count"),
    expr("sum(Salary) as Total_Salary"),
    expr("avg(Salary) as Average_Salary"),
    expr("max(Salary) as Max_Salary"),
    expr("min(Salary) as Min_Salary")
).show()
Output: Same as above.
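agg also accepts a plain dictionary mapping column names to aggregation function names, which is handy for quick summaries (the resulting columns get default names such as avg(Salary)):

df.groupBy("Department").agg({"Salary": "avg", "*": "count"}).show()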
6. Group By with Having Clause
Filter groups based on aggregated results (e.g., total salary > 5000):
from pyspark.sql.functions import col, sum

df.groupBy("Department").agg(
    sum("Salary").alias("Total_Salary")
).filter(col("Total_Salary") > 5000).show()
Output:
+----------+------------+
|Department|Total_Salary|
+----------+------------+
|        HR|        9000|
|     Sales|       11000|
+----------+------------+
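The same result can be written as literal SQL with a HAVING clause by registering the DataFrame as a temporary view (the view name employees below is arbitrary):

df.createOrReplaceTempView("employees")
spark.sql("""
    SELECT Department, SUM(Salary) AS Total_Salary
    FROM employees
    GROUP BY Department
    HAVING SUM(Salary) > 5000
""").show()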
Additional Use Cases
7. Aggregation with Window Functions
Window functions complement groupBy: instead of collapsing each group to one row, they compute per-group values while keeping every row. Rank employees by salary within each department:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window_spec = Window.partitionBy("Department").orderBy("Salary")
df.withColumn("Rank", row_number().over(window_spec)).show()
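Building on the window specification above, a common pattern is to keep only the top-ranked row per group, for example the highest-paid employee in each department. A sketch:

from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

# Rank by descending salary, then keep only rank 1 per department
top_spec = Window.partitionBy("Department").orderBy(col("Salary").desc())
df.withColumn("Rank", row_number().over(top_spec)) \
    .filter(col("Rank") == 1) \
    .drop("Rank") \
    .show()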
8. Custom Aggregation
You can implement custom aggregation logic with a Python UDF. A plain UDF cannot be used directly as an aggregate function, so one workaround is to collect each group's values into an array with collect_list and apply the UDF to that array:

from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import DoubleType
import builtins

def custom_avg(salaries):
    # builtins.sum avoids clashing with pyspark.sql.functions.sum
    return builtins.sum(salaries) / len(salaries)

custom_avg_udf = udf(custom_avg, DoubleType())
df.groupBy("Department").agg(
    custom_avg_udf(collect_list("Salary")).alias("Custom_Average")
).show()
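On Spark 3.x with pyarrow installed, a grouped-aggregate pandas UDF is usually a faster alternative to the collect_list workaround, since it receives each group's values directly as a pandas Series. A sketch under those assumptions:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def pandas_avg(salaries: pd.Series) -> float:
    # Series-to-scalar pandas UDF: returns one value per group
    return float(salaries.mean())

df.groupBy("Department").agg(pandas_avg("Salary").alias("Custom_Average")).show()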
These examples illustrate how groupBy can be used in diverse ways to aggregate and analyze data efficiently in PySpark.