PySpark DataFrame groupBy Operation

The groupBy operation in PySpark groups the rows of a DataFrame by one or more columns and returns a GroupedData object, on which you can apply aggregation functions such as count, sum, avg, max, and min.


Syntax

DataFrame.groupBy(*cols)
  • *cols: One or more column names or expressions.
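
Column arguments can be passed either as plain name strings or as Column expressions. A minimal sketch of both forms, assuming the sample DataFrame df created in the Examples section below:

from pyspark.sql.functions import col

# Group by a column name string ...
df.groupBy("Department").count()

# ... or by an equivalent Column expression
df.groupBy(col("Department")).count()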

Common Aggregation Functions

  • count(): Counts the number of rows in each group.
  • sum(column): Computes the sum of a column for each group.
  • avg(column): Calculates the average of a column for each group.
  • max(column): Finds the maximum value in a column for each group.
  • min(column): Finds the minimum value in a column for each group.
  • agg(*exprs): Applies multiple aggregation functions.
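
The shortcuts count(), sum(), avg(), max(), and min() are methods on the GroupedData object that groupBy returns, while agg takes aggregate expressions built from pyspark.sql.functions. A minimal sketch of the two styles, again using the sample DataFrame defined below:

from pyspark.sql import functions as F

# Shortcut method on the GroupedData object
df.groupBy("Department").sum("Salary")

# Equivalent expression-based form via agg, which also lets you rename the result
df.groupBy("Department").agg(F.sum("Salary").alias("Total_Salary"))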

Examples

Sample DataFrame

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name is arbitrary
spark = SparkSession.builder.appName("GroupByExamples").getOrCreate()

data = [
    ("Alice", "Sales", 3000),
    ("Bob", "Sales", 4000),
    ("Alice", "Sales", 4000),
    ("Catherine", "HR", 4000),
    ("David", "HR", 5000),
    ("Eve", "IT", 3000)
]

columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)
df.show()

Output:

+----------+-----------+------+
|      Name| Department|Salary|
+----------+-----------+------+
|     Alice|      Sales|  3000|
|       Bob|      Sales|  4000|
|     Alice|      Sales|  4000|
| Catherine|         HR|  4000|
|     David|         HR|  5000|
|       Eve|         IT|  3000|
+----------+-----------+------+

1. Group By Single Column and Count

Count the number of rows in each department:

df.groupBy("Department").count().show()

Output:

+----------+-----+
|Department|count|
+----------+-----+
|        HR|    2|
|        IT|    1|
|     Sales|    3|
+----------+-----+
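
The row order of a grouped result is not guaranteed; add an explicit orderBy if you need a deterministic ordering, for example sorting departments by descending count:

# Sort the grouped counts from largest to smallest
df.groupBy("Department").count().orderBy("count", ascending=False).show()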

2. Group By Multiple Columns and Count

Count rows grouped by both Department and Name:

df.groupBy("Department", "Name").count().show()

Output:

+----------+----------+-----+
|Department|      Name|count|
+----------+----------+-----+
|     Sales|     Alice|    2|
|        HR|     David|    1|
|        HR| Catherine|    1|
|        IT|       Eve|    1|
|     Sales|       Bob|    1|
+----------+----------+-----+

3. Group By with Sum Aggregation

Calculate the total salary for each department:

df.groupBy("Department").sum("Salary").show()

Output:

+----------+-----------+
|Department|sum(Salary)|
+----------+-----------+
|        HR|       9000|
|       IT|       3000|
|     Sales|      11000|
+----------+-----------+
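
The shortcut methods generate column names such as sum(Salary); to control the name, rename the column afterwards or use agg with alias as in the next example:

# Rename the auto-generated aggregate column
(df.groupBy("Department")
   .sum("Salary")
   .withColumnRenamed("sum(Salary)", "Total_Salary")
   .show())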

4. Group By with Multiple Aggregations

Perform multiple aggregations such as count, sum, average, max, and min:

df.groupBy("Department").agg(
    count("*").alias("Total_Count"),
    sum("Salary").alias("Total_Salary"),
    avg("Salary").alias("Average_Salary"),
    max("Salary").alias("Max_Salary"),
    min("Salary").alias("Min_Salary")
).show()

Output:

+----------+-----------+------------+------------------+----------+----------+
|Department|Total_Count|Total_Salary|    Average_Salary|Max_Salary|Min_Salary|
+----------+-----------+------------+------------------+----------+----------+
|        HR|          2|        9000|            4500.0|      5000|      4000|
|        IT|          1|        3000|            3000.0|      3000|      3000|
|     Sales|          3|       11000|3666.6666666666665|      4000|      3000|
+----------+-----------+------------+------------------+----------+----------+
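
avg returns a double at full precision; wrap it in round if you prefer a fixed number of decimal places, for example:

from pyspark.sql.functions import avg, round as spark_round

# Round the average salary to two decimal places
df.groupBy("Department").agg(
    spark_round(avg("Salary"), 2).alias("Average_Salary")
).show()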

5. Using agg Method with Expressions

The agg method supports SQL-like expressions for aggregation:

from pyspark.sql.functions import expr

df.groupBy("Department").agg(
    expr("count(*) as Total_Count"),
    expr("sum(Salary) as Total_Salary"),
    expr("avg(Salary) as Average_Salary"),
    expr("max(Salary) as Max_Salary"),
    expr("min(Salary) as Min_Salary")
).show()

Output: Same as above.
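
agg also accepts a dictionary that maps column names to aggregate function names, which is handy for quick summaries (the output column name is generated automatically, e.g. sum(Salary)):

# Dictionary form: keys are column names, values are aggregate function names
df.groupBy("Department").agg({"Salary": "sum"}).show()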


6. Group By with Having Clause

Filter groups on their aggregated values, the DataFrame equivalent of SQL's HAVING clause (e.g., keep departments whose total salary exceeds 5000):

from pyspark.sql.functions import col, sum

df.groupBy("Department").agg(
    sum("Salary").alias("Total_Salary")
).filter(col("Total_Salary") > 5000).show()

Output:

+----------+------------+
|Department|Total_Salary|
+----------+------------+
|        HR|        9000|
|     Sales|       11000|
+----------+------------+
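
The same result can be written in SQL with an actual HAVING clause after registering the DataFrame as a temporary view (a minimal sketch; the view name employees is arbitrary):

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("employees")

spark.sql("""
    SELECT Department, SUM(Salary) AS Total_Salary
    FROM employees
    GROUP BY Department
    HAVING SUM(Salary) > 5000
""").show()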

Additional Use Cases

7. Aggregation with Window Functions

Window functions complement groupBy: they compute per-group values without collapsing the rows. For example, rank each employee by salary within their department:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy("Department").orderBy("Salary")
df.withColumn("Rank", row_number().over(window_spec)).show()

8. Custom Aggregation

You can implement custom aggregation logic with a UDF. A plain Python UDF is not an aggregate function, so one common approach is to collect each group's values into a list with collect_list and apply the UDF to that list:

from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import DoubleType

# The UDF receives each department's salaries as a Python list
def custom_avg(salaries):
    return sum(salaries) / len(salaries)

custom_avg_udf = udf(custom_avg, DoubleType())

df.groupBy("Department").agg(
    custom_avg_udf(collect_list("Salary")).alias("Custom_Average")
).show()

These examples illustrate how groupBy can be used in diverse ways to aggregate and analyze data efficiently in PySpark.


