Difference of scores for each subjects from Predecessor only for each subject in same class, difference of scores for each subject from successor only for each subject in same class , difference of total scores for a student from his predecessor and successor in a same class, also difference of scores from previous year for a student in SQL, Spark sql?
Let’s break down the requirements and then implement them in SQL and Spark SQL. We’ll assume we have a table with the following structure:
CREATE TABLE StudentScores (
StudentID INT,
Year INT,
Class INT,
Subject VARCHAR(50),
Score INT
);
Requirements
- Difference of scores for each subject from predecessor for each subject.
- Difference of scores for each subject from successor for each subject.
- Difference of total scores for a student from their predecessor and successor in the same class.
- Difference of scores from the previous year for a student.
SQL Implementation
Let’s assume we are working with a relational database supporting standard SQL. We’ll use window functions for some of these requirements.
1. Difference of scores for each subject from predecessor for each subject
SELECT
StudentID,
Year,
Class,
Subject,
Score,
Score - LAG(Score) OVER (PARTITION BY Class, Subject ORDER BY Year) AS DiffFromPredecessor
FROM
StudentScores;
2. Difference of scores for each subject from successor for each subject
SELECT
StudentID,
Year,
Class,
Subject,
Score,
LEAD(Score) OVER (PARTITION BY Class, Subject ORDER BY Year) - Score AS DiffFromSuccessor
FROM
StudentScores;
3. Difference of total scores for a student from their predecessor and successor in the same class
First, let’s compute the total scores per student per year:
WITH TotalScores AS (
SELECT
StudentID,
Year,
Class,
SUM(Score) AS TotalScore
FROM
StudentScores
GROUP BY
StudentID, Year, Class
)
SELECT
StudentID,
Year,
Class,
TotalScore,
TotalScore - LAG(TotalScore) OVER (PARTITION BY Class ORDER BY Year) AS DiffFromPredecessor,
LEAD(TotalScore) OVER (PARTITION BY Class ORDER BY Year) - TotalScore AS DiffFromSuccessor
FROM
TotalScores;
4. Difference of scores from the previous year for a student
SELECT
StudentID,
Year,
Class,
Subject,
Score,
Score - LAG(Score) OVER (PARTITION BY StudentID, Subject ORDER BY Year) AS DiffFromPreviousYear
FROM
StudentScores;
Spark SQL Implementation
The implementation in Spark SQL would be similar. Here’s how you can do it:
Setup
First, let’s create a DataFrame df
with the appropriate schema.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F
spark = SparkSession.builder.appName("StudentScores").getOrCreate()
data = [
(1, 2023, 10, 'Math', 85),
(1, 2022, 10, 'Math', 80),
(2, 2023, 10, 'Math', 75),
(2, 2022, 10, 'Math', 70),
(1, 2023, 10, 'Science', 90),
(1, 2022, 10, 'Science', 85),
(2, 2023, 10, 'Science', 88),
(2, 2022, 10, 'Science', 84)
]
columns = ['StudentID', 'Year', 'Class', 'Subject', 'Score']
df = spark.createDataFrame(data, columns)
df.createOrReplaceTempView("StudentScores")
1. Difference of scores for each subject from predecessor for each subject
window_spec = Window.partitionBy('Class', 'Subject').orderBy('Year')
df.withColumn('DiffFromPredecessor', F.col('Score') - F.lag('Score').over(window_spec)).show()
2. Difference of scores for each subject from successor for each subject
df.withColumn('DiffFromSuccessor', F.lead('Score').over(window_spec) - F.col('Score')).show()
3. Difference of total scores for a student from their predecessor and successor in the same class
First, compute the total scores per student per year:
total_scores = df.groupBy('StudentID', 'Year', 'Class').agg(F.sum('Score').alias('TotalScore'))
total_scores.createOrReplaceTempView("TotalScores")
window_spec_total = Window.partitionBy('Class').orderBy('Year')
total_scores.withColumn('DiffFromPredecessor', F.col('TotalScore') - F.lag('TotalScore').over(window_spec_total))
.withColumn('DiffFromSuccessor', F.lead('TotalScore').over(window_spec_total) - F.col('TotalScore'))
.show()
4. Difference of scores from the previous year for a student
window_spec_student = Window.partitionBy('StudentID', 'Subject').orderBy('Year')
df.withColumn('DiffFromPreviousYear', F.col('Score') - F.lag('Score').over(window_spec_student)).show()
These implementations use window functions to calculate the required differences in both standard SQL and Spark SQL. Adjust column names and table names as per your actual schema.