Q1. How do you find the second-highest salary for each department in SQL?
Here are multiple methods to find the second-highest salary for each department in SQL, using different approaches.
1. Using Correlated Subquery
This approach involves using a subquery to find the highest salary for each department and then excluding it.
SELECT Department, MAX(Salary) AS SecondHighestSalary
FROM Employee e1
WHERE Salary < (SELECT MAX(Salary)
                FROM Employee e2
                WHERE e1.Department = e2.Department)
GROUP BY Department;
The correlated subquery:
(SELECT MAX(Salary)
 FROM Employee e2
 WHERE e1.Department = e2.Department)
calculates the maximum salary within the department of the current outer-query row. It is re-evaluated for each department that employee e1 belongs to.
Explanation:
- Outer query (e1): grouped by Department, it is responsible for finding the second-highest salary in each department.
- Subquery (e2): finds the maximum salary in the department that the current outer-query row belongs to. The WHERE e1.Department = e2.Department clause ensures the subquery's MAX(Salary) only considers salaries within the same department as the current employee in e1.
To clarify:
- MAX(Salary) in the subquery is calculated for each department individually, not for the entire table, because the subquery is correlated with the outer query through the department condition (e1.Department = e2.Department).
- Execution flow: for each row in the outer query, the subquery finds the maximum salary for that row's department. The outer query keeps only salaries below that maximum, and MAX(Salary) over the remaining rows yields the second-highest salary.
Example:
Consider the following data in the Employee table:
EmployeeID | Department | Salary |
---|---|---|
1 | HR | 5000 |
2 | HR | 7000 |
3 | HR | 6000 |
4 | IT | 8000 |
5 | IT | 12000 |
6 | IT | 9000 |
For the HR department:
- The subquery computes the maximum salary within HR (7000) for each e1 row.
- The outer query then finds the highest salary less than 7000, which is 6000.
For the IT department:
- The subquery computes the maximum salary within IT (12000) for each e1 row.
- The outer query then finds the highest salary less than 12000, which is 9000.
The final result will show the second-highest salary per department:
Department | SecondHighestSalary |
---|---|
HR | 6000 |
IT | 9000 |
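The correlated-subquery method can be sanity-checked with an in-memory SQLite database from Python; this is a minimal sketch using the sample Employee data above (SQLite supports the same correlated-subquery pattern):

```python
import sqlite3

# Build the sample Employee table from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeID INT, Department TEXT, Salary INT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(1, "HR", 5000), (2, "HR", 7000), (3, "HR", 6000),
     (4, "IT", 8000), (5, "IT", 12000), (6, "IT", 9000)],
)

# Method 1: the correlated subquery excludes each department's maximum salary.
rows = conn.execute("""
    SELECT Department, MAX(Salary) AS SecondHighestSalary
    FROM Employee e1
    WHERE Salary < (SELECT MAX(Salary) FROM Employee e2
                    WHERE e1.Department = e2.Department)
    GROUP BY Department
""").fetchall()
print(sorted(rows))  # [('HR', 6000), ('IT', 9000)]
```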
2. Using Window Functions
Window functions provide a concise way to solve this problem. (The alias is SalaryRank rather than Rank because RANK is a reserved word in MySQL 8+ and several other dialects.)
WITH RankedSalaries AS (
    SELECT
        Department,
        Salary,
        DENSE_RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS SalaryRank
    FROM Employee
)
SELECT
    Department,
    Salary AS SecondHighestSalary
FROM RankedSalaries
WHERE SalaryRank = 2;
3. Using Common Table Expressions (CTE) and ROW_NUMBER()
This approach uses a CTE to number salaries per department and then selects the second row. Note the tie behavior: if two employees share the top salary, ROW_NUMBER() = 2 lands on the duplicated top value, while DENSE_RANK() = 2 (method 2) returns the second-highest distinct salary.
WITH RankedSalaries AS (
SELECT
Department,
Salary,
ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC) AS RowNum
FROM Employee
)
SELECT
Department,
Salary AS SecondHighestSalary
FROM RankedSalaries
WHERE RowNum = 2;
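The difference between DENSE_RANK() and ROW_NUMBER() only shows up when the top salary is tied. A minimal sketch with Python's sqlite3 (SQLite 3.25+ ships both functions; the data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Department TEXT, Salary INT)")
# Two employees tie for the top HR salary.
conn.executemany("INSERT INTO Employee VALUES (?, ?)",
                 [("HR", 7000), ("HR", 7000), ("HR", 6000)])

query = """
    SELECT Department, Salary
    FROM (SELECT Department, Salary,
                 {fn}() OVER (PARTITION BY Department
                              ORDER BY Salary DESC) AS rnk
          FROM Employee) AS t
    WHERE rnk = 2
"""
dense = conn.execute(query.format(fn="DENSE_RANK")).fetchall()
rownum = conn.execute(query.format(fn="ROW_NUMBER")).fetchall()
print(dense)   # [('HR', 6000)]  -- second-highest distinct salary
print(rownum)  # [('HR', 7000)]  -- second row, still the tied top value
```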
4. Using JOIN and Subquery
This method joins the table to itself, pairing each salary with the strictly higher salaries in the same department; a salary with exactly one distinct higher salary is the second-highest. (Grouping by salary as well as department is essential: grouping by department alone counts all higher salaries across the department and only works when a department has exactly two distinct salaries.)
SELECT e1.Department, e1.Salary AS SecondHighestSalary
FROM Employee e1
JOIN Employee e2 ON e1.Department = e2.Department AND e1.Salary < e2.Salary
GROUP BY e1.Department, e1.Salary
HAVING COUNT(DISTINCT e2.Salary) = 1;
5. Using LIMIT and OFFSET (MySQL/MariaDB Specific)
If you're using a database that supports LIMIT and OFFSET, such as MySQL or MariaDB, you can fetch the second-highest salary directly. A single LIMIT 1 OFFSET 1 only works for one group at a time, so for a per-department result it must live in a correlated subquery:
SELECT DISTINCT
    Department,
    (SELECT DISTINCT Salary
     FROM Employee e2
     WHERE e2.Department = e1.Department
     ORDER BY Salary DESC
     LIMIT 1 OFFSET 1) AS SecondHighestSalary
FROM Employee e1;
6. Using DISTINCT with a Correlated Subquery (SQL Server Specific)
In SQL Server, you can apply the correlated-subquery approach to a derived table of DISTINCT salaries, which handles duplicate salary values cleanly.
SELECT Department, MAX(Salary) AS SecondHighestSalary
FROM (
SELECT DISTINCT Salary, Department
FROM Employee
) AS DistinctSalaries
WHERE Salary < (SELECT MAX(Salary)
FROM Employee e2
WHERE DistinctSalaries.Department = e2.Department)
GROUP BY Department;
7. Using CROSS APPLY (SQL Server Specific)
SQL Server’s CROSS APPLY can be useful for this kind of problem: it runs an ordered per-department subquery for each row on its left side. (The common TOP 2 plus != pattern is a trap here: for the highest-paid employee it keeps the top salary itself, so an OFFSET ... FETCH subquery over distinct salaries is used instead.)
SELECT d.Department, s.Salary AS SecondHighestSalary
FROM (SELECT DISTINCT Department FROM Employee) d
CROSS APPLY (
    SELECT DISTINCT Salary
    FROM Employee e
    WHERE e.Department = d.Department
    ORDER BY Salary DESC
    OFFSET 1 ROW FETCH NEXT 1 ROW ONLY
) AS s;
Example Table and Query Results
Assuming we have the following Employee table:
EmployeeID | Name | Department | Salary |
---|---|---|---|
1 | Alice | HR | 70000 |
2 | Bob | HR | 80000 |
3 | Charlie | IT | 60000 |
4 | David | IT | 75000 |
5 | Eve | IT | 90000 |
6 | Frank | Sales | 50000 |
7 | Grace | Sales | 60000 |
8 | Helen | Sales | 55000 |
Each method would yield:
Department | SecondHighestSalary |
---|---|
HR | 70000 |
IT | 75000 |
Sales | 55000 |
These methods provide various ways to find the second-highest salary for each department using different SQL techniques and functions. Choose the one that best fits your SQL dialect and requirements.
Q2. For each student, find the difference in scores for each subject from the predecessor (previous) year, the difference from the successor (next) year, and the difference in total scores from the previous and next years, in SQL and Spark SQL?
Let's break this down into parts and provide solutions in both standard SQL and Spark SQL. We'll assume a table Scores with columns StudentID, Year, Subject, and Score.
1. Difference of Scores for Each Subject from the Predecessor Year
For each student, calculate the difference in scores for each subject from the previous year (predecessor).
SQL
WITH RankedScores AS (
SELECT
StudentID,
Subject,
Year,
Score,
LAG(Score) OVER (PARTITION BY StudentID, Subject ORDER BY Year) AS PrevScore
FROM Scores
)
SELECT
StudentID,
Subject,
Year,
Score,
Score - PrevScore AS ScoreDiffFromPredecessor
FROM RankedScores
WHERE PrevScore IS NOT NULL;
Spark SQL
Window functions cannot appear in a WHERE clause, so wrap the calculation in a subquery (or reuse the same CTE, which Spark SQL also supports):
SELECT *
FROM (
    SELECT
        StudentID,
        Subject,
        Year,
        Score,
        Score - LAG(Score) OVER (PARTITION BY StudentID, Subject ORDER BY Year) AS ScoreDiffFromPredecessor
    FROM Scores
) t
WHERE ScoreDiffFromPredecessor IS NOT NULL;
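The LAG-based predecessor difference can be checked with Python's sqlite3 (SQLite 3.25+ supports LAG; the sample scores are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Scores (StudentID INT, Subject TEXT, Year INT, Score INT)")
conn.executemany("INSERT INTO Scores VALUES (?, ?, ?, ?)", [
    (1, "Math", 2021, 70), (1, "Math", 2022, 75), (1, "Math", 2023, 85),
])

rows = conn.execute("""
    WITH RankedScores AS (
        SELECT StudentID, Subject, Year, Score,
               LAG(Score) OVER (PARTITION BY StudentID, Subject
                                ORDER BY Year) AS PrevScore
        FROM Scores
    )
    SELECT Year, Score - PrevScore AS Diff
    FROM RankedScores
    WHERE PrevScore IS NOT NULL
    ORDER BY Year
""").fetchall()
print(rows)  # [(2022, 5), (2023, 10)]
```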
2. Difference of Scores for Each Subject from Successor
For each student, calculate the difference in scores for each subject from the next year (successor).
SQL
WITH RankedScores AS (
SELECT
StudentID,
Subject,
Year,
Score,
LEAD(Score) OVER (PARTITION BY StudentID, Subject ORDER BY Year) AS NextScore
FROM Scores
)
SELECT
StudentID,
Subject,
Year,
Score,
Score - NextScore AS ScoreDiffFromSuccessor
FROM RankedScores
WHERE NextScore IS NOT NULL;
Spark SQL
Again, the window function must be moved out of the WHERE clause via a subquery:
SELECT *
FROM (
    SELECT
        StudentID,
        Subject,
        Year,
        Score,
        Score - LEAD(Score) OVER (PARTITION BY StudentID, Subject ORDER BY Year) AS ScoreDiffFromSuccessor
    FROM Scores
) t
WHERE ScoreDiffFromSuccessor IS NOT NULL;
3. Difference of Total Scores for Each Student from Predecessor and Successor
For each student, calculate the difference in total scores from the previous and next year.
SQL
WITH TotalScores AS (
SELECT
StudentID,
Year,
SUM(Score) AS TotalScore
FROM Scores
GROUP BY StudentID, Year
),
RankedTotalScores AS (
SELECT
StudentID,
Year,
TotalScore,
LAG(TotalScore) OVER (PARTITION BY StudentID ORDER BY Year) AS PrevTotalScore,
LEAD(TotalScore) OVER (PARTITION BY StudentID ORDER BY Year) AS NextTotalScore
FROM TotalScores
)
SELECT
StudentID,
Year,
TotalScore,
TotalScore - PrevTotalScore AS TotalScoreDiffFromPredecessor,
TotalScore - NextTotalScore AS TotalScoreDiffFromSuccessor
FROM RankedTotalScores
WHERE PrevTotalScore IS NOT NULL OR NextTotalScore IS NOT NULL;
Spark SQL
As before, window functions cannot be referenced in WHERE, so compute them in a CTE first; the standard-SQL query above also runs unchanged in Spark SQL:
WITH TotalScores AS (
    SELECT
        StudentID,
        Year,
        SUM(Score) AS TotalScore
    FROM Scores
    GROUP BY StudentID, Year
),
RankedTotalScores AS (
    SELECT
        StudentID,
        Year,
        TotalScore,
        LAG(TotalScore) OVER (PARTITION BY StudentID ORDER BY Year) AS PrevTotalScore,
        LEAD(TotalScore) OVER (PARTITION BY StudentID ORDER BY Year) AS NextTotalScore
    FROM TotalScores
)
SELECT
    StudentID,
    Year,
    TotalScore,
    TotalScore - PrevTotalScore AS TotalScoreDiffFromPredecessor,
    TotalScore - NextTotalScore AS TotalScoreDiffFromSuccessor
FROM RankedTotalScores
WHERE PrevTotalScore IS NOT NULL OR NextTotalScore IS NOT NULL;
4. Difference of Scores from Previous Year for a Student
For each student, calculate the difference in scores from the previous year.
SQL
WITH RankedScores AS (
SELECT
StudentID,
Year,
SUM(Score) AS TotalScore,
LAG(SUM(Score)) OVER (PARTITION BY StudentID ORDER BY Year) AS PrevTotalScore
FROM Scores
GROUP BY StudentID, Year
)
SELECT
StudentID,
Year,
TotalScore,
TotalScore - PrevTotalScore AS ScoreDiffFromPreviousYear
FROM RankedScores
WHERE PrevTotalScore IS NOT NULL;
Spark SQL
HAVING cannot filter on a window function either, so wrap the aggregation in a subquery:
SELECT StudentID, Year, TotalScore,
       TotalScore - PrevTotalScore AS ScoreDiffFromPreviousYear
FROM (
    SELECT
        StudentID,
        Year,
        SUM(Score) AS TotalScore,
        LAG(SUM(Score)) OVER (PARTITION BY StudentID ORDER BY Year) AS PrevTotalScore
    FROM Scores
    GROUP BY StudentID, Year
) t
WHERE PrevTotalScore IS NOT NULL;
Explanation
- LAG(): Retrieves the value of a column from a preceding row within the same partition.
- LEAD(): Retrieves the value of a column from a following row within the same partition.
- PARTITION BY: Divides the result set into partitions to which the window function is applied.
- ORDER BY: Defines the logical order of the rows within each partition.
Q3. Difference of scores for each subject from the predecessor within the same class, difference of scores for each subject from the successor within the same class, difference of a student's total score from their predecessor and successor in the same class, and difference of scores from the previous year for a student, in SQL and Spark SQL?
Let’s break down the requirements and then implement them in SQL and Spark SQL. We’ll assume we have a table with the following structure:
CREATE TABLE StudentScores (
StudentID INT,
Year INT,
Class INT,
Subject VARCHAR(50),
Score INT
);
Requirements
- Difference of scores for each subject from the predecessor within the same class.
- Difference of scores for each subject from the successor within the same class.
- Difference of total scores for a student from their predecessor and successor in the same class.
- Difference of scores from the previous year for a student.
SQL Implementation
Let’s assume we are working with a relational database supporting standard SQL. We’ll use window functions for some of these requirements.
1. Difference of scores for each subject from predecessor for each subject
SELECT
StudentID,
Year,
Class,
Subject,
Score,
Score - LAG(Score) OVER (PARTITION BY Class, Subject ORDER BY Year) AS DiffFromPredecessor
FROM
StudentScores;
2. Difference of scores for each subject from successor for each subject
SELECT
StudentID,
Year,
Class,
Subject,
Score,
LEAD(Score) OVER (PARTITION BY Class, Subject ORDER BY Year) - Score AS DiffFromSuccessor
FROM
StudentScores;
3. Difference of total scores for a student from their predecessor and successor in the same class
First, let’s compute the total scores per student per year:
WITH TotalScores AS (
SELECT
StudentID,
Year,
Class,
SUM(Score) AS TotalScore
FROM
StudentScores
GROUP BY
StudentID, Year, Class
)
SELECT
StudentID,
Year,
Class,
TotalScore,
TotalScore - LAG(TotalScore) OVER (PARTITION BY Class ORDER BY Year) AS DiffFromPredecessor,
LEAD(TotalScore) OVER (PARTITION BY Class ORDER BY Year) - TotalScore AS DiffFromSuccessor
FROM
TotalScores;
4. Difference of scores from the previous year for a student
SELECT
StudentID,
Year,
Class,
Subject,
Score,
Score - LAG(Score) OVER (PARTITION BY StudentID, Subject ORDER BY Year) AS DiffFromPreviousYear
FROM
StudentScores;
Spark SQL Implementation
The implementation in Spark SQL would be similar. Here’s how you can do it:
Setup
First, let’s create a DataFrame df
with the appropriate schema.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F
spark = SparkSession.builder.appName("StudentScores").getOrCreate()
data = [
(1, 2023, 10, 'Math', 85),
(1, 2022, 10, 'Math', 80),
(2, 2023, 10, 'Math', 75),
(2, 2022, 10, 'Math', 70),
(1, 2023, 10, 'Science', 90),
(1, 2022, 10, 'Science', 85),
(2, 2023, 10, 'Science', 88),
(2, 2022, 10, 'Science', 84)
]
columns = ['StudentID', 'Year', 'Class', 'Subject', 'Score']
df = spark.createDataFrame(data, columns)
df.createOrReplaceTempView("StudentScores")
1. Difference of scores for each subject from predecessor for each subject
window_spec = Window.partitionBy('Class', 'Subject').orderBy('Year')
df.withColumn('DiffFromPredecessor', F.col('Score') - F.lag('Score').over(window_spec)).show()
2. Difference of scores for each subject from successor for each subject
df.withColumn('DiffFromSuccessor', F.lead('Score').over(window_spec) - F.col('Score')).show()
3. Difference of total scores for a student from their predecessor and successor in the same class
First, compute the total scores per student per year:
total_scores = df.groupBy('StudentID', 'Year', 'Class').agg(F.sum('Score').alias('TotalScore'))
total_scores.createOrReplaceTempView("TotalScores")
window_spec_total = Window.partitionBy('Class').orderBy('Year')
(total_scores
    .withColumn('DiffFromPredecessor', F.col('TotalScore') - F.lag('TotalScore').over(window_spec_total))
    .withColumn('DiffFromSuccessor', F.lead('TotalScore').over(window_spec_total) - F.col('TotalScore'))
    .show())
4. Difference of scores from the previous year for a student
window_spec_student = Window.partitionBy('StudentID', 'Subject').orderBy('Year')
df.withColumn('DiffFromPreviousYear', F.col('Score') - F.lag('Score').over(window_spec_student)).show()
These implementations use window functions to calculate the required differences in both standard SQL and Spark SQL. Adjust column names and table names as per your actual schema.
4. Rolling Sum and Nth Salary Based Questions in MySQL or Spark SQL
Here are examples of how to perform rolling sums and find the nth salary using both MySQL and Spark SQL.
Rolling Sum
MySQL
To calculate a rolling sum in MySQL, you can use the SUM()
function with a window frame specification.
-- Create a sample table
CREATE TABLE salaries (
id INT PRIMARY KEY,
employee_name VARCHAR(50),
salary INT,
transaction_date DATE
);
-- Insert sample data
INSERT INTO salaries (id, employee_name, salary, transaction_date) VALUES
(1, 'Alice', 5000, '2023-01-01'),
(2, 'Bob', 6000, '2023-01-02'),
(3, 'Charlie', 7000, '2023-01-03'),
(4, 'Alice', 8000, '2023-01-04'),
(5, 'Bob', 9000, '2023-01-05');
-- Rolling sum query
SELECT
id,
employee_name,
salary,
transaction_date,
SUM(salary) OVER (ORDER BY transaction_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_sum
FROM
salaries
ORDER BY
transaction_date;
Spark SQL
To calculate a rolling sum in Spark SQL, use the SUM()
function with a window specification.
-- Create a sample table
CREATE TABLE salaries (
id INT,
employee_name STRING,
salary INT,
transaction_date DATE
);
-- Insert sample data
INSERT INTO salaries VALUES
(1, 'Alice', 5000, '2023-01-01'),
(2, 'Bob', 6000, '2023-01-02'),
(3, 'Charlie', 7000, '2023-01-03'),
(4, 'Alice', 8000, '2023-01-04'),
(5, 'Bob', 9000, '2023-01-05');
-- Rolling sum query
SELECT
id,
employee_name,
salary,
transaction_date,
SUM(salary) OVER (ORDER BY transaction_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_sum
FROM
salaries
ORDER BY
transaction_date;
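The rolling-sum frame can be verified with Python's sqlite3, which implements the same ROWS BETWEEN frame syntax (SQLite 3.25+; data copied from the sample above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE salaries (
    id INT, employee_name TEXT, salary INT, transaction_date TEXT)""")
conn.executemany("INSERT INTO salaries VALUES (?, ?, ?, ?)", [
    (1, "Alice", 5000, "2023-01-01"), (2, "Bob", 6000, "2023-01-02"),
    (3, "Charlie", 7000, "2023-01-03"), (4, "Alice", 8000, "2023-01-04"),
    (5, "Bob", 9000, "2023-01-05"),
])

# 3-row rolling sum: current row plus the 2 preceding rows.
rows = conn.execute("""
    SELECT id, SUM(salary) OVER (ORDER BY transaction_date
                                 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    FROM salaries ORDER BY transaction_date
""").fetchall()
print(rows)  # [(1, 5000), (2, 11000), (3, 18000), (4, 21000), (5, 24000)]
```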
Nth Salary
MySQL
To find the nth highest salary in MySQL, you can use LIMIT ... OFFSET or the DENSE_RANK() window function. Note that LIMIT and OFFSET accept only literal values (or prepared-statement parameters), not expressions such as @N - 1:
-- Nth highest salary query (e.g., 3rd highest salary: OFFSET N - 1 = 2)
SELECT DISTINCT
    salary
FROM
    salaries
ORDER BY
    salary DESC
LIMIT 1 OFFSET 2;
Alternatively, using DENSE_RANK() (the alias is salary_rank because RANK is a reserved word in MySQL 8+):
SET @N = 3;
WITH RankedSalaries AS (
    SELECT
        salary,
        DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM
        salaries
)
SELECT DISTINCT
    salary
FROM
    RankedSalaries
WHERE
    salary_rank = @N;
Spark SQL
Spark SQL has no MySQL-style user variables (and LIMIT @N would return the top N salaries, not the nth), so use DENSE_RANK() with a literal or substituted parameter:
-- Nth highest salary query (e.g., 3rd highest salary)
WITH RankedSalaries AS (
    SELECT
        salary,
        DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM
        salaries
)
SELECT DISTINCT
    salary
FROM
    RankedSalaries
WHERE
    salary_rank = 3;
- Rolling Sum: Both MySQL and Spark SQL use the SUM() function with a window specification to calculate a rolling sum.
- Nth Salary: Both MySQL and Spark SQL can use the DENSE_RANK() function to find the nth highest salary. The LIMIT and OFFSET combination can also be used in MySQL for simpler cases.
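The DENSE_RANK() approach can be sketched with sqlite3 and a parameterized rank instead of a session variable (illustrative data; 3rd highest of 5000–9000 is 7000):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (salary INT)")
conn.executemany("INSERT INTO salaries VALUES (?)",
                 [(5000,), (6000,), (7000,), (8000,), (9000,)])

n = 3  # nth highest salary to fetch
rows = conn.execute("""
    WITH RankedSalaries AS (
        SELECT salary,
               DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
        FROM salaries
    )
    SELECT DISTINCT salary FROM RankedSalaries WHERE salary_rank = ?
""", (n,)).fetchall()
print(rows)  # [(7000,)]
```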
These queries illustrate the power of SQL window functions in both MySQL and Spark SQL for advanced data processing tasks.
5. MySQL Query to Find Employees Who Earn More Than Their Managers
To find employees who earn more than their managers, we need to assume that we have a hierarchical table structure where each employee has a reference to their manager’s ID. Below is an example schema and the corresponding MySQL query:
Schema Example
Assume we have the following employees table:
CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(50),
salary DECIMAL(10, 2),
manager_id INT,
FOREIGN KEY (manager_id) REFERENCES employees(id)
);
Sample Data
INSERT INTO employees (id, name, salary, manager_id) VALUES
(1, 'Alice', 90000, NULL), -- Alice is the CEO, no manager
(2, 'Bob', 70000, 1), -- Bob reports to Alice
(3, 'Charlie', 80000, 1), -- Charlie reports to Alice
(4, 'David', 60000, 2), -- David reports to Bob
(5, 'Eva', 75000, 2), -- Eva reports to Bob
(6, 'Frank', 85000, 3), -- Frank reports to Charlie
(7, 'Grace', 50000, 3); -- Grace reports to Charlie
Query to Find Employees Who Earn More Than Their Managers
SELECT e.id, e.name, e.salary, e.manager_id, m.name AS manager_name, m.salary AS manager_salary
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
Explanation
- FROM employees e: selects all employees as e.
- JOIN employees m ON e.manager_id = m.id: joins the employees table with itself to get each employee's manager; m is an alias for the manager.
- WHERE e.salary > m.salary: filters the results to include only employees whose salary is greater than their manager's salary.
Result
Running the query on the sample data produces the employees who earn more than their managers. For the provided sample data only Eva (75000 > Bob's 70000) and Frank (85000 > Charlie's 80000) qualify; Charlie earns less than Alice, so he is not included:
id | name | salary | manager_id | manager_name | manager_salary |
---|---|---|---|---|---|
5 | Eva | 75000 | 2 | Bob | 70000 |
6 | Frank | 85000 | 3 | Charlie | 80000 |
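The self-join can be run as-is against sqlite3 with the sample data, confirming that only Eva and Frank out-earn their managers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees (
    id INT PRIMARY KEY, name TEXT, salary REAL, manager_id INT)""")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", [
    (1, "Alice", 90000, None), (2, "Bob", 70000, 1), (3, "Charlie", 80000, 1),
    (4, "David", 60000, 2), (5, "Eva", 75000, 2),
    (6, "Frank", 85000, 3), (7, "Grace", 50000, 3),
])

rows = conn.execute("""
    SELECT e.name, m.name
    FROM employees e JOIN employees m ON e.manager_id = m.id
    WHERE e.salary > m.salary
    ORDER BY e.id
""").fetchall()
print(rows)  # [('Eva', 'Bob'), ('Frank', 'Charlie')]
```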
6.Finding Employees with the Highest Salary in Each Department
MySQL:
SELECT department_id, employee_id, salary
FROM (
    SELECT department_id, employee_id, salary,
           RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
    FROM employees
) ranked_employees
WHERE salary_rank = 1;
Spark SQL:
SELECT department_id, employee_id, salary
FROM (
    SELECT department_id, employee_id, salary,
           RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
    FROM employees
) ranked_employees
WHERE salary_rank = 1;
7. Finding the Second Highest Salary in the Company
MySQL:
SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
Spark SQL:
SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
8. Finding Employees with More Than Average Salary in Their Department
MySQL:
SELECT e.employee_id, e.name, e.salary, e.department_id
FROM employees e
JOIN (
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
) d ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary;
Spark SQL:
SELECT e.employee_id, e.name, e.salary, e.department_id
FROM employees e
JOIN (
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
) d ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary;
9. Finding the Cumulative Salary by Department Ordered by Employee ID
MySQL:
SELECT employee_id, department_id, salary,
SUM(salary) OVER (PARTITION BY department_id ORDER BY employee_id) AS cumulative_salary
FROM employees;
Spark SQL:
SELECT employee_id, department_id, salary,
SUM(salary) OVER (PARTITION BY department_id ORDER BY employee_id) AS cumulative_salary
FROM employees;
10. Finding the Manager with the Maximum Number of Direct Reports
MySQL:
SELECT manager_id, COUNT(*) AS direct_reports
FROM employees
WHERE manager_id IS NOT NULL
GROUP BY manager_id
ORDER BY direct_reports DESC
LIMIT 1;
Spark SQL:
SELECT manager_id, COUNT(*) AS direct_reports
FROM employees
WHERE manager_id IS NOT NULL
GROUP BY manager_id
ORDER BY direct_reports DESC
LIMIT 1;
11. Identifying Gaps in Employee IDs
MySQL:
SELECT e1.id + 1 AS start_id, MIN(e2.id) - 1 AS end_id
FROM employees e1
JOIN employees e2 ON e2.id > e1.id
WHERE e1.id + 1 NOT IN (SELECT id FROM employees)
GROUP BY e1.id
ORDER BY start_id;
Spark SQL:
SELECT e1.id + 1 AS start_id, MIN(e2.id) - 1 AS end_id
FROM employees e1
JOIN employees e2 ON e2.id > e1.id
WHERE e1.id + 1 NOT IN (SELECT id FROM employees)
GROUP BY e1.id
ORDER BY start_id;
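The gap query pairs each id whose successor is missing with the smallest larger id; a sketch with sqlite3 and illustrative ids 1, 2, 5, 6, 9 (gaps 3–4 and 7–8):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INT PRIMARY KEY)")
conn.executemany("INSERT INTO employees VALUES (?)",
                 [(1,), (2,), (5,), (6,), (9,)])

# Each gap starts at a missing successor (e1.id + 1) and ends just
# before the next existing id (MIN(e2.id) - 1).
rows = conn.execute("""
    SELECT e1.id + 1 AS start_id, MIN(e2.id) - 1 AS end_id
    FROM employees e1
    JOIN employees e2 ON e2.id > e1.id
    WHERE e1.id + 1 NOT IN (SELECT id FROM employees)
    GROUP BY e1.id
    ORDER BY start_id
""").fetchall()
print(rows)  # [(3, 4), (7, 8)]
```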
12. Finding Employees Who Have Worked on All Projects
MySQL:
SELECT e.employee_id, e.name
FROM employees e
JOIN project_assignments pa ON e.employee_id = pa.employee_id
GROUP BY e.employee_id, e.name
HAVING COUNT(DISTINCT pa.project_id) = (SELECT COUNT(DISTINCT project_id) FROM projects);
Spark SQL:
SELECT e.employee_id, e.name
FROM employees e
JOIN project_assignments pa ON e.employee_id = pa.employee_id
GROUP BY e.employee_id, e.name
HAVING COUNT(DISTINCT pa.project_id) = (SELECT COUNT(DISTINCT project_id) FROM projects);
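This "relational division" pattern (count of distinct assigned projects equals the total project count) can be sketched with sqlite3; the tables and data here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INT, name TEXT)")
conn.execute("CREATE TABLE projects (project_id INT)")
conn.execute("CREATE TABLE project_assignments (employee_id INT, project_id INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO projects VALUES (?)", [(10,), (20,)])
# Alice works on both projects; Bob only on one.
conn.executemany("INSERT INTO project_assignments VALUES (?, ?)",
                 [(1, 10), (1, 20), (2, 10)])

rows = conn.execute("""
    SELECT e.employee_id, e.name
    FROM employees e
    JOIN project_assignments pa ON e.employee_id = pa.employee_id
    GROUP BY e.employee_id, e.name
    HAVING COUNT(DISTINCT pa.project_id) =
           (SELECT COUNT(DISTINCT project_id) FROM projects)
""").fetchall()
print(rows)  # [(1, 'Alice')]
```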
13. Finding the Average Salary by Job Title and Department
MySQL:
SELECT job_title, department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY job_title, department_id;
Spark SQL:
SELECT job_title, department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY job_title, department_id;
These queries should provide a comprehensive challenge and help deepen your understanding of both MySQL and Spark SQL.
14. You have two tables: ‘response_times’ with columns (request_id, response_time_ms, device_type_id) and ‘device_types’ with columns (device_type_id, device_name, manufacturer). Write a query to calculate the 95th percentile of response times for each device manufacturer.
MySQL
MySQL does not have a built-in function to directly calculate percentiles, but you can use the PERCENT_RANK()
window function to achieve this. The PERCENT_RANK()
function assigns a rank to each row within the partition of the result set and calculates the relative rank of that row. Here’s how you can do it:
WITH ranked_responses AS (
SELECT
rt.device_type_id,
dt.manufacturer,
rt.response_time_ms,
PERCENT_RANK() OVER (PARTITION BY dt.manufacturer ORDER BY rt.response_time_ms) AS percentile_rank
FROM
response_times rt
JOIN
device_types dt ON rt.device_type_id = dt.device_type_id
)
SELECT
manufacturer,
MAX(response_time_ms) AS response_time_95th_percentile
FROM
ranked_responses
WHERE
percentile_rank <= 0.95
GROUP BY
manufacturer;
Explanation:
- Common Table Expression (CTE): the ranked_responses CTE calculates the percentile rank for each response time within each manufacturer.
- PERCENT_RANK(): ranks each response time within its manufacturer, producing values between 0 and 1.
- Filtering and grouping: the main query keeps rows with a percentile rank of 0.95 or less and groups by manufacturer, so MAX(response_time_ms) is the 95th-percentile response time.
Spark SQL
Spark SQL provides the PERCENTILE and PERCENTILE_APPROX aggregate functions, which calculate percentiles directly. Here’s how you can achieve this with PERCENTILE_APPROX in Spark SQL:
SELECT
dt.manufacturer,
PERCENTILE_APPROX(rt.response_time_ms, 0.95) AS response_time_95th_percentile
FROM
response_times rt
JOIN
device_types dt ON rt.device_type_id = dt.device_type_id
GROUP BY
dt.manufacturer;
Explanation:
- Join: joins the response_times and device_types tables on device_type_id.
- PERCENTILE_APPROX(): approximates the 95th percentile of response times for each manufacturer.
- Group By: groups the results by manufacturer to get one percentile per manufacturer.
These queries will give you the 95th percentile of response times for each device manufacturer in both MySQL and Spark SQL.
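The PERCENT_RANK-based approach can be verified with Python's sqlite3, which also implements PERCENT_RANK. With 20 evenly spaced response times for one (illustrative) manufacturer, the largest value whose percent rank stays at or below 0.95 is 190 ms (the top value 200 ms has rank 1.0):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE response_times (request_id INT, response_time_ms INT, device_type_id INT)")
conn.execute("CREATE TABLE device_types (device_type_id INT, device_name TEXT, manufacturer TEXT)")
conn.execute("INSERT INTO device_types VALUES (1, 'PhoneX', 'Acme')")
# 20 response times: 10, 20, ..., 200 ms, all for the same manufacturer.
conn.executemany("INSERT INTO response_times VALUES (?, ?, 1)",
                 [(i, i * 10) for i in range(1, 21)])

rows = conn.execute("""
    WITH ranked_responses AS (
        SELECT dt.manufacturer, rt.response_time_ms,
               PERCENT_RANK() OVER (PARTITION BY dt.manufacturer
                                    ORDER BY rt.response_time_ms) AS percentile_rank
        FROM response_times rt
        JOIN device_types dt ON rt.device_type_id = dt.device_type_id
    )
    SELECT manufacturer, MAX(response_time_ms)
    FROM ranked_responses
    WHERE percentile_rank <= 0.95
    GROUP BY manufacturer
""").fetchall()
print(rows)  # [('Acme', 190)]
```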
15. Given a table ‘daily_visits’ with columns (visit_date, visit_count), write a query to calculate the 7-day moving average of daily visits for each date.
To calculate the 7-day moving average of daily visits for each date, you can use window functions in both MySQL and Spark SQL. Here are the queries for each:
MySQL
In MySQL, you can use the AVG()
window function with a frame specification to calculate the 7-day moving average:
SELECT
visit_date,
visit_count,
AVG(visit_count) OVER (ORDER BY visit_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_average_7_days
FROM
daily_visits
ORDER BY
visit_date;
Explanation:
- SELECT clause: returns visit_date, visit_count, and AVG(visit_count) OVER (...), the average visit count over the specified window frame.
- Window frame: ORDER BY visit_date orders the rows by date; ROWS BETWEEN 6 PRECEDING AND CURRENT ROW includes the current row and the previous 6 rows, a total of 7 days.
- ORDER BY clause: orders the result set by visit_date.
Spark SQL
In Spark SQL, the approach is similar, using the AVG()
window function with a frame specification:
SELECT
visit_date,
visit_count,
AVG(visit_count) OVER (ORDER BY visit_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_average_7_days
FROM
daily_visits
ORDER BY
visit_date;
Explanation:
The Spark SQL query is almost identical to the MySQL query. It calculates the 7-day moving average using the same window function and frame specification.
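The same frame specification works in sqlite3, which makes it easy to check the averaging window; with illustrative counts 10, 20, ..., 80, the first full 7-day window (days 1–7) averages 40:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_visits (visit_date TEXT, visit_count INT)")
# Eight days with counts 10, 20, ..., 80.
conn.executemany("INSERT INTO daily_visits VALUES (?, ?)",
                 [(f"2023-01-{d:02d}", d * 10) for d in range(1, 9)])

rows = conn.execute("""
    SELECT visit_date,
           AVG(visit_count) OVER (ORDER BY visit_date
                                  ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
    FROM daily_visits ORDER BY visit_date
""").fetchall()
print(rows[6])  # ('2023-01-07', 40.0) -- average of 10..70
print(rows[7])  # ('2023-01-08', 50.0) -- average of 20..80
```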
16. Given a table ‘stock_prices’ with columns (date, stock_symbol, closing_price), what’s the cumulative change in stock price compared to the starting price of the year?
To calculate the cumulative change in stock price compared to the starting price of the year for each stock symbol, you can use window functions in both MySQL and Spark SQL. Below are the queries for each database system:
MySQL
In MySQL, you can use the FIRST_VALUE()
window function to get the starting price of the year and then calculate the cumulative change:
SELECT
date,
stock_symbol,
closing_price,
closing_price - FIRST_VALUE(closing_price) OVER (PARTITION BY stock_symbol, YEAR(date) ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS cumulative_change
FROM
stock_prices
ORDER BY
stock_symbol, date;
Explanation:
- SELECT clause: returns date, stock_symbol, closing_price, and closing_price - FIRST_VALUE(closing_price) OVER (...), the cumulative change from the first closing price of the year.
- Window function: FIRST_VALUE(closing_price) OVER (PARTITION BY stock_symbol, YEAR(date) ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) gets the first closing price of the year for each stock symbol.
- ORDER BY clause: orders the result set by stock_symbol and date.
Spark SQL
In Spark SQL, the approach is similar, using the FIRST_VALUE()
window function to get the starting price of the year and then calculate the cumulative change:
SELECT
date,
stock_symbol,
closing_price,
closing_price - FIRST_VALUE(closing_price) OVER (PARTITION BY stock_symbol, YEAR(date) ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS cumulative_change
FROM
stock_prices
ORDER BY
stock_symbol, date;
Explanation:
The Spark SQL query is almost identical to the MySQL query. It calculates the cumulative change in stock price compared to the starting price of the year using the same window function and frame specification.
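A sketch of the FIRST_VALUE pattern with sqlite3, using strftime('%Y', date) in place of YEAR() since SQLite has no YEAR() function (the prices are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock_prices (date TEXT, stock_symbol TEXT, closing_price REAL)")
conn.executemany("INSERT INTO stock_prices VALUES (?, ?, ?)", [
    ("2023-01-02", "ABC", 100.0), ("2023-02-01", "ABC", 110.0),
    ("2023-03-01", "ABC", 95.0),
])

rows = conn.execute("""
    SELECT date,
           closing_price - FIRST_VALUE(closing_price) OVER (
               PARTITION BY stock_symbol, strftime('%Y', date)
               ORDER BY date
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS cumulative_change
    FROM stock_prices ORDER BY date
""").fetchall()
print(rows)  # [('2023-01-02', 0.0), ('2023-02-01', 10.0), ('2023-03-01', -5.0)]
```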
17. You have two tables: ‘products’ with columns (product_id, product_name, category_id, price) and ‘categories’ with columns (category_id, category_name). What is the price difference between each product and the next most expensive product in that category?
To find the price difference between each product and the next most expensive product within the same category, you can use window functions in both MySQL and Spark SQL. Here are the queries for each database system:
MySQL
In MySQL, you can use the LEAD()
window function to get the price of the next most expensive product within each category:
SELECT
p.product_id,
p.product_name,
p.category_id,
p.price,
c.category_name,
LEAD(p.price) OVER (PARTITION BY p.category_id ORDER BY p.price DESC) AS next_higher_price,
p.price - LEAD(p.price) OVER (PARTITION BY p.category_id ORDER BY p.price DESC) AS price_difference
FROM
products p
JOIN
categories c ON p.category_id = c.category_id
ORDER BY
p.category_id, p.price DESC;
Explanation:
- SELECT clause: selects the needed columns from the products and categories tables; LEAD(p.price) OVER (...) retrieves the price of the next most expensive product in the same category, and the subtraction gives the price difference.
- JOIN clause: joins the products and categories tables on category_id.
- Window function: PARTITION BY p.category_id ORDER BY p.price DESC partitions the data by category_id and orders it by price in descending order.
- ORDER BY clause: orders the result set by category_id and price in descending order.
Spark SQL
In Spark SQL, the approach is similar, using the LEAD()
window function to get the price of the next most expensive product within each category:
SELECT
p.product_id,
p.product_name,
p.category_id,
p.price,
c.category_name,
LEAD(p.price) OVER (PARTITION BY p.category_id ORDER BY p.price DESC) AS next_higher_price,
p.price - LEAD(p.price) OVER (PARTITION BY p.category_id ORDER BY p.price DESC) AS price_difference
FROM
products p
JOIN
categories c ON p.category_id = c.category_id
ORDER BY
p.category_id, p.price DESC;
Explanation:
The Spark SQL query is almost identical to the MySQL query. It calculates the price difference between each product and the next most expensive product within the same category using the same window function and frame specification.
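The LEAD-over-descending-price pattern can be checked with sqlite3; with three illustrative products priced 30, 20, and 10 in one category, each row's difference to the next cheaper product is 10, and the cheapest has no successor (NULL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INT, product_name TEXT, category_id INT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", [
    (1, "A", 1, 30.0), (2, "B", 1, 20.0), (3, "C", 1, 10.0),
])

rows = conn.execute("""
    SELECT product_name, price,
           price - LEAD(price) OVER (PARTITION BY category_id
                                     ORDER BY price DESC) AS price_difference
    FROM products ORDER BY price DESC
""").fetchall()
print(rows)  # [('A', 30.0, 10.0), ('B', 20.0, 10.0), ('C', 10.0, None)]
```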