1.To find the second-highest salary for each department in SQL?

Here are multiple methods to find the second-highest salary for each department in SQL, using different approaches.

1. Using Correlated Subquery

This approach involves using a subquery to find the highest salary for each department and then excluding it.

SELECT Department, MAX(Salary) AS SecondHighestSalary
FROM Employee e1
WHERE Salary < (SELECT MAX(Salary)
                FROM Employee e2
                WHERE e1.Department = e2.Department)
GROUP BY Department;

The subquery in SQL statement:

(SELECT MAX(Salary)
FROM Employee e2
WHERE e1.Department = e2.Department)

is calculating the maximum salary within the same department for each employee’s department in the outer query. This subquery is evaluated for each department that the employee e1 belongs to.

Explanation:

  • Outer Query (e1): The outer query is grouped by the Department and is responsible for finding the second-highest salary in each department.
  • Subquery (e2): The subquery is used to find the maximum salary in the department that the current row (from the outer query) belongs to. The WHERE e1.Department = e2.Department clause ensures that the subquery’s MAX(Salary) is only considering salaries within the same department as the current employee in e1.

To Clarify:

  • MAX(Salary) in Subquery: The MAX(Salary) is calculated for each department individually, not for the entire table. This happens because the subquery is correlated with the outer query by the department condition (e1.Department = e2.Department).
  • Execution Flow: For each row in the outer query, the subquery finds the maximum salary for the department of that row. The outer query then checks if the employee’s salary is less than this maximum salary and identifies the maximum salary among those that are less than the highest salary (which is the second-highest salary).

Example:

Consider the following data in the Employee table:

EmployeeIDDepartmentSalary
1HR5000
2HR7000
3HR6000
4IT8000
5IT12000
6IT9000

For the HR department:

  • The subquery for each e1 in HR will compute the maximum salary (7000) within HR.
  • The outer query will then find the highest salary less than 7000, which is 6000.

For the IT department:

  • The subquery for each e1 in IT will compute the maximum salary (12000) within IT.
  • The outer query will then find the highest salary less than 12000, which is 9000.

The final result will show the second-highest salary per department:

DepartmentSecondHighestSalary
HR6000
IT9000

2. Using Window Functions

Window functions provide a powerful way to solve this problem.

WITH RankedSalaries AS (
    SELECT 
        Department,
        Salary,
        DENSE_RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS Rank
    FROM Employee
)
SELECT 
    Department,
    Salary AS SecondHighestSalary
FROM RankedSalaries
WHERE Rank = 2;

3. Using Common Table Expressions (CTE) and ROW_NUMBER()

This approach uses a CTE to rank salaries and then selects the second-highest salary.

WITH RankedSalaries AS (
    SELECT 
        Department,
        Salary,
        ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC) AS RowNum
    FROM Employee
)
SELECT 
    Department,
    Salary AS SecondHighestSalary
FROM RankedSalaries
WHERE RowNum = 2;

4. Using JOIN and Subquery

This method uses a join and subquery to find the second-highest salary for each department.

SELECT e1.Department, MAX(e1.Salary) AS SecondHighestSalary
FROM Employee e1
JOIN Employee e2 ON e1.Department = e2.Department AND e1.Salary < e2.Salary
GROUP BY e1.Department
HAVING COUNT(DISTINCT e2.Salary) = 1;

5. Using LIMIT and OFFSET (MySQL/MariaDB Specific)

If you’re using a database that supports LIMIT and OFFSET, such as MySQL or MariaDB, you can use them to directly fetch the second-highest salary.

SELECT Department, Salary AS SecondHighestSalary
FROM (
    SELECT Department, Salary
    FROM Employee
    ORDER BY Department, Salary DESC
    LIMIT 2 OFFSET 1
) AS SecondHighest
GROUP BY Department;

6. Using DISTINCT and TOP (SQL Server Specific)

In SQL Server, you can use DISTINCT with TOP and ORDER BY to achieve this.

SELECT Department, MAX(Salary) AS SecondHighestSalary
FROM (
    SELECT DISTINCT Salary, Department
    FROM Employee
) AS DistinctSalaries
WHERE Salary < (SELECT MAX(Salary)
                FROM Employee e2
                WHERE DistinctSalaries.Department = e2.Department)
GROUP BY Department;

7. Using CROSS APPLY (SQL Server Specific)

SQL Server’s CROSS APPLY can be useful for this kind of problem.

SELECT Department, MAX(Salary) AS SecondHighestSalary
FROM (
    SELECT e1.Department, e1.Salary
    FROM Employee e1
    CROSS APPLY (
        SELECT TOP 2 Salary
        FROM Employee e2
        WHERE e1.Department = e2.Department
        ORDER BY Salary DESC
    ) AS TopSalaries
    WHERE TopSalaries.Salary != e1.Salary
) AS SecondHighestSalaries
GROUP BY Department;

Example Table and Query Results

Assuming we have the following Employee table:

EmployeeIDNameDepartmentSalary
1AliceHR70000
2BobHR80000
3CharlieIT60000
4DavidIT75000
5EveIT90000
6FrankSales50000
7GraceSales60000
8HelenSales55000

Each method would yield:

DepartmentSecondHighestSalary
HR70000
IT75000
Sales55000

These methods provide various ways to find the second-highest salary for each department using different SQL techniques and functions. Choose the one that best fits your SQL dialect and requirements.

Q2. Difference of scores for each subjects from Predecessor year for each subject ,difference of scores for each subject from successor year for each subject , difference of total scores for each student from predecessor and successor from previous year for a student in SQL, Spark sql?

let’s break this down into different parts and provide solutions in both standard SQL and Spark SQL. We’ll assume we have a table Scores with columns StudentID, Year, Subject, and Score.

1. Difference of Scores for Each Subject from Predecessor

For each student, calculate the difference in scores for each subject from the previous year (predecessor).

WITH RankedScores AS (
    SELECT 
        StudentID,
        Subject,
        Year,
        Score,
        LAG(Score) OVER (PARTITION BY StudentID, Subject ORDER BY Year) AS PrevScore
    FROM Scores
)
SELECT
    StudentID,
    Subject,
    Year,
    Score,
    Score - PrevScore AS ScoreDiffFromPredecessor
FROM RankedScores
WHERE PrevScore IS NOT NULL;

Spark SQL

SELECT 
    StudentID,
    Subject,
    Year,
    Score,
    Score - LAG(Score) OVER (PARTITION BY StudentID, Subject ORDER BY Year) AS ScoreDiffFromPredecessor
FROM Scores
WHERE LAG(Score) OVER (PARTITION BY StudentID, Subject ORDER BY Year) IS NOT NULL;

2. Difference of Scores for Each Subject from Successor

For each student, calculate the difference in scores for each subject from the next year (successor).

SQL

sqlCopy codeWITH RankedScores AS (
    SELECT 
        StudentID,
        Subject,
        Year,
        Score,
        LEAD(Score) OVER (PARTITION BY StudentID, Subject ORDER BY Year) AS NextScore
    FROM Scores
)
SELECT
    StudentID,
    Subject,
    Year,
    Score,
    Score - NextScore AS ScoreDiffFromSuccessor
FROM RankedScores
WHERE NextScore IS NOT NULL;

Spark SQL

SELECT 
    StudentID,
    Subject,
    Year,
    Score,
    Score - LEAD(Score) OVER (PARTITION BY StudentID, Subject ORDER BY Year) AS ScoreDiffFromSuccessor
FROM Scores
WHERE LEAD(Score) OVER (PARTITION BY StudentID, Subject ORDER BY Year) IS NOT NULL;

3. Difference of Total Scores for Each Student from Predecessor and Successor

For each student, calculate the difference in total scores from the previous and next year.

sqlCopy codeWITH TotalScores AS (
    SELECT
        StudentID,
        Year,
        SUM(Score) AS TotalScore
    FROM Scores
    GROUP BY StudentID, Year
),
RankedTotalScores AS (
    SELECT
        StudentID,
        Year,
        TotalScore,
        LAG(TotalScore) OVER (PARTITION BY StudentID ORDER BY Year) AS PrevTotalScore,
        LEAD(TotalScore) OVER (PARTITION BY StudentID ORDER BY Year) AS NextTotalScore
    FROM TotalScores
)
SELECT
    StudentID,
    Year,
    TotalScore,
    TotalScore - PrevTotalScore AS TotalScoreDiffFromPredecessor,
    TotalScore - NextTotalScore AS TotalScoreDiffFromSuccessor
FROM RankedTotalScores
WHERE PrevTotalScore IS NOT NULL OR NextTotalScore IS NOT NULL;

Spark SQL

WITH TotalScores AS (
    SELECT
        StudentID,
        Year,
        SUM(Score) AS TotalScore
    FROM Scores
    GROUP BY StudentID, Year
)
SELECT
    StudentID,
    Year,
    TotalScore,
    TotalScore - LAG(TotalScore) OVER (PARTITION BY StudentID ORDER BY Year) AS TotalScoreDiffFromPredecessor,
    TotalScore - LEAD(TotalScore) OVER (PARTITION BY StudentID ORDER BY Year) AS TotalScoreDiffFromSuccessor
FROM TotalScores
WHERE LAG(TotalScore) OVER (PARTITION BY StudentID ORDER BY Year) IS NOT NULL
   OR LEAD(TotalScore) OVER (PARTITION BY StudentID ORDER BY Year) IS NOT NULL;

4. Difference of Scores from Previous Year for a Student

For each student, calculate the difference in scores from the previous year.

SQL

WITH RankedScores AS (
    SELECT 
        StudentID,
        Year,
        SUM(Score) AS TotalScore,
        LAG(SUM(Score)) OVER (PARTITION BY StudentID ORDER BY Year) AS PrevTotalScore
    FROM Scores
    GROUP BY StudentID, Year
)
SELECT
    StudentID,
    Year,
    TotalScore,
    TotalScore - PrevTotalScore AS ScoreDiffFromPreviousYear
FROM RankedScores
WHERE PrevTotalScore IS NOT NULL;

Spark SQL

SELECT 
    StudentID,
    Year,
    SUM(Score) AS TotalScore,
    SUM(Score) - LAG(SUM(Score)) OVER (PARTITION BY StudentID ORDER BY Year) AS ScoreDiffFromPreviousYear
FROM Scores
GROUP BY StudentID, Year
HAVING LAG(SUM(Score)) OVER (PARTITION BY StudentID ORDER BY Year) IS NOT NULL;

Explanation

  • LAG(): Retrieves the value of a column from a preceding row within the same partition.
  • LEAD(): Retrieves the value of a column from a following row within the same partition.
  • PARTITION BY: Divides the result set into partitions to which the window function is applied.
  • ORDER BY: Defines the logical order of the rows within each partition.

Q3.Difference of scores for each subjects from Predecessor only for each subject in same class, difference of scores for each subject from successor only for each subject in same class , difference of total scores for a student from his predecessor and successor in a same class, also difference of scores from previous year for a student in SQL, Spark sql?

Let’s break down the requirements and then implement them in SQL and Spark SQL. We’ll assume we have a table with the following structure:

CREATE TABLE StudentScores (
    StudentID INT,
    Year INT,
    Class INT,
    Subject VARCHAR(50),
    Score INT
);

Requirements

  1. Difference of scores for each subject from predecessor for each subject.
  2. Difference of scores for each subject from successor for each subject.
  3. Difference of total scores for a student from their predecessor and successor in the same class.
  4. Difference of scores from the previous year for a student.

SQL Implementation

Let’s assume we are working with a relational database supporting standard SQL. We’ll use window functions for some of these requirements.

1. Difference of scores for each subject from predecessor for each subject

SELECT 
    StudentID,
    Year,
    Class,
    Subject,
    Score,
    Score - LAG(Score) OVER (PARTITION BY Class, Subject ORDER BY Year) AS DiffFromPredecessor
FROM 
    StudentScores;

2. Difference of scores for each subject from successor for each subject

SELECT 
    StudentID,
    Year,
    Class,
    Subject,
    Score,
    LEAD(Score) OVER (PARTITION BY Class, Subject ORDER BY Year) - Score AS DiffFromSuccessor
FROM 
    StudentScores;

3. Difference of total scores for a student from their predecessor and successor in the same class

First, let’s compute the total scores per student per year:

WITH TotalScores AS (
    SELECT 
        StudentID,
        Year,
        Class,
        SUM(Score) AS TotalScore
    FROM 
        StudentScores
    GROUP BY 
        StudentID, Year, Class
)
SELECT 
    StudentID,
    Year,
    Class,
    TotalScore,
    TotalScore - LAG(TotalScore) OVER (PARTITION BY Class ORDER BY Year) AS DiffFromPredecessor,
    LEAD(TotalScore) OVER (PARTITION BY Class ORDER BY Year) - TotalScore AS DiffFromSuccessor
FROM 
    TotalScores;

4. Difference of scores from the previous year for a student

SELECT 
    StudentID,
    Year,
    Class,
    Subject,
    Score,
    Score - LAG(Score) OVER (PARTITION BY StudentID, Subject ORDER BY Year) AS DiffFromPreviousYear
FROM 
    StudentScores;

Spark SQL Implementation

The implementation in Spark SQL would be similar. Here’s how you can do it:

Setup

First, let’s create a DataFrame df with the appropriate schema.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("StudentScores").getOrCreate()

data = [
    (1, 2023, 10, 'Math', 85),
    (1, 2022, 10, 'Math', 80),
    (2, 2023, 10, 'Math', 75),
    (2, 2022, 10, 'Math', 70),
    (1, 2023, 10, 'Science', 90),
    (1, 2022, 10, 'Science', 85),
    (2, 2023, 10, 'Science', 88),
    (2, 2022, 10, 'Science', 84)
]

columns = ['StudentID', 'Year', 'Class', 'Subject', 'Score']

df = spark.createDataFrame(data, columns)
df.createOrReplaceTempView("StudentScores")

1. Difference of scores for each subject from predecessor for each subject

window_spec = Window.partitionBy('Class', 'Subject').orderBy('Year')

df.withColumn('DiffFromPredecessor', F.col('Score') - F.lag('Score').over(window_spec)).show()

2. Difference of scores for each subject from successor for each subject

df.withColumn('DiffFromSuccessor', F.lead('Score').over(window_spec) - F.col('Score')).show()

3. Difference of total scores for a student from their predecessor and successor in the same class

First, compute the total scores per student per year:

total_scores = df.groupBy('StudentID', 'Year', 'Class').agg(F.sum('Score').alias('TotalScore'))
total_scores.createOrReplaceTempView("TotalScores")

window_spec_total = Window.partitionBy('Class').orderBy('Year')

total_scores.withColumn('DiffFromPredecessor', F.col('TotalScore') - F.lag('TotalScore').over(window_spec_total)) 
            .withColumn('DiffFromSuccessor', F.lead('TotalScore').over(window_spec_total) - F.col('TotalScore')) 
            .show()

4. Difference of scores from the previous year for a student

window_spec_student = Window.partitionBy('StudentID', 'Subject').orderBy('Year')

df.withColumn('DiffFromPreviousYear', F.col('Score') - F.lag('Score').over(window_spec_student)).show()

These implementations use window functions to calculate the required differences in both standard SQL and Spark SQL. Adjust column names and table names as per your actual schema.

4.Rolling Sum and Nth salary based questions in MySQL or Spark SQL

Here are examples of how to perform rolling sums and find the nth salary using both MySQL and Spark SQL.

Rolling Sum

MySQL

To calculate a rolling sum in MySQL, you can use the SUM() function with a window frame specification.

-- Create a sample table
CREATE TABLE salaries (
    id INT PRIMARY KEY,
    employee_name VARCHAR(50),
    salary INT,
    transaction_date DATE
);

-- Insert sample data
INSERT INTO salaries (id, employee_name, salary, transaction_date) VALUES
(1, 'Alice', 5000, '2023-01-01'),
(2, 'Bob', 6000, '2023-01-02'),
(3, 'Charlie', 7000, '2023-01-03'),
(4, 'Alice', 8000, '2023-01-04'),
(5, 'Bob', 9000, '2023-01-05');

-- Rolling sum query
SELECT
    id,
    employee_name,
    salary,
    transaction_date,
    SUM(salary) OVER (ORDER BY transaction_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_sum
FROM
    salaries
ORDER BY
    transaction_date;

Spark SQL

To calculate a rolling sum in Spark SQL, use the SUM() function with a window specification.

-- Create a sample table
CREATE TABLE salaries (
    id INT,
    employee_name STRING,
    salary INT,
    transaction_date DATE
);

-- Insert sample data
INSERT INTO salaries VALUES
(1, 'Alice', 5000, '2023-01-01'),
(2, 'Bob', 6000, '2023-01-02'),
(3, 'Charlie', 7000, '2023-01-03'),
(4, 'Alice', 8000, '2023-01-04'),
(5, 'Bob', 9000, '2023-01-05');

-- Rolling sum query
SELECT
    id,
    employee_name,
    salary,
    transaction_date,
    SUM(salary) OVER (ORDER BY transaction_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_sum
FROM
    salaries
ORDER BY
    transaction_date;

Nth Salary

MySQL

To find the nth highest salary in MySQL, you can use the DENSE_RANK() function.

-- Nth highest salary query (e.g., 3rd highest salary)
SET @N = 3;

SELECT DISTINCT
    salary
FROM
    salaries
ORDER BY
    salary DESC
LIMIT 1 OFFSET @N - 1;

Alternatively, using DENSE_RANK():

WITH RankedSalaries AS (
SELECT
salary,
DENSE_RANK() OVER (ORDER BY salary DESC) AS rank
FROM
salaries
)
SELECT
salary
FROM
RankedSalaries
WHERE
rank = @N;

Spark SQL

To find the nth highest salary in Spark SQL, you can use the DENSE_RANK() function.

-- Nth highest salary query (e.g., 3rd highest salary)
SET @N = 3;

SELECT DISTINCT
    salary
FROM
    salaries
ORDER BY
    salary DESC
LIMIT @N;

Alternatively, using DENSE_RANK():

WITH RankedSalaries AS (
    SELECT
        salary,
        DENSE_RANK() OVER (ORDER BY salary DESC) AS rank
    FROM
        salaries
)
SELECT
    salary
FROM
    RankedSalaries
WHERE
    rank = @N;
  • Rolling Sum: Both MySQL and Spark SQL use the SUM() function with a window specification to calculate a rolling sum.
  • Nth Salary: Both MySQL and Spark SQL can use the DENSE_RANK() function to find the nth highest salary. The LIMIT and OFFSET combination can also be used in MySQL for simpler cases.

These queries illustrate the power of SQL window functions in both MySQL and Spark SQL for advanced data processing tasks.

5.Mysql Query to find employees who earn more than their managers

To find employees who earn more than their managers, we need to assume that we have a hierarchical table structure where each employee has a reference to their manager’s ID. Below is an example schema and the corresponding MySQL query:

Schema Example

Assume we have the following table employees:

CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(50),
salary DECIMAL(10, 2),
manager_id INT,
FOREIGN KEY (manager_id) REFERENCES employees(id)
);

Sample Data

INSERT INTO employees (id, name, salary, manager_id) VALUES
(1, 'Alice', 90000, NULL),      -- Alice is the CEO, no manager
(2, 'Bob', 70000, 1),           -- Bob reports to Alice
(3, 'Charlie', 80000, 1),       -- Charlie reports to Alice
(4, 'David', 60000, 2),         -- David reports to Bob
(5, 'Eva', 75000, 2),           -- Eva reports to Bob
(6, 'Frank', 85000, 3),         -- Frank reports to Charlie
(7, 'Grace', 50000, 3);         -- Grace reports to Charlie

Query to Find Employees Who Earn More Than Their Managers

SELECT e.id, e.name, e.salary, e.manager_id, m.name AS manager_name, m.salary AS manager_salary
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;

Explanation

  • FROM employees e: This selects all employees as e.
  • JOIN employees m ON e.manager_id = m.id: This joins the employees table with itself to get each employee’s manager. m is an alias for the manager.
  • WHERE e.salary > m.salary: This filters the results to include only those employees whose salary is greater than their manager’s salary.

Result

Running the query on the sample data will produce the employees who earn more than their managers. The expected result for the provided sample data is:

idnamesalarymanager_idmanager_namemanager_salary
3Charlie800001Alice90000
5Eva750002Bob70000
6Frank850003Charlie80000

6.Finding Employees with the Highest Salary in Each Department

MySQL:

SELECT department_id, employee_id, salary
FROM (
SELECT department_id, employee_id, salary,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) as rank
FROM employees
) ranked_employees
WHERE rank = 1;

Spark SQL:

SELECT department_id, employee_id, salary
FROM (
SELECT department_id, employee_id, salary,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) as rank
FROM employees
) ranked_employees
WHERE rank = 1;

7. Finding the Second Highest Salary in the Company

MySQL:

SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

Spark SQL:

SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

8. Finding Employees with More Than Average Salary in Their Department

MySQL:

SELECT e.employee_id, e.name, e.salary, e.department_id
FROM employees e
JOIN (
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
) d ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary;

Spark SQL:

SELECT e.employee_id, e.name, e.salary, e.department_id
FROM employees e
JOIN (
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
) d ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary;

9. Finding the Cumulative Salary by Department Ordered by Employee ID

MySQL:

SELECT employee_id, department_id, salary,
SUM(salary) OVER (PARTITION BY department_id ORDER BY employee_id) AS cumulative_salary
FROM employees;

Spark SQL:

SELECT employee_id, department_id, salary,
SUM(salary) OVER (PARTITION BY department_id ORDER BY employee_id) AS cumulative_salary
FROM employees;

10. Finding the Manager with the Maximum Number of Direct Reports

MySQL:

SELECT manager_id, COUNT(*) AS direct_reports
FROM employees
WHERE manager_id IS NOT NULL
GROUP BY manager_id
ORDER BY direct_reports DESC
LIMIT 1;

Spark SQL:

SELECT manager_id, COUNT(*) AS direct_reports
FROM employees
WHERE manager_id IS NOT NULL
GROUP BY manager_id
ORDER BY direct_reports DESC
LIMIT 1;

11. Identifying Gaps in Employee IDs

MySQL:

SELECT id + 1 AS start_id, MIN(id) - 1 AS end_id
FROM employees
WHERE (id + 1) NOT IN (SELECT id FROM employees)
GROUP BY id
ORDER BY start_id;

Spark SQL:

SELECT id + 1 AS start_id, MIN(id) - 1 AS end_id
FROM employees
WHERE (id + 1) NOT IN (SELECT id FROM employees)
GROUP BY id
ORDER BY start_id;

12. Finding Employees Who Have Worked on All Projects

MySQL:

SELECT e.employee_id, e.name
FROM employees e
JOIN project_assignments pa ON e.employee_id = pa.employee_id
GROUP BY e.employee_id, e.name
HAVING COUNT(DISTINCT pa.project_id) = (SELECT COUNT(DISTINCT project_id) FROM projects);

Spark SQL:

SELECT e.employee_id, e.name
FROM employees e
JOIN project_assignments pa ON e.employee_id = pa.employee_id
GROUP BY e.employee_id, e.name
HAVING COUNT(DISTINCT pa.project_id) = (SELECT COUNT(DISTINCT project_id) FROM projects);

13. Finding the Average Salary by Job Title and Department

MySQL:

SELECT job_title, department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY job_title, department_id;

Spark SQL:

SELECT job_title, department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY job_title, department_id;

These queries should provide a comprehensive challenge and help deepen your understanding of both MySQL and Spark SQL.

Source for Below Questions

14.You have two tables: ‘response_times’ with columns (request_id, response_time_ms, device_type_id) and ‘device_types’ with columns (device_type_id, device_name, manufacturer). Write a query to calculate the 95th percentile of response times for each device manufacturer.

MySQL

MySQL does not have a built-in function to directly calculate percentiles, but you can use the PERCENT_RANK() window function to achieve this. The PERCENT_RANK() function assigns a rank to each row within the partition of the result set and calculates the relative rank of that row. Here’s how you can do it:

WITH ranked_responses AS (
    SELECT
        rt.device_type_id,
        dt.manufacturer,
        rt.response_time_ms,
        PERCENT_RANK() OVER (PARTITION BY dt.manufacturer ORDER BY rt.response_time_ms) AS percentile_rank
    FROM
        response_times rt
    JOIN
        device_types dt ON rt.device_type_id = dt.device_type_id
)
SELECT
    manufacturer,
    MAX(response_time_ms) AS response_time_95th_percentile
FROM
    ranked_responses
WHERE
    percentile_rank <= 0.95
GROUP BY
    manufacturer;

Explanation:

  1. Common Table Expression (CTE):
    • ranked_responses CTE calculates the percentile rank for each response time within each manufacturer.
  2. PERCENT_RANK():
    • Calculates the rank of each response time within each manufacturer, resulting in values between 0 and 1.
  3. Filtering and Grouping:
    • The main query filters for rows with a percentile rank of 0.95 or less and then groups by manufacturer to get the 95th percentile.

Spark SQL

Spark SQL has a PERCENTILE function which simplifies calculating percentiles directly. Here’s how you can achieve this in Spark SQL:

SELECT
    dt.manufacturer,
    PERCENTILE_APPROX(rt.response_time_ms, 0.95) AS response_time_95th_percentile
FROM
    response_times rt
JOIN
    device_types dt ON rt.device_type_id = dt.device_type_id
GROUP BY
    dt.manufacturer;

Explanation:

  1. Join:
    • Joins the response_times and device_types tables on device_type_id.
  2. PERCENTILE_APPROX():
    • Uses the PERCENTILE_APPROX() function to approximate the 95th percentile of response times for each manufacturer.
  3. Group By:
    • Groups the results by manufacturer to get the 95th percentile for each manufacturer.

These queries will give you the 95th percentile of response times for each device manufacturer in both MySQL and Spark SQL.

15.Given a table ‘daily_visits’ with columns (visit_date, visit_count), write a query to calculate the 7-day moving average of daily visits for each date.

To calculate the 7-day moving average of daily visits for each date, you can use window functions in both MySQL and Spark SQL. Here are the queries for each:

MySQL

In MySQL, you can use the AVG() window function with a frame specification to calculate the 7-day moving average:

SELECT
    visit_date,
    visit_count,
    AVG(visit_count) OVER (ORDER BY visit_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_average_7_days
FROM
    daily_visits
ORDER BY
    visit_date;

Explanation:

  1. SELECT Clause:
    • visit_date: Selects the visit date.
    • visit_count: Selects the visit count.
    • AVG(visit_count) OVER (...): Calculates the average visit count over the specified window frame.
  2. Window Frame:
    • ORDER BY visit_date: Orders the rows by visit_date.
    • ROWS BETWEEN 6 PRECEDING AND CURRENT ROW: Defines the window frame to include the current row and the previous 6 rows, making a total of 7 days.
  3. ORDER BY Clause:
    • Orders the result set by visit_date.

Spark SQL

In Spark SQL, the approach is similar, using the AVG() window function with a frame specification:

SELECT
    visit_date,
    visit_count,
    AVG(visit_count) OVER (ORDER BY visit_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_average_7_days
FROM
    daily_visits
ORDER BY
    visit_date;

Explanation:

The Spark SQL query is almost identical to the MySQL query. It calculates the 7-day moving average using the same window function and frame specification.

16.Given a table ‘stock_prices’ with columns (date, stock_symbol, closing_price). What’s the cumulative change in stock price compared to the starting price of the year?

To calculate the cumulative change in stock price compared to the starting price of the year for each stock symbol, you can use window functions in both MySQL and Spark SQL. Below are the queries for each database system:

MySQL

In MySQL, you can use the FIRST_VALUE() window function to get the starting price of the year and then calculate the cumulative change:

SELECT
    date,
    stock_symbol,
    closing_price,
    closing_price - FIRST_VALUE(closing_price) OVER (PARTITION BY stock_symbol, YEAR(date) ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS cumulative_change
FROM
    stock_prices
ORDER BY
    stock_symbol, date;

Explanation:

  1. SELECT Clause:
    • date: Selects the date.
    • stock_symbol: Selects the stock symbol.
    • closing_price: Selects the closing price.
    • closing_price - FIRST_VALUE(closing_price) OVER (...): Calculates the cumulative change by subtracting the starting price of the year from the current closing price.
  2. Window Function:
    • FIRST_VALUE(closing_price) OVER (PARTITION BY stock_symbol, YEAR(date) ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING): Gets the first closing price of the year for each stock symbol.
  3. ORDER BY Clause:
    • Orders the result set by stock_symbol and date.

Spark SQL

In Spark SQL, the approach is similar, using the FIRST_VALUE() window function to get the starting price of the year and then calculate the cumulative change:

SELECT
    date,
    stock_symbol,
    closing_price,
    closing_price - FIRST_VALUE(closing_price) OVER (PARTITION BY stock_symbol, YEAR(date) ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS cumulative_change
FROM
    stock_prices
ORDER BY
    stock_symbol, date;

Explanation:

The Spark SQL query is almost identical to the MySQL query. It calculates the cumulative change in stock price compared to the starting price of the year using the same window function and frame specification.

17.You have two tables: ‘products’ with columns (product_id, product_name, category_id, price) and ‘categories’ with columns (category_id, category_name). What is the price difference between each product and the next most expensive product in that category?

To find the price difference between each product and the next most expensive product within the same category, you can use window functions in both MySQL and Spark SQL. Here are the queries for each database system:

MySQL

In MySQL, you can use the LEAD() window function to get the price of the next most expensive product within each category:

SELECT
    p.product_id,
    p.product_name,
    p.category_id,
    p.price,
    c.category_name,
    LEAD(p.price) OVER (PARTITION BY p.category_id ORDER BY p.price DESC) AS next_higher_price,
    p.price - LEAD(p.price) OVER (PARTITION BY p.category_id ORDER BY p.price DESC) AS price_difference
FROM
    products p
JOIN
    categories c ON p.category_id = c.category_id
ORDER BY
    p.category_id, p.price DESC;

Explanation:

  1. SELECT Clause:
    • Selects the necessary columns from the products and categories tables.
    • LEAD(p.price) OVER (...) retrieves the price of the next most expensive product within the same category.
    • Calculates the price difference between the current product and the next most expensive product.
  2. JOIN Clause:
    • Joins the products and categories tables on category_id.
  3. Window Function:
    • PARTITION BY p.category_id ORDER BY p.price DESC partitions the data by category_id and orders it by price in descending order.
  4. ORDER BY Clause:
    • Orders the result set by category_id and price in descending order.

Spark SQL

In Spark SQL, the approach is similar, using the LEAD() window function to get the price of the next most expensive product within each category:

SELECT
    p.product_id,
    p.product_name,
    p.category_id,
    p.price,
    c.category_name,
    LEAD(p.price) OVER (PARTITION BY p.category_id ORDER BY p.price DESC) AS next_higher_price,
    p.price - LEAD(p.price) OVER (PARTITION BY p.category_id ORDER BY p.price DESC) AS price_difference
FROM
    products p
JOIN
    categories c ON p.category_id = c.category_id
ORDER BY
    p.category_id, p.price DESC;

Explanation:

The Spark SQL query is almost identical to the MySQL query. It calculates the price difference between each product and the next most expensive product within the same category using the same window function and frame specification.

Trending