You have two tables: ‘response_times’ with columns (request_id, response_time_ms, device_type_id) and ‘device_types’ with columns (device_type_id, device_name, manufacturer). Write a query to calculate the 95th percentile of response times for each device manufacturer.
MySQL
MySQL has no built-in percentile function, but in MySQL 8.0 and later you can use the PERCENT_RANK() window function to achieve the same result. PERCENT_RANK() computes the relative rank of each row within its partition as (rank − 1) / (rows in partition − 1), yielding a value between 0 and 1. Here's how you can do it:
WITH ranked_responses AS (
    SELECT
        rt.device_type_id,
        dt.manufacturer,
        rt.response_time_ms,
        PERCENT_RANK() OVER (
            PARTITION BY dt.manufacturer
            ORDER BY rt.response_time_ms
        ) AS percentile_rank
    FROM response_times rt
    JOIN device_types dt ON rt.device_type_id = dt.device_type_id
)
SELECT
    manufacturer,
    MAX(response_time_ms) AS response_time_95th_percentile
FROM ranked_responses
WHERE percentile_rank <= 0.95
GROUP BY manufacturer;
Explanation:
- Common Table Expression (CTE): the ranked_responses CTE joins the two tables and computes the percentile rank of each response time within each manufacturer.
- PERCENT_RANK(): assigns each response time a relative rank between 0 and 1 within its manufacturer partition.
- Filtering and grouping: the outer query keeps rows whose percentile rank is at most 0.95, then takes the maximum remaining response time per manufacturer, which is the 95th percentile.
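Because this approach is standard SQL window syntax, it can be sanity-checked outside MySQL. The sketch below runs the same query against an in-memory SQLite database (SQLite 3.25+ also supports PERCENT_RANK()); the manufacturers and response times are made-up sample data.

```python
import sqlite3

# Made-up sample data: two manufacturers, 20 response times each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE device_types (device_type_id INTEGER, device_name TEXT, manufacturer TEXT);
CREATE TABLE response_times (request_id INTEGER, response_time_ms INTEGER, device_type_id INTEGER);
INSERT INTO device_types VALUES (1, 'PhoneX', 'Acme'), (2, 'TabletY', 'Acme'), (3, 'PhoneZ', 'Globex');
""")
rows = [(i, i * 10, 1) for i in range(1, 21)] + \
       [(100 + i, i * 5, 3) for i in range(1, 21)]
conn.executemany("INSERT INTO response_times VALUES (?, ?, ?)", rows)

query = """
WITH ranked_responses AS (
    SELECT
        dt.manufacturer,
        rt.response_time_ms,
        PERCENT_RANK() OVER (
            PARTITION BY dt.manufacturer
            ORDER BY rt.response_time_ms
        ) AS percentile_rank
    FROM response_times rt
    JOIN device_types dt ON rt.device_type_id = dt.device_type_id
)
SELECT manufacturer, MAX(response_time_ms) AS response_time_95th_percentile
FROM ranked_responses
WHERE percentile_rank <= 0.95
GROUP BY manufacturer
ORDER BY manufacturer
"""
# Acme has values 10..200: rank 19 of 20 gives PERCENT_RANK 18/19 ≈ 0.947,
# so 190 is the largest value kept; similarly 95 for Globex.
print(conn.execute(query).fetchall())  # prints [('Acme', 190), ('Globex', 95)]
```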
Spark SQL
Spark SQL has built-in percentile functions, which simplify the calculation. PERCENTILE_APPROX() computes an approximate percentile efficiently on large datasets (Spark also offers an exact PERCENTILE() aggregate). Here's how you can achieve this in Spark SQL:
SELECT
    dt.manufacturer,
    PERCENTILE_APPROX(rt.response_time_ms, 0.95) AS response_time_95th_percentile
FROM response_times rt
JOIN device_types dt ON rt.device_type_id = dt.device_type_id
GROUP BY dt.manufacturer;
Explanation:
- Join: joins the response_times and device_types tables on device_type_id.
- PERCENTILE_APPROX(): approximates the 95th percentile of response times within each group.
- Group By: groups the results by manufacturer, so the percentile is computed per manufacturer.
These queries will give you the 95th percentile of response times for each device manufacturer in both MySQL and Spark SQL.
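What PERCENTILE_APPROX approximates is, roughly, the nearest-rank percentile: the smallest value such that at least 95% of the values are less than or equal to it. The short sketch below computes that exact definition per group in plain Python on made-up sample rows; the approximate Spark result can differ slightly on large data, by design.

```python
import math
from collections import defaultdict

# Made-up sample rows (response_time_ms, manufacturer), standing in for the
# joined response_times/device_types tables.
rows = [(i * 10, 'Acme') for i in range(1, 21)] + \
       [(i * 5, 'Globex') for i in range(1, 21)]

def percentile_nearest_rank(values, p):
    # Nearest-rank percentile: the smallest value with at least
    # ceil(p * N) of the N values less than or equal to it.
    ordered = sorted(values)
    k = math.ceil(p * len(ordered))
    return ordered[k - 1]

by_manufacturer = defaultdict(list)
for ms, manufacturer in rows:
    by_manufacturer[manufacturer].append(ms)

result = {m: percentile_nearest_rank(v, 0.95)
          for m, v in sorted(by_manufacturer.items())}
print(result)  # prints {'Acme': 190, 'Globex': 95}
```

On this data the exact result matches the PERCENT_RANK()-based MySQL query, which is a useful cross-check between the two approaches.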
Given a table ‘daily_visits’ with columns (visit_date, visit_count), write a query to calculate the 7-day moving average of daily visits for each date.
To calculate the 7-day moving average of daily visits for each date, you can use window functions in both MySQL and Spark SQL. Here are the queries for each:
MySQL
In MySQL (8.0 and later), you can use the AVG() window function with a frame specification to calculate the 7-day moving average:
SELECT
    visit_date,
    visit_count,
    AVG(visit_count) OVER (
        ORDER BY visit_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_average_7_days
FROM daily_visits
ORDER BY visit_date;
Explanation:
- SELECT Clause:
- visit_date: the visit date.
- visit_count: the raw visit count for that date.
- AVG(visit_count) OVER (...): the average visit count over the specified window frame.
- Window Frame:
- ORDER BY visit_date: orders the rows by visit_date.
- ROWS BETWEEN 6 PRECEDING AND CURRENT ROW: includes the current row and the six preceding rows, a total of 7 days.
- ORDER BY Clause: orders the result set by visit_date.
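The frame specification above is standard SQL, so it can be verified against any engine with window functions. This sketch runs the same query on an in-memory SQLite database with ten made-up consecutive days of visit counts 10, 20, ..., 100:

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_visits (visit_date TEXT, visit_count INTEGER)")
# Made-up data: ten consecutive days with counts 10, 20, ..., 100.
start = date(2024, 1, 1)
conn.executemany(
    "INSERT INTO daily_visits VALUES (?, ?)",
    [((start + timedelta(days=i)).isoformat(), (i + 1) * 10) for i in range(10)],
)

query = """
SELECT
    visit_date,
    visit_count,
    AVG(visit_count) OVER (
        ORDER BY visit_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_average_7_days
FROM daily_visits
ORDER BY visit_date
"""
for row in conn.execute(query):
    print(row)
# The first date averages only itself (10.0); by the seventh date the
# frame holds a full week, e.g. (10+20+...+70)/7 = 40.0.
```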
Spark SQL
In Spark SQL, the approach is similar, using the AVG()
window function with a frame specification:
SELECT
    visit_date,
    visit_count,
    AVG(visit_count) OVER (
        ORDER BY visit_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_average_7_days
FROM daily_visits
ORDER BY visit_date;
Explanation:
The Spark SQL query is identical to the MySQL one: the same window function and frame specification work in both dialects. Two caveats apply to both: for the first six dates the frame contains fewer than seven rows, so the average covers only the days available so far; and because ROWS counts rows rather than calendar days, the query assumes daily_visits has exactly one row per date with no gaps (a RANGE-based frame, or filling in missing dates, would be needed otherwise).