Using an Uncorrelated Subquery
The Problem
You want to find the largest row in each group of rows.
An example is looking for the largest city in each state.
While it is easy to find the MAX(population) ... GROUP BY state,
it is hard to find the name of the city associated with that population.
Alas, MySQL does not have any syntax to provide the solution directly.
This blog presents multiple "good" solutions.
They differ in ways that make none of them 'perfect'; you should try each
and weigh the pros and cons.
Also, a few "bad" solutions will be presented, together with why they were rejected.
MySQL manual
gives 3 solutions; only the "Uncorrelated" one is "good", the other two are "bad".
The tip off of having terrible performance is s1.price < s2.price.
Sample Data
To show how the various coding attempts work, I have devised this simple task:
Find the largest city in each Canadian province. Here's a sample of the source data (5493 rows):
+------------------+----------------+------------+
| province | city | population |
+------------------+----------------+------------+
| Saskatchewan | Rosetown | 2309 |
| British Columbia | Chilliwack | 51942 |
| Nova Scotia | Yarmouth | 7500 |
| Alberta | Grande Prairie | 41463 |
| Quebec | Sorel | 33591 |
| Ontario | Moose Factory | 2060 |
| Ontario | Bracebridge | 8238 |
| British Columbia | Nanaimo | 84906 |
| Manitoba | Neepawa | 3151 |
| Alberta | Grimshaw | 2560 |
| Saskatchewan | Carnduff | 950 |
...
Here's the desired output (13 rows):
+---------------------------+---------------+------------+
| province | city | population |
+---------------------------+---------------+------------+
| Alberta | Calgary | 968475 |
| British Columbia | Vancouver | 1837970 |
| Manitoba | Winnipeg | 632069 |
| New Brunswick | Saint John | 87857 |
| Newfoundland and Labrador | Corner Brook | 18693 |
| Northwest Territories | Yellowknife | 15866 |
| Nova Scotia | Halifax | 266012 |
| Nunavut | Iqaluit | 6124 |
| Ontario | Toronto | 4612187 |
| Prince Edward Island | Charlottetown | 42403 |
| Quebec | Montreal | 3268513 |
| Saskatchewan | Saskatoon | 198957 |
| Yukon | Whitehorse | 19616 |
+---------------------------+---------------+------------+
Duplicate Max
One thing to consider is whether you want -- or do not want -- to see multiple
rows for tied winners. For the dataset being used here, that would
imply that the two largest cities in a province had identical populations.
For this case, a dup would be unlikely. But there are many groupwise-max
use cases where dups are likely.
The two best algorithms differ in whether they show dups.
Using an Uncorrelated Subquery
Characteristics:
⚈ Superior performance or medium performance
⚈ It will show duplicates
⚈ Needs an extra index
⚈ Probably requires 5.6
⚈ If all goes well, it will run in O(M) where M is the number of output rows.
An 'uncorrelated subquery':
SELECT c1.province, c1.city, c1.population
FROM Canada AS c1
JOIN
( SELECT province, MAX(population) AS population
FROM Canada
GROUP BY province
) AS c2 USING (province, population)
ORDER BY c1.province;
But this also 'requires' an extra index: INDEX(province, population).
In addition, MySQL has not always been able to use that index effectively,
hence the "requires 5.6". (I am not sure of the actual version.)
Without that extra index, you would need 5.6, which has the
ability to create indexes for subqueries.
This is indicated by <auto_key0> in the EXPLAIN.
Even so, the performance is worse with the auto-generated index
than with the manually generated one.
With neither the extra index, nor 5.6, this 'solution' would belong in 'The Duds'
becuase it would run in O(N*N) time.
Caveat: Some of the solutions below have problems with versions MariaDB 5.5 and MySQL 5.7 and later.
Using "windowing functions"
These OVER feature is available only in MySQL 8.0 and MariaDB 10.2 (or later).
Characteristics:
⚈ Medium performance
⚈ It will not show duplicates
⚈ Needs an extra index(province, population)
⚈ Requires MariaDB 10.2+ or MySQL 8.0+
⚈ Somewhat slower than O(N) where N is the number of input rows.
WITH rank_added AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY province ORDER BY population DESC) AS rn
FROM canada
)
SELECT province, city, population FROM rank_added
WHERE rn = 1;
Using @variables
Characteristics:
⚈ Good performance
⚈ Does not show duplicates (picks one to show)
⚈ Consistent O(N) run time (N = number of input rows)
⚈ Only one scan of the data
⚈ Assigning @variables is being deprecated (8.0.13), so don't use this
SELECT
province, city, population -- The desired columns
FROM
( SELECT @prev := '' ) init
JOIN
( SELECT province != @prev AS first, -- `province` is the 'GROUP BY'
@prev := province, -- The 'GROUP BY'
province, city, population -- Also the desired columns
FROM Canada -- The table
ORDER BY
province, -- The 'GROUP BY'
population DESC -- ASC for MIN(population), DESC for MAX
LIMIT 999999 -- kludge to keep the ORDER BY from being ignored
) x
WHERE first
ORDER BY province; -- Whatever you like
For your application, change the lines with comments.
LATERAL
TBD (8.0.14)
The Duds
*** 'correlated subquery' (from MySQL doc):
SELECT province, city, population
FROM Canada AS c1
WHERE population =
( SELECT MAX(c2.population)
FROM Canada AS c2
WHERE c2.province= c1.province
)
ORDER BY province;
O(N*N) (that is, terrible) performance
*** LEFT JOIN (from MySQL doc):
SELECT c1.province, c1.city, c1.population
FROM Canada AS c1
LEFT JOIN Canada AS c2 ON c2.province = c1.province
AND c2.population > c1.population
WHERE c2.province IS NULL
ORDER BY province;
Medium performance (2N-3N, depending on join_buffer_size).
For O(N*N) time,...
It will take one second to do a groupwise-max
on a few thousand rows; a million rows could take hours.
Top-N in each Group
This is a variant on "groupwise-max" wherein you desire the largest (or smallest) N items in each group.
Do these substitutions for your use case:
⚈ province --> your 'GROUP BY'
⚈ Canada --> your table
⚈ 3 --> how many of each group to show
⚈ population --> your numeric field for determining "Top-N"
⚈ city --> more field(s) you want to show
⚈ Change the SELECT and ORDER BY if you desire
⚈ DESC to get the 'largest'; ASC for the 'smallest'
A different O(N*N)
SELECT province, n, city, population FROM ( SELECT @prev := '', @n := 0 ) init JOIN ( SELECT @n := if(province != @prev, 1, @n + 1) AS n, @prev := province, province, city, population FROM Canada ORDER BY province ASC, population DESC LIMIT 999999 -- kludge to keep the ORDER BY from being ignored ) x WHERE n <= 3 ORDER BY province, n;Output:
+---------------------------+------+------------------+------------+ | province | n | city | population | +---------------------------+------+------------------+------------+ | Alberta | 1 | Calgary | 968475 | | Alberta | 2 | Edmonton | 822319 | | Alberta | 3 | Red Deer | 73595 | | British Columbia | 1 | Vancouver | 1837970 | | British Columbia | 2 | Victoria | 289625 | | British Columbia | 3 | Abbotsford | 151685 | | Manitoba | 1 | ...
+----+-------------+------------+--------+---------------+------+---------+------+------+----------+----------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+------------+--------+---------------+------+---------+------+------+----------+----------------+ | 1 | PRIMARY | <derived2> | system | NULL | NULL | NULL | NULL | 1 | 100.00 | Using filesort | | 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 5484 | 100.00 | Using where | | 3 | DERIVED | Canada | ALL | NULL | NULL | NULL | NULL | 5484 | 100.00 | Using filesort | | 2 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used | +----+-------------+------------+--------+---------------+------+---------+------+------+----------+----------------+Explanation, shown in the same order as the EXPLAIN, but numbered chronologically:
| Handler_read_rnd | 39 | | Handler_read_rnd_next | 10971 | | Handler_write | 5485 | -- #rows in Canada (+1)
SELECT province, city, population FROM Canada JOIN ( SELECT GROUP_CONCAT(top_in_province) AS top_cities FROM ( SELECT SUBSTRING_INDEX( GROUP_CONCAT(city ORDER BY population DESC), ',', 3) AS top_in_province FROM Canada GROUP BY province ) AS x ) AS y WHERE FIND_IN_SET(city, top_cities) ORDER BY province, population DESC;Output. Note how there can be more than 3 cities per province. This is because "city" is not unique across the data set.
| Alberta | Calgary | 968475 | | Alberta | Edmonton | 822319 | | Alberta | Red Deer | 73595 | | British Columbia | Vancouver | 1837970 | | British Columbia | Victoria | 289625 | | British Columbia | Abbotsford | 151685 | | British Columbia | Sydney | 0 | -- Nova Scotia's second largest is Sydney | Manitoba | Winnipeg | 632069 |
| Handler_read_next | 5484 | -- table size | Handler_read_rnd_next | 5500 | -- table size + number of provinces | Handler_write | 14 | -- number of provinces (+1)
SELECT * FROM ( SELECT province, ROW_NUMBER() OVER(PARTITION BY province ORDER BY population DESC) AS n, city, population FROM canada ) x WHERE n <= 3;
-- build tmp table to get numbering -- (Assumes auto_increment_increment = 1) CREATE TEMPORARY TABLE t ( nth MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT, PRIMARY KEY(province, nth) ) ENGINE=MyISAM SELECT province, NULL AS nth, city, population FROM Canada ORDER BY population DESC; -- Output the biggest 3 cities in each province: SELECT province, nth, city, population FROM t WHERE nth <= 3 ORDER BY province, nth; +---------------------------+-----+------------------+------------+ | province | nth | city | population | +---------------------------+-----+------------------+------------+ | Alberta | 1 | Calgary | 968475 | | Alberta | 2 | Edmonton | 822319 | | Alberta | 3 | Red Deer | 73595 | | British Columbia | 1 | Vancouver | 1837970 | | British Columbia | 2 | Victoria | 289625 | | British Columbia | 3 | Abbotsford | 151685 | | Manitoba | ... SELECT for CREATE: +----+-------------+--------+------+---------------+------+---------+------+------+----------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+------+---------------+------+---------+------+------+----------------+ | 1 | SIMPLE | Canada | ALL | NULL | NULL | NULL | NULL | 5484 | Using filesort | +----+-------------+--------+------+---------------+------+---------+------+------+----------------+ Other SELECT: +----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+ | 1 | SIMPLE | t | index | NULL | PRIMARY | 104 | NULL | 22 | Using where | +----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+ The main handler values (total of all operations): | Handler_read_rnd_next | 10970 | | Handler_write | 5484 | -- number of rows in Canada (write tmp table)Both "Top-n" forumulations probably take about the same amount of time.
SELECT c.* FROM ( SELECT province, MAX(population) AS population FROM canada GROUP BY province ) x JOIN canada c USING(province, population);
Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Chunking lengthy DELETE/UPDATE/etc.
Data Warehouse techniques:
Data Warehouse Overview
Summary Tables
High speed ingestion
Bulk Normalization
Schema and code design for large Sensor database
Entity-Attribute-Value (EAV) -- a common, poorly performing, design pattern; plus an alternative
Lat/Lng search to Find the nearest 10 pizza parlors
Lat/Long representation choices
Pagination, not with OFFSET, LIMIT
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
Alter of a Huge table -- Mostly obviated by 5.6
Efficient List of Latest 10 news articles
Build and execute a Pivot SELECT (showing rows as columns)
(Groupwise Max): Efficiently find largest row(s) for each group
Other Tips, Tuning, Debugging, Optimizations, etc...
Rick's RoTs (Rules of Thumb -- lots of tips)
Datatypes and building a good schema
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Trouble with UTF-8
If you want case folding, but accent sensitivity, please file a request at https://bugs.mysql.com .
Python tips,
PHP tips,
other language tips
utf8 Collations
utf8mb4 Collations on 8.0
Converting from MyISAM to InnoDB -- includes differences between them
Compound INDEXes plus other insights into the mysteries of INDEXing
Cookbook for Creating Indexes
Many-to-many mapping table
Handler counts
wp_postmeta
UNION+OFFSET
MySQL Limits -- built-in hard limits
767-byte INDEX limit
Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Analyze MySQL Performance
Analyze VARIABLEs and GLOBAL STATUS
Analyze SlowLog
My slides from conferences
MiniFest 2021 - Rick James & Daniel Black - Answering on Stack Overflow(+comments) - MariaDB Frontlines
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
Contact me via LinkedIn; be sure to include a brief teaser in the Invite request: