Groupwise Max in MySQL

Table of Contents

The Problem

Sample Data

Duplicate Max

Using an Uncorrelated Subquery

Using "windowing functions"

Using @variables

LATERAL

The Duds

Top-N in each Group

Top-N in each Group, take II

Top-N in each Group, using Windowing functions

Top-N using MyISAM

Baron's trick

This works

Postlog

Brought to you by Rick James

The Problem


You want to find the largest row in each group of rows. An example is looking for the largest city in each state. While it is easy to find the MAX(population) ... GROUP BY state, it is hard to find the name of the city associated with that population. Alas, MySQL does not have any syntax to provide the solution directly.

This blog presents multiple "good" solutions. They differ in ways that make none of them 'perfect'; you should try each and weigh the pros and cons.

Also, a few "bad" solutions will be presented, together with why they were rejected.

MySQL manual
gives 3 solutions; only the "Uncorrelated" one is "good"; the other two are "bad". The tip-off that a solution will have terrible performance is the inequality test s1.price < s2.price.

Sample Data


To show how the various coding attempts work, I have devised this simple task: Find the largest city in each Canadian province. Here's a sample of the source data (5493 rows):
+------------------+----------------+------------+
| province         | city           | population |
+------------------+----------------+------------+
| Saskatchewan     | Rosetown       |       2309 |
| British Columbia | Chilliwack     |      51942 |
| Nova Scotia      | Yarmouth       |       7500 |
| Alberta          | Grande Prairie |      41463 |
| Quebec           | Sorel          |      33591 |
| Ontario          | Moose Factory  |       2060 |
| Ontario          | Bracebridge    |       8238 |
| British Columbia | Nanaimo        |      84906 |
| Manitoba         | Neepawa        |       3151 |
| Alberta          | Grimshaw       |       2560 |
| Saskatchewan     | Carnduff       |        950 |
...

Here's the desired output (13 rows):
+---------------------------+---------------+------------+
| province                  | city          | population |
+---------------------------+---------------+------------+
| Alberta                   | Calgary       |     968475 |
| British Columbia          | Vancouver     |    1837970 |
| Manitoba                  | Winnipeg      |     632069 |
| New Brunswick             | Saint John    |      87857 |
| Newfoundland and Labrador | Corner Brook  |      18693 |
| Northwest Territories     | Yellowknife   |      15866 |
| Nova Scotia               | Halifax       |     266012 |
| Nunavut                   | Iqaluit       |       6124 |
| Ontario                   | Toronto       |    4612187 |
| Prince Edward Island      | Charlottetown |      42403 |
| Quebec                    | Montreal      |    3268513 |
| Saskatchewan              | Saskatoon     |     198957 |
| Yukon                     | Whitehorse    |      19616 |
+---------------------------+---------------+------------+

Duplicate Max


One thing to consider is whether you want -- or do not want -- to see multiple rows for tied winners. For the dataset being used here, that would imply that the two largest cities in a province had identical populations. For this case, a dup would be unlikely. But there are many groupwise-max use cases where dups are likely.

The two best algorithms differ in whether they show dups.

Using an Uncorrelated Subquery


Characteristics:
    ⚈  Superior or medium performance (depending on index and version; see below)
    ⚈  It will show duplicates
    ⚈  Needs an extra index
    ⚈  Probably requires 5.6
    ⚈  If all goes well, it will run in O(M) where M is the number of output rows.

An 'uncorrelated subquery':
SELECT  c1.province, c1.city, c1.population
    FROM  Canada AS c1
    JOIN
      ( SELECT  province, MAX(population) AS population
            FROM  Canada
            GROUP BY  province
      ) AS c2 USING (province, population)
    ORDER BY c1.province;
But this also 'requires' an extra index: INDEX(province, population). In addition, MySQL has not always been able to use that index effectively, hence the "requires 5.6". (I am not sure of the actual version.)
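For the sample table, that index could be added like this (adjust table and column names for your schema):
ALTER TABLE Canada ADD INDEX(province, population);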

Without that extra index, you would need 5.6, which has the ability to create indexes for subqueries. This is indicated by <auto_key0> in the EXPLAIN. Even so, the performance is worse with the auto-generated index than with the manually generated one.

With neither the extra index nor 5.6, this 'solution' would belong in 'The Duds' because it would run in O(N*N) time.

Caveat: Some of the solutions below have problems with MariaDB 5.5, and with MySQL 5.7 and later.

Using "windowing functions"


The OVER clause (windowing functions) is available only in MySQL 8.0 and MariaDB 10.2 (or later).

Characteristics:
    ⚈  Medium performance
    ⚈  It will not show duplicates
    ⚈  Needs an extra INDEX(province, population)
    ⚈  Requires MariaDB 10.2+ or MySQL 8.0+
    ⚈  Somewhat slower than O(N) where N is the number of input rows.
WITH rank_added AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY province ORDER BY population DESC) AS rn
        FROM canada
)
SELECT province, city, population
    FROM rank_added
    WHERE rn = 1;

Using @variables


Characteristics:
    ⚈  Good performance
    ⚈  Does not show duplicates (picks one to show)
    ⚈  Consistent O(N) run time (N = number of input rows)
    ⚈  Only one scan of the data
    ⚈  Assigning @variables within expressions is deprecated (8.0.13), so don't use this for new code
SELECT
        province, city, population   -- The desired columns
    FROM
      ( SELECT  @prev := '' ) init
    JOIN
      ( SELECT  province != @prev AS first,  -- `province` is the 'GROUP BY'
                @prev := province,           -- The 'GROUP BY'
                province, city, population   -- Also the desired columns
            FROM  Canada           -- The table
            ORDER BY
                province,          -- The 'GROUP BY'
                population DESC    -- ASC for MIN(population), DESC for MAX
            LIMIT 999999           -- kludge to keep the ORDER BY from being ignored
      ) x
    WHERE  first
    ORDER BY  province;     -- Whatever you like
For your application, change the lines with comments.

LATERAL


TBD (8.0.14)
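Meanwhile, here is an untested sketch of how a LATERAL derived table (added in MySQL 8.0.14) might be applied to this task; treat it as a guess, not a benchmarked recipe:
SELECT  p.province, x.city, x.population
    FROM  ( SELECT DISTINCT province FROM Canada ) AS p,
    LATERAL
      ( SELECT  city, population
            FROM  Canada
            WHERE  province = p.province     -- one probe per province
            ORDER BY  population DESC
            LIMIT  1
      ) AS x;       -- INDEX(province, population) should help here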

The Duds


*** 'correlated subquery' (from MySQL doc):
SELECT  province, city, population
    FROM  Canada AS c1
    WHERE  population =
      ( SELECT  MAX(c2.population)
            FROM  Canada AS c2
            WHERE  c2.province= c1.province
      )
    ORDER BY  province;
O(N*N) (that is, terrible) performance

*** LEFT JOIN (from MySQL doc):
SELECT  c1.province, c1.city, c1.population
    FROM  Canada AS c1
    LEFT JOIN  Canada AS c2 ON c2.province = c1.province
      AND  c2.population > c1.population
    WHERE  c2.province IS NULL
    ORDER BY province;
Medium performance (2N-3N, depending on join_buffer_size).

As for O(N*N) time: a groupwise-max on a few thousand rows takes about one second; a million rows could take hours.

Top-N in each Group


This is a variant on "groupwise-max" wherein you desire the largest (or smallest) N items in each group. Do these substitutions for your use case:
    ⚈  province --> your 'GROUP BY'
    ⚈  Canada --> your table
    ⚈  3 --> how many of each group to show
    ⚈  population --> your numeric field for determining "Top-N"
    ⚈  city --> more field(s) you want to show
    ⚈  Change the SELECT and ORDER BY if you desire
    ⚈  DESC to get the 'largest'; ASC for the 'smallest'

A different approach, using @variables (its run time is discussed below):
SELECT
        province, n, city, population
    FROM
      ( SELECT  @prev := '', @n := 0 ) init
    JOIN
      ( SELECT  @n := if(province != @prev, 1, @n + 1) AS n,
                @prev := province,
                province, city, population
            FROM  Canada
            ORDER BY
                province   ASC,
                population DESC
            LIMIT 999999           -- kludge to keep the ORDER BY from being ignored
      ) x
    WHERE  n <= 3
    ORDER BY  province, n;
Output:
+---------------------------+------+------------------+------------+
| province                  | n    | city             | population |
+---------------------------+------+------------------+------------+
| Alberta                   |    1 | Calgary          |     968475 |
| Alberta                   |    2 | Edmonton         |     822319 |
| Alberta                   |    3 | Red Deer         |      73595 |
| British Columbia          |    1 | Vancouver        |    1837970 |
| British Columbia          |    2 | Victoria         |     289625 |
| British Columbia          |    3 | Abbotsford       |     151685 |
| Manitoba                  |    1 | ...

The performance of this is O(N), actually about 3N, where N is the number of source rows.

EXPLAIN EXTENDED gives
+----+-------------+------------+--------+---------------+------+---------+------+------+----------+----------------+
| id | select_type | table      | type   | possible_keys | key  | key_len | ref  | rows | filtered | Extra          |
+----+-------------+------------+--------+---------------+------+---------+------+------+----------+----------------+
|  1 | PRIMARY     | <derived2> | system | NULL          | NULL | NULL    | NULL |    1 |   100.00 | Using filesort |
|  1 | PRIMARY     | <derived3> | ALL    | NULL          | NULL | NULL    | NULL | 5484 |   100.00 | Using where    |
|  3 | DERIVED     | Canada     | ALL    | NULL          | NULL | NULL    | NULL | 5484 |   100.00 | Using filesort |
|  2 | DERIVED     | NULL       | NULL   | NULL          | NULL | NULL    | NULL | NULL |     NULL | No tables used |
+----+-------------+------------+--------+---------------+------+---------+------+------+----------+----------------+
Explanation, shown in the same order as the EXPLAIN, but numbered chronologically:
    3.  Get the subquery id=2 (init)
    4.  Scan the output from subquery id=3 (x)
    2.  Subquery id=3 -- the table scan of Canada
    1.  Subquery id=2 -- init for simply initializing the two @variables
Yes, it took two sorts, though probably in RAM.

Main Handler values:
| Handler_read_rnd           | 39    |
| Handler_read_rnd_next      | 10971 |
| Handler_write              | 5485  |  -- #rows in Canada (+1)
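Counts like these can be captured by resetting the session counters, running the query, then reading the counters back; for example:
FLUSH STATUS;                              -- zero the session's Handler counters
-- run the query being measured here
SHOW SESSION STATUS LIKE 'Handler%';       -- read the counters back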

Top-N in each Group, take II


This variant is faster than the previous, but...
    ⚈  depends on city being unique across the dataset
    ⚈  is limited to what GROUP_CONCAT() can handle (cf group_concat_max_len; see the note just below)
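If the concatenated lists might exceed the default group_concat_max_len of 1024 bytes, raise it before running the query (session scope shown; pick a size large enough for your data):
SET SESSION group_concat_max_len = 1000000;   -- example value; adjust to fit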

(from openark.org)
    SELECT  province, city, population
        FROM  Canada
        JOIN
          ( SELECT  GROUP_CONCAT(top_in_province) AS top_cities
                FROM
                  ( SELECT  SUBSTRING_INDEX(
                                   GROUP_CONCAT(city ORDER BY  population DESC),
                            ',', 3) AS top_in_province
                        FROM  Canada
                        GROUP BY  province
                  ) AS x
          ) AS y
        WHERE  FIND_IN_SET(city, top_cities)
        ORDER BY  province, population DESC;
Output. Note how there can be more than 3 cities per province. This is because "city" is not unique across the data set.
| Alberta                   | Calgary          |     968475 |
| Alberta                   | Edmonton         |     822319 |
| Alberta                   | Red Deer         |      73595 |
| British Columbia          | Vancouver        |    1837970 |
| British Columbia          | Victoria         |     289625 |
| British Columbia          | Abbotsford       |     151685 |
| British Columbia          | Sydney           |          0 | -- Nova Scotia's second largest is Sydney
| Manitoba                  | Winnipeg         |     632069 |

Main Handler values:
| Handler_read_next          | 5484  | -- table size
| Handler_read_rnd_next      | 5500  | -- table size + number of provinces
| Handler_write              | 14    | -- number of provinces (+1)

Top-N in each Group, using Windowing functions


MariaDB 10.2 and MySQL 8.0 added "windowing functions", which make "groupwise max" more straightforward.
SELECT *
    FROM (
        SELECT  province,
                ROW_NUMBER() OVER(PARTITION BY province ORDER BY population DESC) AS n,
                city, population
            FROM canada
         ) x
    WHERE n <= 3;

Notes:

    ⚈  The output is 'correct' (see first Top-N output)
    ⚈  This approach is faster than the previous Top-N techniques.
    ⚈  MariaDB hits every row in the original table about 7 times.
    ⚈  MySQL reads the entire table, writes that much to a temp table, then reads it back.

MariaDB's Windowing functions in 10.2+
MySQL 8.0 Windowing functions

Top-N using MyISAM


(This does not require your table to be MyISAM, but it does use a MyISAM tmp table, because MyISAM lets an AUTO_INCREMENT column be the second column of a 2-column PRIMARY KEY, which numbers the rows within each province.) See the previous section for what changes to make for your use case.
    -- build tmp table to get numbering
    -- (Assumes auto_increment_increment = 1)
    CREATE TEMPORARY TABLE t (
        nth MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
        PRIMARY KEY(province, nth)
    ) ENGINE=MyISAM
        SELECT province, NULL AS nth, city, population
            FROM Canada
            ORDER BY population DESC;
    -- Output the biggest 3 cities in each province:
    SELECT province, nth, city, population
        FROM t
        WHERE nth <= 3
        ORDER BY province, nth;

+---------------------------+-----+------------------+------------+
| province                  | nth | city             | population |
+---------------------------+-----+------------------+------------+
| Alberta                   |   1 | Calgary          |     968475 |
| Alberta                   |   2 | Edmonton         |     822319 |
| Alberta                   |   3 | Red Deer         |      73595 |
| British Columbia          |   1 | Vancouver        |    1837970 |
| British Columbia          |   2 | Victoria         |     289625 |
| British Columbia          |   3 | Abbotsford       |     151685 |
| Manitoba                  |  ...

EXPLAIN for the CREATE ... SELECT:
+----+-------------+--------+------+---------------+------+---------+------+------+----------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows | Extra          |
+----+-------------+--------+------+---------------+------+---------+------+------+----------------+
|  1 | SIMPLE      | Canada | ALL  | NULL          | NULL | NULL    | NULL | 5484 | Using filesort |
+----+-------------+--------+------+---------------+------+---------+------+------+----------------+
EXPLAIN for the other SELECT:
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | t     | index | NULL          | PRIMARY | 104     | NULL |   22 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
The main handler values (total of all operations):
| Handler_read_rnd_next      | 10970 |
| Handler_write              | 5484  |  -- number of rows in Canada (write tmp table)
Both "Top-n" forumulations probably take about the same amount of time.

Baron's trick


How to select the first/least/max row per group in SQL
looks promising enough for me to mention it here. When I get a chance I will compare it to the other suggestions.

Tentative Characteristics:
    ⚈  Good performance
    ⚈  Duplicates?
    ⚈  Needs an extra index
    ⚈  Consistent O(N) run time (N = number of input rows)
    ⚈  Only one scan of the data
    ⚈  Works for top 1 or top-N.

(That blog has been around for a decade; I am just now noticing it.)

This works


This works in all versions.

Limitations:

    ⚈  Requires INDEX(province, population)
    ⚈  Assumes only one city has the MAX(population) in a province (else it will display all the tied rows; see the sketch below)
    ⚈  Can do only "top-1", not "top-N"
SELECT c.*
    FROM ( SELECT province, MAX(population) AS population
               FROM canada
               GROUP BY province ) x
    JOIN canada c USING(province, population);
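If ties are possible and you want exactly one row per province, one add-on (not part of the original recipe) is to collapse them, for example by keeping the alphabetically first city:
SELECT  c.province, MIN(c.city) AS city, c.population
    FROM ( SELECT  province, MAX(population) AS population
               FROM  canada
               GROUP BY  province ) x
    JOIN  canada c USING(province, population)
    GROUP BY  c.province, c.population;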

Postlog


Developed and first posted: Feb, 2015;   MyISAM approach added: July, 2015;   Openark's method added: Apr, 2016;   Windowing: Jan, 2020

This has some of these algorithms, plus some others:
Peter Brawley's Within-group aggregates
Peter Brawley's Within-group quotas (Top N per group)
See also: Jan Kneschke's blog from 2007
StackOverflow discussion of 'Uncorrelated'

Other references: Latest id vs latest date
Inner ORDER BY thrown away
Adding a large LIMIT to a subquery may make things work. Why ORDER BY in subquery is ignored
StackOverflow thread
row_number(), rank(), dense_rank()
Percentile blog
top 2; Using OVER
Experimenting with these algorithms
Loose index scan
See 2nd
Old Top-1, but updated with windowing
Using Row Constructor?
-- Rick James

MySQL Documents by Rick James

HowTo Techniques for Optimizing Tough Tasks:

Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)

Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
    Chunking lengthy DELETE/UPDATE/etc.

Data Warehouse techniques:
    Data Warehouse Overview   Summary Tables   High speed ingestion   Bulk Normalization  

Schema and code design for large Sensor database

Entity-Attribute-Value (EAV) -- a common, poorly performing, design pattern; plus an alternative

5 methods for 'Find Nearest'

Lat/Lng search to Find the nearest 10 pizza parlors
    Lat/Long representation choices

Z-Order 'find nearest'

Pagination, not with OFFSET, LIMIT

Techniques on efficiently finding a random row (On beyond ORDER BY RAND())

GUID/UUID Performance (type 1 only)

IP Range Table Performance -- or other disjoint ranges

Rollup Unique User Counts

Alter of a Huge table -- Mostly obviated by 5.6

Efficient List of Latest 10 news articles

Build and execute a Pivot SELECT (showing rows as columns)

(Groupwise Max): Efficiently find largest row(s) for each group

Other Tips, Tuning, Debugging, Optimizations, etc...

Rick's RoTs (Rules of Thumb -- lots of tips)

Datatypes and building a good schema

Memory Allocation (caching, etc)

Character Set and Collation problem solver
    Trouble with UTF-8   If you want case folding, but accent sensitivity, please file a request at https://bugs.mysql.com .
    Python tips,   PHP tips,   other language tips
    utf8 Collations   utf8mb4 Collations on 8.0

Converting from MyISAM to InnoDB -- includes differences between them

Compound INDEXes plus other insights into the mysteries of INDEXing

Cookbook for Creating Indexes
    Many-to-many mapping table   Handler counts   wp_postmeta   UNION+OFFSET

MySQL Limits -- built-in hard limits
    767-byte INDEX limit

Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)

5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users

Analyze MySQL Performance
    Analyze VARIABLEs and GLOBAL STATUS     Analyze SlowLog

My slides from conferences
MiniFest 2021 - Rick James & Daniel Black - Answering on Stack Overflow(+comments) - MariaDB Frontlines
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB

Contact me via LinkedIn; be sure to include a brief teaser in the Invite request:   View Rick James's profile on LinkedIn

Did my articles help you out? Like what you see? Consider donating:

☕️ Buy me a Banana Latte and bagel ($10) There is no obligation but it would put a utf8mb4 smiley 🙂 on my face, instead of the Mojibake "ðŸ™‚"