Fetching Random Rows from a Table

The Problem

Metrics

Case: Consecutive AUTO_INCREMENT without gaps, 1 row returned

Case: Consecutive AUTO_INCREMENT without gaps, 10 rows

Case: AUTO_INCREMENT with gaps, 1 or more rows returned

Case: Extra FLOAT column for randomizing

Case: UUID or MD5 column

Postlog

Brought to you by Rick James

The Problem

(This is Version 2 of my RANDOM blog; the previous one had too many slow algorithms.)

In MySQL/MariaDB, one would like to do "SELECT ... ORDER BY RAND() LIMIT 10" to get 10 rows at random. But this is slow. The optimizer does
    ⚈  Fetch all the rows -- this is costly
    ⚈  Append RAND() to the rows
    ⚈  Sort the rows -- also costly
    ⚈  Pick the first 10.

All the algorithms given below are "fast", but most introduce flaws:
    ⚈  Bias -- some rows are more like to be fetched than others.
    ⚈  Repetitions -- If two random sets contain the same row, they are likely to contain other dups.
    ⚈  Sometimes failing to fetch the desired number of rows.

"Fast" means avoiding reading all the rows. There are many techniques that require a full table scan, or at least an index scan. They are not acceptable for this list. There is even a technique that averages half a scan; it is relegated to a footnote.

Metrics

Here's a way to measure performance without having a big table.

    FLUSH STATUS;
    SELECT ...;
    SHOW SESSION STATUS LIKE 'Handler%';

If some of the "Handler" numbers look like the number of rows in the table, then there was a table scan.

None of the queries presented here need a full table (or index) scan. Each has a time proportional to the number of rows returned.

Virtually all published algorithms involve a table scan. The previously published version of this blog had, embarassingly, several algorithms that had table scans.

Sometimes the scan can be avoided via a subquery. For example, the first of these will do a table scan; the second will not.

    SELECT *  FROM RandTest AS a
        WHERE id = FLOOR(@min + (@max - @min + 1) * RAND());  -- BAD: table scan
    SELECT *
        FROM RandTest AS a
        JOIN (
            SELECT FLOOR(@min + (@max - @min + 1) * RAND()) AS id -- Good; single eval.
             ) b  USING (id);

Case: Consecutive AUTO_INCREMENT without gaps, 1 row returned

⚈ Requirement: AUTO_INCREMENT id
⚈ Requirement: No gaps in id

    SELECT r.*
        FROM (
            SELECT FLOOR(mm.min_id + (mm.max_id - mm.min_id + 1) * RAND()) AS id
                FROM (
                    SELECT MIN(id) AS min_id,
                           MAX(id) AS max_id
                        FROM RandTest
                     ) AS mm
             ) AS init
        JOIN  RandTest AS r  ON r.id = init.id;

(Of course, you might be able to simplify this. For example, min_id is likely to be 1. Or precalculate limits into @min and @max.)

Case: Consecutive AUTO_INCREMENT without gaps, 10 rows

    ⚈  Requirement: AUTO_INCREMENT id
    ⚈  Requirement: No gaps in id
    ⚈  Flaw: Sometimes delivers fewer than 10 rows

    -- First select is one-time:
    SELECT @min := MIN(id),
           @max := MAX(id)
        FROM RandTest;
    SELECT DISTINCT *
        FROM RandTest AS a
        JOIN (
            SELECT FLOOR(@min + (@max - @min + 1) * RAND()) AS id
                FROM RandTest
                LIMIT 11    -- more than 10 (to compensate for dups)
             ) b  USING (id)
        LIMIT 10;           -- the desired number of rows

The FLOOR expression could lead to duplicates, hence the inflated inner LIMIT. There could (rarely) be so many duplicates that the inflated LIMIT leads to fewer than the desired 10 different rows. One approach to that Flaw is to rerun the query if it delivers too few rows.

A variant:

    SELECT r.*
        FROM (
            SELECT FLOOR(mm.min_id + (mm.max_id - mm.min_id + 1) * RAND()) AS id
                FROM (
                    SELECT MIN(id) AS min_id,
                           MAX(id) AS max_id
                        FROM RandTest
                     ) AS mm
                JOIN ( SELECT id dummy FROM RandTest LIMIT 11 ) z
             ) AS init
        JOIN  RandTest AS r  ON r.id = init.id
        LIMIT 10;

Again, ugly but fast, regardless of table size.

Case: AUTO_INCREMENT with gaps, 1 or more rows returned

    ⚈  Requirement: AUTO_INCREMENT, possibly with gaps due to DELETEs, etc
    ⚈  Flaw: Only semi-random (rows do not have an equal chance of being picked), but it does partially compensate for the gaps
    ⚈  Flaw: The first and last few rows of the table are less likely to be delivered.

This gets 50 "consecutive" ids (possibly with gaps), then delivers a random 10 of them.

    -- First select is one-time:
    SELECT @min := MIN(id),
           @max := MAX(id)
        FROM RandTest;
    SELECT a.*
        FROM RandTest a
        JOIN ( SELECT id FROM
                ( SELECT id
                    FROM ( SELECT @min + (@max - @min + 1 - 50) * RAND() AS start FROM DUAL ) AS init
                    JOIN RandTest y
                    WHERE    y.id > init.start
                    ORDER BY y.id
                    LIMIT 50           -- Inflated to deal with gaps
                ) z ORDER BY RAND()
               LIMIT 10                -- number of rows desired (change to 1 if looking for a single row)
             ) r ON a.id = r.id;

Yes, it is complex, but yes, it is fast, regardles of the table size.

Case: Extra FLOAT column for randomizing

(Unfinished: need to check these.)

Assuming rnd is a FLOAT (or DOUBLE) populated with RAND() and INDEXed:

    ⚈  Requirement: extra, indexed, FLOAT column
    ⚈  Flaw: Fetches 10 adjacent rows (according to rnd), hence not good randomness
    ⚈  Flaw: Near 'end' of table, can't find 10 rows.

    SELECT r.*
        FROM ( SELECT RAND() AS start FROM DUAL ) init
        JOIN RandTest r
        WHERE r.rnd >= init.start
        ORDER BY r.rnd
        LIMIT 10;

⚈ These two variants attempt to resolve the end-of-table flaw:

    SELECT r.*
        FROM ( SELECT RAND() * ( SELECT rnd
                          FROM RandTest
                          ORDER BY rnd DESC
                          LIMIT 10,1 ) AS start
             ) AS init
        JOIN RandTest r
        WHERE r.rnd > init.start
        ORDER BY r.rnd
        LIMIT 10;

    SELECT @start := RAND(),
           @cutoff := CAST(1.1 * 10 + 5 AS DECIMAL(20,8)) / TABLE_ROWS
        FROM information_schema.TABLES
        WHERE TABLE_SCHEMA = 'dbname'
          AND TABLE_NAME = 'RandTest'; -- 0.0030
    SELECT d.*
        FROM (
            SELECT a.id
                FROM RandTest a
                WHERE rnd BETWEEN @start AND @start + @cutoff
             ) sample
        JOIN RandTest d USING (id)
        ORDER BY rand()
        LIMIT 10;

Case: UUID or MD5 column

    ⚈  Requirement: UUID/GUID/MD5/SHA1 column exists and is indexed.
    ⚈  Similar code/benefits/flaws to AUTO_INCREMENT with gaps.
    ⚈  Needs 7 random HEX digits:

    RIGHT( HEX( (1<<24) * (1+RAND()) ), 6)

can be used as a start for adapting a gapped AUTO_INCREMENT case. If the field is BINARY instead of hex, then UNHEX(RIGHT( HEX( (1<<24) * (1+RAND()) ), 6))

Postlog

Original writing -- July, 2012; not adequate

Another possible solution: Drew's complex solution

Not adequately fast: Another Approach
And another
Excellent for randomness
Mixing random and pagination
Many more (mostly slow

Rewrite -- July, 2015

-- Rick James

MySQL Documents by Rick James

HowTo Techniques for Optimizing Tough Tasks:

Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)

Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Chunking lengthy DELETE/UPDATE/etc.

Data Warehouse techniques:
Data Warehouse Overview Summary Tables High speed ingestion Bulk Normalization

Schema and code design for large Sensor database

Entity-Attribute-Value (EAV) -- a common, poorly performing, design pattern; plus an alternative

5 methods for 'Find Nearest'

Lat/Lng search to Find the nearest 10 pizza parlors
Lat/Long representation choices

Z-Order 'find nearest'

Pagination, not with OFFSET, LIMIT

Techniques on efficiently finding a random row (On beyond ORDER BY RAND())

GUID/UUID Performance (type 1 only)

IP Range Table Performance -- or other disjoint ranges

Rollup Unique User Counts

Alter of a Huge table -- Mostly obviated by 5.6

Efficient List of Latest 10 news articles

Build and execute a Pivot SELECT (showing rows as columns)

(Groupwise Max): Efficiently find largest row(s) for each group

Other Tips, Tuning, Debugging, Optimizations, etc...

Rick's RoTs (Rules of Thumb -- lots of tips)

Datatypes and building a good schema

Memory Allocation (caching, etc)

Character Set and Collation problem solver
Trouble with UTF-8 If you want case folding, but accent sensitivity, please file a request at https://bugs.mysql.com .
Python tips, PHP tips, other language tips
utf8 Collations utf8mb4 Collations on 8.0

Converting from MyISAM to InnoDB -- includes differences between them

Compound INDEXes plus other insights into the mysteries of INDEXing

Cookbook for Creating Indexes
Many-to-many mapping table Handler counts wp_postmeta UNION+OFFSET

MySQL Limits -- built-in hard limits
767-byte INDEX limit

Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)

5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users

Analyze MySQL Performance
Analyze VARIABLEs and GLOBAL STATUS Analyze SlowLog

My slides from conferences
MiniFest 2021 - Rick James & Daniel Black - Answering on Stack Overflow(+comments) - MariaDB Frontlines
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB

Contact me via LinkedIn; be sure to include a brief teaser in the Invite request:

Did my articles help you out? Like what you see? Consider donating:

☕️ Buy me a Banana Latte and bagel ($10) There is no obligation but it would put a utf8mb4 smiley 🙂 on my face, instead of the Mojibake "ðŸ™‚"

Fetching Random Rows from a Table

Table of Contents

The Problem

Metrics

Case: Consecutive AUTO_INCREMENT without gaps, 1 row returned

Case: Consecutive AUTO_INCREMENT without gaps, 10 rows

Case: AUTO_INCREMENT with gaps, 1 or more rows returned

Case: Extra FLOAT column for randomizing

Case: UUID or MD5 column

Postlog

MySQL Documents by Rick James

Did my articles help you out? Like what you see? Consider donating: