Fetching Random Rows from a Table

Table of Contents

CAVEAT
The Problem
Case: Consecutive AUTO_INCREMENT
Case: AUTO_INCREMENT with gaps
Case: Add a FLOAT column
Case: 50/10 trick
Case: 'Digest' column
Case: Rebuild index in background
Case: MySQL 5.6.2
Case: Full scan, but less to process
Case: Are there more cases?
Postlog
Brought to you by Rick James

CAVEAT


SIGNIFICANT FLAWS EXIST IN THIS DOCUMENT; IT WILL SOON BE REPLACED BY New Random



The Problem


One would like to do "SELECT ... ORDER BY RAND() LIMIT 10" to get 10 rows at random. But this is slow. The optimizer does
    ⚈  Fetch all the rows -- this is costly
    ⚈  Append RAND() to the rows
    ⚈  Sort the rows -- also costly
    ⚈  Pick the first 10.

All the algorithms given below are "fast", but most introduce flaws:
    ⚈  Bias -- some rows are more likely to be fetched than others.
    ⚈  Repetitions -- If two random sets contain the same row, they are likely to contain other dups.
    ⚈  Sometimes failing to fetch the desired number of rows.

"Fast" means avoiding reading all the rows.

Case: Consecutive AUTO_INCREMENT

    ⚈  AUTO_INCREMENT id
    ⚈  No gaps in id

Solution (for 1 row):
    SELECT r.*
        FROM (
            SELECT FLOOR(mm.min_id + (mm.max_id - mm.min_id + 1) * RAND()) AS id
                FROM (
                    SELECT MIN(id) AS min_id,
                           MAX(id) AS max_id
                        FROM RandTest
                     ) AS mm
             ) AS init
        JOIN  RandTest AS r  ON r.id = init.id;
(Of course, you might be able to optimize this. For example, the minimum id is likely to be 1.)

Notes:
    ⚈  Very fast for 1 row.
    ⚈  Must fetch multiple rows with multiple queries.
    ⚈  None of the "flaws" occur.

For multiple rows, the following is very efficient, but has a flaw...
    SELECT DISTINCT *
        FROM tbl AS a
        JOIN (
            SELECT FLOOR(@min + (@max - @min + 1) * RAND()) AS id
                FROM tbl
                LIMIT 11    -- more than 10
             ) b  USING (id)
        LIMIT 10;           -- 10
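
The query above assumes @min and @max have already been set; one way to do that (mirroring the 'gaps' case below):
    SELECT @min := MIN(id),
           @max := MAX(id)
        FROM tbl;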

Notes:
    ⚈  Very fast.
    ⚈  The subquery may return duplicate ids. Having a larger LIMIT for it will _usually_ suffice.
    ⚈  None of the other "flaws" occur.

Case: AUTO_INCREMENT with gaps

    ⚈  AUTO_INCREMENT, but there are gaps -- caused by DELETEs, etc

Solution: This gets 50 "consecutive" ids (possibly with gaps), then delivers a random 10 of them.
    SELECT @min := MIN(id),
           @max := MAX(id)
        FROM tbl;
    SELECT ...
        FROM tbl a
        JOIN (
                ( SELECT id
                    FROM tbl
                    WHERE id >= FLOOR(@min + (@max - @min + 1 - 50) * RAND())
                    ORDER BY id
                    LIMIT 50
                ) ORDER BY RAND()
                LIMIT 10
             ) r ON a.id = r.id

Notes:
    ⚈  This technique does not give each row an equal chance of being picked, but it is reasonably fast.
    ⚈  It can deliver too few rows if the random starting point lands near the end of the table.
    ⚈  It rarely delivers the first few rows of the table.
    ⚈  The 50/10 trick cuts down on the bias and repetitions, but does not eliminate them.

Case: Add a FLOAT column

    ⚈  You can add a column to the table.

Solution: Setup:
    ALTER TABLE tbl  ADD COLUMN rnd FLOAT NOT NULL;
    UPDATE tbl SET rnd = RAND();
    ALTER TABLE tbl  ADD INDEX(rnd);
In subsequent INSERTs, set `rnd` to RAND(). To fetch n random rows:
    SELECT ... FROM tbl
        WHERE rnd >= RAND()
        ORDER BY rnd
        LIMIT n;
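
A sketch of such an INSERT (col1 and col2 are hypothetical placeholder columns):
    INSERT INTO tbl (col1, col2, rnd)
        VALUES ('...', '...', RAND());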

Notes:
    ⚈  Because RAND() values are not equidistant, some rows are more likely to be fetched than others.
    ⚈  Rows that are adjacent in `rnd` order tend to show up together across fetches.
    ⚈  You can work around these issues (at the price of an occasional full-table update): UPDATE tbl SET rnd = RAND();
    ⚈  Sometimes it will deliver fewer than n rows.

Case: 50/10 trick

    ⚈  You can add a column to the table.
    ⚈  You want better results.

Solution: (This adds the 50/10 trick to the FLOAT solution.) Setup:
    ALTER TABLE tbl
        ADD COLUMN rnd FLOAT NOT NULL,
        ADD INDEX(rnd);
    UPDATE tbl SET rnd = RAND();
In subsequent INSERTs, set `rnd` to RAND(). To fetch n random rows:
    SELECT ...
        FROM tbl a
        JOIN (
                ( SELECT rnd
                    FROM tbl
                    WHERE rnd >= @min + (@max - @min) * RAND()
                    ORDER BY rnd
                    LIMIT 50
                ) ORDER BY RAND()
                LIMIT 10
             ) r ON a.rnd = r.rnd
            LIMIT 10;
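
The query above assumes @min and @max hold the bounds of `rnd`; since `rnd` comes from RAND(), they will be close to 0 and 1. One way to set them:
    SELECT @min := MIN(rnd),
           @max := MAX(rnd)
        FROM tbl;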

Notes:
    ⚈  The extra LIMIT 10 exists for the rare case where `rnd` has duplicate values.
    ⚈  If there are dup values in `rnd`, you will usually get the corresponding rows together.

Case: 'Digest' column

    ⚈  You have an indexed column that is an MD5, SHA-1, or similar `digest`.

Solution: This gets 50 "consecutive" digest values (possibly with gaps), then delivers a random 10 of them.
    SELECT ...
        FROM tbl a
        JOIN (
                ( SELECT digest
                    FROM tbl
                    WHERE digest > MD5(RAND())
                    ORDER BY digest
                    LIMIT 50
                ) ORDER BY RAND()
                LIMIT 10
             ) r ON a.digest = r.digest

Notes:
    ⚈  The "50" can be adjusted; it is a compromise between speed and bias.
    ⚈  Because of the randomness of a digest, there tend to be no huge gaps.
    ⚈  The range of a digest is predictable, hence no need for things like @max.
    ⚈  GUIDs (UUIDs) might work well with this algorithm.
    ⚈  The 50/10 trick cuts down on the bias and repetitions, but does not eliminate them.
    ⚈  If the digest has been turned into BINARY(16), then compare against UNHEX(MD5(RAND())) instead; see the sketch below.
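
A sketch of that BINARY(16) variant of the inner subquery (only the comparison value changes):
    SELECT digest
        FROM tbl
        WHERE digest > UNHEX(MD5(RAND()))
        ORDER BY digest
        LIMIT 50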

Case: Rebuild index in background

    ⚈  You can afford some background processing
    ⚈  You don't have to get the most "recent" rows when fetching.

Solution: Recreate a random lookup table once an hour:
    CREATE TABLE IF NOT EXISTS NewRandom (
        rid INT UNSIGNED NOT NULL AUTO_INCREMENT,
        PRIMARY KEY(rid, other_id)
    )
    SELECT id AS other_id FROM tbl ORDER BY RAND();  -- slow, but 'background'
    # Note: That table will have 2 columns
    DROP TABLE IF EXISTS OldRandom;
    RENAME TABLE Random TO OldRandom, NewRandom TO Random; -- atomic and instantaneous
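
The RENAME above requires that `Random` already exist; a minimal sketch for the very first run only (same shape as NewRandom):
    CREATE TABLE IF NOT EXISTS Random (
        rid INT UNSIGNED NOT NULL AUTO_INCREMENT,
        PRIMARY KEY(rid, other_id)
    )
    SELECT id AS other_id FROM tbl ORDER BY RAND();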
Fetch 10 random rows:
    SELECT @max := MAX(rid) FROM Random;
    SELECT ... FROM tbl t
        JOIN Random r ON r.other_id = t.id
        WHERE rid > (@max - 10 + 1) * RAND()
        ORDER BY rid
        LIMIT 10;

Notes:
    ⚈  No 'Bias' flaw.
    ⚈  May exhibit the 'Repetitions' flaw until the hour is up.
    ⚈  May lock table during the hourly rebuild.
    ⚈  You won't see 'recently' added rows.

Case: MySQL 5.6.2


The optimizer in 5.6.2 has a minor speedup for this type of query:
    SELECT col1, ... FROM t1 ... ORDER BY RAND() LIMIT 15;
but the other suggestions listed in this blog will _probably_ be faster.

Reference: More about the 5.6.2 semi-optimization is found in
Limit Optimization

Case: Full scan, but less to process


Step 1 -- Compute a cutoff based on the number of rows in the table:
    SELECT @cutoff := CAST(1.1 * 10 + 5 AS DECIMAL(20,8)) / TABLE_ROWS
        FROM information_schema.TABLES
        WHERE TABLE_SCHEMA = 'db'
          AND TABLE_NAME = 'tbl';
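
For example, with TABLE_ROWS = 1,000,000, @cutoff works out to (1.1 * 10 + 5) / 1000000 = 0.000016, so the scans below keep roughly 16 rows.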

Step 2 -- Scan the table to find 10 random rows -- Plan A:
    SELECT *
        FROM tbl
        WHERE RAND() <= @cutoff
        ORDER BY rand()
        LIMIT 10;
Step 2, Plan B:
    SELECT d.*
        FROM (
            SELECT a.id
                FROM tbl a
                WHERE RAND() <= @cutoff
            ) sample
        JOIN tbl d USING (id)
        ORDER BY rand()
        LIMIT 10;

Notes and assumptions:
    ⚈  10 rows are desired; replace as needed (in a few places).
    ⚈  The database name ('db' - once) and table name ('tbl' - a few times) need replacing.
    ⚈  The 1.1 and +5 are attempts to get enough rows in spite of the randomness of the algorithm; tweak if needed.
    ⚈  The code assumes that you want all the columns from the table; otherwise adjust the "d.*".
    ⚈  Most of the other schemes work by approximating random probes; this one is truly random.
    ⚈  5.6.2 added a small optimization to ORDER BY rand() LIMIT n;
    ⚈  Depending on a variety of things, Plan A or Plan B might be faster; experiment.
    ⚈  The outer SELECT in Plan B is very fast because it involves few rows.
    ⚈  You can use SHOW SESSION STATUS LIKE 'Handler%' to validate that it 'reads' only about N+10 rows, where N is the size of the table.

The 'obvious' SELECT * FROM tbl ORDER BY RAND() LIMIT 10; is several times as slow, mostly because it is hauling around "*", then tossing most of that data. The secret sauce in this technique is that it avoids that overhead.

(Thanks to Ollie Jones on stackoverflow.com.)

Case: Are there more cases?

    ⚈  Sure, there are more techniques.
    ⚈  No, there is not a 'perfect' solution.

Postlog


Original writing -- July, 2012

Another Approach
And another


Contact me by posting a question at MySQL Forums :: Performance
-- Rick James
