Fetching Random Rows from a Table

Table of Contents

The Problem
Case: Consecutive AUTO_INCREMENT
Case: AUTO_INCREMENT with gaps
Case: Add a FLOAT column
Case: 50/10 trick
Case: 'Digest' column
Case: Rebuild index in background
Case: Are there more cases?
Postlog
Brought to you by Rick James

The Problem


One would like to do "SELECT ... ORDER BY RAND() LIMIT 10" to get 10 rows at random. But this is slow. The optimizer does
    ⚈  Fetch all the rows -- this is costly
    ⚈  Append RAND() to the rows
    ⚈  Sort the rows -- also costly
    ⚈  Pick the first 10.

All the algorithms given below are "fast", but most introduce flaws:
    ⚈  Bias -- some rows are more like to be fetched than others.
    ⚈  Repetitions -- If two random sets contain the same row, they are likely to contain other dups.
    ⚈  Sometimes failing to fetch the desired number of rows.

"Fast" means avoiding reading all the rows.

Case: Consecutive AUTO_INCREMENT

    ⚈  AUTO_INCREMENT id
    ⚈  No gaps in id

Solution (for 1 row):
    SELECT @min := MIN(id),
           @max := MAX(id)
        FROM tbl;
    SELECT ... FROM tbl
        WHERE id = FLOOR(@min + (@max - @min + 1) * RAND());
(Of course, you might be able to optimize this. For example, @min is likely to be 1.)

Notes:
    ⚈  Very fast for 1 row.
    ⚈  Must fetch multiple rows with multiple queries.
    ⚈  None of the "flaws" occur.

Case: AUTO_INCREMENT with gaps

    ⚈  AUTO_INCREMENT, but there are gaps -- caused by DELETEs, etc

Solution: This gets 50 "consecutive" ids (possibly with gaps), then delivers a random 10 of them.
    SELECT @min := MIN(id),
           @max := MAX(id)
        FROM tbl;
    SELECT ...
        FROM tbl a
        JOIN (
                ( SELECT id
                    FROM tbl
                    WHERE id = FLOOR(@min + (@max - @min + 1 - 50) * RAND())
                    ORDER BY id
                    LIMIT 50
                ) ORDER BY RAND()
                LIMIT 10
             ) r ON a.id = r.id

Notes:
    ⚈  This technique does not give each row an equal chance of being picked, but it is reasonably fast.
    ⚈  It can deliver too few rows if it is at the end of the table
    ⚈  It rarely delivers the first few rows of the table.
    ⚈  The 50/10 trick cuts down on the bias and repetitions, but does not eliminate it.

Case: Add a FLOAT column

    ⚈  You can add a column to the table.

Solution: Setup:
    ALTER TABLE tbl
        ADD COLUMN rnd FLOAT NOT NULL
        ADD INDEX(rnd);
    UPDATE tbl SET rnd = RAND();
In subsequents INSERTs, set the `rnd` to RAND(). To fetch n random rows:
    SELECT ... FROM tbl
        WHERE rnd >= RAND()
        ORDER BY rnd
        LIMIT n;

Notes:
    ⚈  Because RAND() values are not equidistant, some rows are more likely to be fetched than others.
    ⚈  Consecutive rows show up together
    ⚈  You can work around these issues (at a price): UPDATE tbl SET rnd = RAND();
    ⚈  Sometimes it will deliver fewer than n rows.

Case: 50/10 trick

    ⚈  You can add a column to the table.
    ⚈  You want better results.

Solution: (This adds the 50/10 trick to the above solution.) Setup:
    ALTER TABLE tbl
        ADD COLUMN rnd FLOAT NOT NULL
        ADD INDEX(rnd);
    UPDATE tbl SET rnd = RAND();
In subsequents INSERTs, set the `rnd` to RAND(). To fetch n random rows:
    SELECT ...
        FROM tbl a
        JOIN (
                ( SELECT rnd
                    FROM tbl
                    WHERE rnd = @min + (@max - @min + 1 - 50) * RAND()
                    ORDER BY rnd
                    LIMIT 50
                ) ORDER BY RAND()
                LIMIT 10
             ) r ON a.rnd = r.rnd
            LIMIT 10

Notes:
    ⚈  The extra LIMIT 10 exists for the rare case where `rnd` has duplicate values.
    ⚈  If there are dup values in `rnd`, you will usually get the corresponding rows together.

Case: 'Digest' column

    ⚈  You have an indexed column that is a MD5 or SHA-1 or similar `digest`.

Solution: This gets 50 "consecutive" ids (possibly with gaps), then delivers a random 10 of them.
    SELECT ...
        FROM tbl a
        JOIN (
                ( SELECT digest
                    FROM tbl
                    WHERE digest > MD5(RAND()))
                    ORDER BY digest
                    LIMIT 50
                ) ORDER BY RAND()
                LIMIT 10
             ) r ON a.id = r.id

Notes:
    ⚈  The "50" can be adjusted; it is a compromise between speed and bias.
    ⚈  Because of the randomness of a digest, the there tend to be no huge gaps.
    ⚈  The range of a digest is predictable, hence no need for things like @max.
    ⚈  GUIDs (UUIDs) might work well with this algorithm.
    ⚈  The 50/10 trick cuts down on the bias and repetitions, but does not eliminate it.

Case: Rebuild index in background

    ⚈  You can afford some background processing
    ⚈  You don't have to get the most "recent" rows when fetching.

Solution: Recreate a random lookup table once an hour:
    CREATE TABLE IF NOT EXISTS NewRandom (
        rid INT UNSIGNED NOT NULL AUTO_INCREMENT,
        PRIMIARY KEY(rid, other_id)
    )
    SELECT id AS other_id FROM tbl ORDER BY RAND();  -- slow, but 'background'
    # Note: That table will have 2 columns
    DROP TABLE IF EXISTS OldRandom;
    RENAME TABLE Random TO OldRandom, NewRandom TO Random; -- atomic and instantaneous
Fetch 10 random rows:
    SELECT @max := MAX(rid) FROM Random;
    SELECT ... FROM tbl t
        JOIN Random r ON r.other_id = t.id
        WHERE rid > (@max - 10 + 1) * RAND()
        ORDER BY rid
        LIMIT 10;

Notes:
    ⚈  No 'Bias' flaw.
    ⚈  May exhibit 'Repitition' flaw until the hour is up.
    ⚈  May lock table during the hourly rebuild.
    ⚈  You won't see 'recently' added rows.

Case: Are there more cases?

    ⚈  Sure, there are more techniques.
    ⚈  No, there is not a 'perfect' solution.

Postlog


Original writing -- July, 2012

Another Approach
And another


Contact me by posting a question at MySQL Forums :: Performance
-- Rick James

Rick's MySQL Documents

MySQL Documents by Rick James

Tips, Debugging, HowTos, Optimizations, etc...

Rick's RoTs (Rules of Thumb -- lots of tips)
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Converting from MyISAM to InnoDB -- includes differences between them
Big DELETEs - how to optimize
Compound INDEXes plus other insights into the mysteries of INDEXing
Partition Maintenance (DROP+REORG) for time series
Entity-Attribute-Value -- a common, poorly performing, design patter; plus an alternative
Find the nearest 10 pizza parlors (efficient searching on Latitude + Longitude)
Alter of a Huge table
Latest 10 news articles -- how to optimize the schema and code for such
Pagination, not with OFFSET, LIMIT
Data Warehouse techniques (esp., Summary Tables)
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance
MySQL Limits
Galera Limitations (with Percona XtraDB Cluster / MariaDB)
Rollup Unique User Counts
Best of MySQL Forum