Count the Disk Hits

Table of Contents

Count the Disk Hits

Indexes

Details of locating record(s) in a BTree

Devolve into only 2 patterns

Point Query attributes

Range Scan attributes

NLJ - "Nested Loop Join"

WHERE clause Examples

GROUP BY

ORDER BY

"filesort"

LIMIT

OFFSET and LIMIT

Footnotes about MyISAM

Postlog


Count the Disk Hits



This discusses how to estimate the performance of a SELECT on a 'large' table without actually running it.

This focuses on InnoDB, with a brief section on how MyISAM differs.

This document repeats itself because it attacks the same points from different angles. Apologies if this is too boring for you.

All numbers in this document are "Rules of Thumb", and can vary significantly. Still, they are useful for computing metrics for comparing INDEX choices, etc.

Indexes


    ⚈  PRIMARY KEY -- A BTree with the table's fields in the leaves.
    ⚈  Secondary KEY -- A BTree with the secondary fields and the PK fields in the leaves.
    ⚈  JOIN -- Reach into another table (see NLJ, below).
    ⚈  "Covering" INDEX -- A secondary index whose leaves happen to contain all the fields needed.

To find a row via a PRIMARY KEY, do one BTree probe.

To find a row via a secondary KEY,
    ⚈  probe the BTree for the secondary KEY,
    ⚈  read the PRIMARY KEY (sitting in the leaf of that BTree),
    ⚈  probe the PRIMARY KEY BTree to get the rest of the fields.
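
To make this double probe concrete, here is a minimal sketch (the table t and its columns are hypothetical):
    CREATE TABLE t (
        id INT AUTO_INCREMENT PRIMARY KEY,
        b INT,
        c INT,
        INDEX(b)    -- secondary BTree; its leaves hold (b, id)
    ) ENGINE=InnoDB;

    SELECT c FROM t WHERE b = 42;
    -- probe 1: the INDEX(b) BTree, to find the id for b=42
    -- probe 2: the PRIMARY KEY BTree for that id, to get c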

Think of a PRIMARY KEY as a BTree whose payload is all the fields of the 'table'.

Think of a secondary index as a separate BTree whose payload is the fields of the PRIMARY KEY for that table.

Now, stop thinking about "tables" and only think about "BTrees".

A "Covering" INDEX is A secondary index whose leaves contain all the fields needed.

Details of locating record(s) in a BTree


    ⚈  Drill down the BTree to a 16KB block.
    ⚈  Locate the record in the block.
    ⚈  If doing a "range" scan, walk forward to the next record(s).

    ⚈  cost=0 for non-leaf nodes of BTree. (~3 levels for 1M rows; ~5 levels for 1B rows)
    ⚈  cost=1 for leaf node.
    ⚈  cost=1 per 100 records in range scan (think: 100 records per block)

Reading a node = reading a 16KB block = 10ms.
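
Putting those rules of thumb together (the row counts are only illustrative):
    1M-row table: ~3 levels; the non-leaf levels are cached, so a point query = 1 block read = ~10ms
    1B-row table: ~5 levels; still only the leaf misses the cache, so again ~10ms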

Devolve into only 2 patterns


Most of the cost of a SELECT boils down to two operations, applied as often as needed:
    ⚈  Given a value, dive into a BTree to find one row -- "point query"
    ⚈  Given a range of values, dive into a BTree, then scan forward -- "range scan"

Point Query attributes


Cost = 1. We assume the non-leaf nodes are cached in the buffer_pool, but the leaf node requires a disk hit. This assumption tends to be valid for tables that are bigger than the buffer_pool, yet smaller than 100 times innodb_buffer_pool_size. (Exception: "hot" record = those that are fetched frequently.)

When the 'record' is fetched, you have all the fields in the record.
    ⚈  For a PRIMARY KEY BTree, this is all the fields in the table
    ⚈  For a secondary KEY BTree, this includes the fields of both the secondary KEY and of the PRIMARY KEY.

Examples of "point queries":
    ⚈  PRIMARY KEY(id) - WHERE id = 123
    ⚈  UNIQUE(id) - WHERE id = 123
    ⚈  UNIQUE(x) - WHERE x LIKE 'mno' -- no wildcards, so same as '='
An IN clause can usually be thought of as multiple point queries.
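
For example, reusing the hypothetical table t from above (the ids are arbitrary):
    SELECT * FROM t WHERE id IN (11, 22, 33);
    -- roughly 3 point queries into the PRIMARY KEY BTree, so cost <= 3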

Range Scan attributes


Examples of "ranges":
    ⚈  INDEX(a) - WHERE a > 123
    ⚈  INDEX(a) - WHERE a BETWEEN 11 AND 22
    ⚈  INDEX(name) - WHERE name LIKE 'J%' -- but not WHERE name LIKE '%J'
    ⚈  INDEX(a,b) - WHERE a=123 AND b < 98
    ⚈  INDEX(a,b) - WHERE a=123 -- and no constraint on b
    ⚈  INDEX(a) - WHERE a = 123 -- Since a is not UNIQUE, it will scan until a > 123, thereby hitting an extra row

A range scan involves two steps:
    ⚈  a point query to find the first item in the range (cost = 1),
    ⚈  read consecutive records from that block, then link to subsequent block(s).

Assume, for simplicity, that there are 100 rows per block. This can be way off, but usually the error is minor in the grand scheme of things.

To simplify the cost, think of it as either
    ⚈  cost=1 for a "few" records, or
    ⚈  cost = N/100 if "lots" of records
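
Plugging in an illustrative number: a range matching ~5,000 rows costs about
    1 (initial dive) + 5000/100 (consecutive blocks) = 51 block reads = ~0.5 second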

NLJ - "Nested Loop Join"


This is a term that the manual talks about. It refers to
    ⚈  Find the JOIN key in one table, then
    ⚈  Reach into the other table.
An NLJ involves 1 or 2 BTrees in the first table, then 1 or 2 BTrees in the second table.

In this document, you should ignore NLJ, focusing instead on the BTrees that it devolves into.
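
Still, a hedged sketch of how one NLJ devolves may help (it uses the Main and Foo tables defined in the next section):
    SELECT m.*
        FROM Foo AS f
        JOIN Main AS m  USING(foo_id)
        WHERE f.bar = 'x';
    -- Foo side:  probe INDEX(bar); its leaf already holds foo_id (1 BTree)
    -- Main side: probe INDEX(foo_id), then Main's PRIMARY KEY to get m.* (2 BTrees)
    -- Total: 3 BTrees, agreeing with the cost=3 example below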

WHERE clause Examples


For this section, let's look at SELECTs without GROUP BY or ORDER BY. Let's assume these tables:
    CREATE TABLE Main (
        ...,
        PRIMARY KEY (id),
        INDEX(foo_id) -- Think of this as a table with (foo_id, id)
    )
    CREATE TABLE Foo (
        ...,
        PRIMARY KEY (foo_id),
        INDEX(bar)
    )

    ⚈  SELECT * WHERE id=123 -- 1 BTree, cost=1 -- "point query"
    ⚈  SELECT * WHERE foo_id=98 -- 2 BTrees, cost=2
    ⚈  SELECT id WHERE foo_id=98 -- 1 BTrees, cost=1 (Covering)
    ⚈  SELECT * JOIN USING foo_id WHERE bar='x' -- 3 BTrees, cost=3
    ⚈  SELECT id JOIN USING foo_id WHERE bar='x' -- 2 BTrees, cost=2 (Covering)

Now let's look up multiple rows. Assume each BETWEEN matches N rows and each IN clause lists N items.
    ⚈  SELECT * WHERE id BETWEEN ... -- 1 BTree, cost = 1 + N/100 -- "range scan"
    ⚈  SELECT * WHERE id IN (...) -- 1 BTree, cost <= N -- possibly hitting lots of blocks; think of it as multiple 'point queries'
    ⚈  SELECT id WHERE foo_id BETWEEN ... -- 1 BTree, cost = 1 + N/100 (Covering) -- "index scan"
    ⚈  SELECT * WHERE foo_id BETWEEN ... -- 2 BTrees, cost = (1 + N/100) + N -- possibly random lookups to reach for '*'
Notice how this can do a "range scan" on the INDEX BTree, but then must do random point queries into the PRIMARY KEY BTree.
    ⚈  SELECT * JOIN USING foo_id WHERE bar BETWEEN ... -- 3 BTrees, cost = (1 + N/100) + N + N
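
To see why that last pattern hurts, plug in a hypothetical N = 1,000:
    cost = (1 + 1000/100) + 1000 + 1000 = 2011 block reads = ~20 seconds at 10ms each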

GROUP BY


A GROUP BY may be able to do the "grouping" in one pass. Or it may need to gather the data into a temp table, sort that, then do the grouping.

These should be possible in a single pass:
    ⚈  INDEX (a) - WHERE 1 GROUP BY a
    ⚈  INDEX (a,b) - WHERE a=123 GROUP BY b
    ⚈  INDEX (a,b) - WHERE 1 GROUP BY a,b

(I say "should be possible", because the optimizer has a few glitches.)

ORDER BY


An ORDER BY may be able to use an INDEX, thereby avoiding a separate 'filesort' pass.

These should be possible without a filesort:
    ⚈  INDEX (a) - WHERE 1 ORDER BY a
    ⚈  INDEX (a) - WHERE 1 ORDER BY a DESC
    ⚈  INDEX (a,b) - WHERE a=123 ORDER BY b
    ⚈  INDEX (a,b) - WHERE a=123 ORDER BY b DESC
    ⚈  INDEX (a,b) - WHERE 1 ORDER BY a,b
    ⚈  INDEX (a,b) - WHERE 1 ORDER BY a DESC, b DESC

These require a filesort:
    ⚈  INDEX (a,b) - WHERE 1 ORDER BY a ASC, b DESC -- mixed ASC and DESC
    ⚈  INDEX (a,b) - WHERE 1 ORDER BY b,a -- wrong column order

If you have both a WHERE and an ORDER BY, the optimizer has three choices:
    ⚈  If an INDEX perfectly matches both the WHERE and the ORDER BY (as above), use it.
    ⚈  Use an index for the WHERE, then do a filesort for the ORDER BY.
    ⚈  Fetch the rows based on the ORDER BY, and filter on the WHERE as the rows are fetched.
When given a choice between the second and third, the optimizer often does not have sufficient insight into the data to make a good decision.
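
A hedged illustration of that dilemma (table t4 and its indexes are hypothetical):
    -- Suppose INDEX(a) and INDEX(b), but no composite index:
    SELECT * FROM t4 WHERE a = 123 ORDER BY b LIMIT 10;
    -- Second choice: range scan INDEX(a), filesort the matches, deliver 10 rows.
    -- Third choice:  walk INDEX(b) in order, filter on a=123, stop after 10 hits.
    -- Which is cheaper depends on how many rows have a=123 -- a number the
    -- optimizer can only estimate.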

"filesort"


"Using temporary, using filesort" can show up in the EXPLAIN. But it is a misnomer.

The "temporary" may be a RAM-only table, hence very efficient. Even if it spills to disk as a MyISAM table, it is not all that 'bad'.

A "filesort" in RAM is plenty fast (think C's qsort).

Some cases
    ⚈  GROUP BY x ORDER BY x -- identical to GROUP BY x
    ⚈  GROUP BY x ORDER BY y -- requires 1 or 2 sorts.
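
For example, a sketch of the second case (table t5 is hypothetical):
    SELECT x, COUNT(*) AS ct
        FROM t5
        GROUP BY x          -- may need a sort (or temp table) to do the grouping
        ORDER BY ct DESC;   -- must sort again, on the computed count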

LIMIT


If (and only if) an index handles all of the WHERE, GROUP BY, and ORDER BY will the LIMIT be efficient. If "filesort" is mentioned in the EXPLAIN, then think of the query as being first performed without the LIMIT clause, then a LIMIT's worth of rows are delivered to the client.

Examples:
    ⚈  INDEX(a) - WHERE 1 ORDER BY a LIMIT 5 -- reads only 5 rows
    ⚈  INDEX(a) - WHERE a>123 ORDER BY a LIMIT 5 -- reads only 5 rows
    ⚈  INDEX(a) - WHERE 1 ORDER BY b LIMIT 5 -- reads and sorts the entire table

OFFSET and LIMIT


LIMIT 500, 10 (or, equivalently LIMIT 10 OFFSET 500) must read 510 rows from the table (or the temp table after dealing with an ORDER BY).

Corollary: pagination via OFFSET and LIMIT gets slower and slower as the offset grows. Instead, remember the id where you "left off" and continue from there.
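
A minimal sketch of "remember where you left off" (assuming id is Main's AUTO_INCREMENT PRIMARY KEY):
    SELECT * FROM Main ORDER BY id LIMIT 10;   -- page 1; note the last id seen (say, 1234)

    SELECT * FROM Main WHERE id > 1234
        ORDER BY id LIMIT 10;                  -- next page: still reads only 10 rows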

Footnotes about MyISAM


    ⚈  MyISAM data (.MYD file) is not a BTree; it is accessed by an offset (think 'seek') that is either a byte offset or record number.
    ⚈  All indexes, including the PK, are in the .MYI file and are BTree-structured.
    ⚈  BTree blocks are 1KB.
    ⚈  The BTrees are cached in the key_buffer, whose size is set by key_buffer_size.
    ⚈  Data is cached by the OS, so leave some room for it.
    ⚈  The leaves of the indexes have a pointer into the .MYD file (not the PK).
    ⚈  Because of different leaf contents, don't expect PK fields to be automatically "covered" by a secondary key.

Postlog


I coined "Count the disk hits" in 2009 or earlier. This discussion was originally written in June, 2013.
-- Rick James
