Count the Disk Hits
CAUTION: Needs lots of editing!
This discusses how to estimate the performance of a SELECT on a 'large' table without actually running it.
This focuses on InnoDB, with a brief section on how MyISAM differs.
This document repeats itself because it attacks the same points from different angles. Appologies if this is too boring for you.
All numbers in this document are "Rules of Thumb", and can vary significantly. Still, they are useful for computing metrics for comparing INDEX choices, etc.
⚈ PRIMARY KEY -- A BTree with the table's fields in the leaves.
⚈ Secondary KEY -- A BTree with the secondary fields and the PK fields in the leaves.
⚈ ??JOIN -- Reach into another table.
⚈ "Covering" INDEX -- A secondary index whose leaves happen to contain all the fields needed.
To find a row via a PRIMARY KEY, do one BTree probe.
To find a row via a secondary KEY,
⚈ probe the BTree for the seconary KEY,
⚈ read the PRIMARY KEY (sitting in the leaf of that BTree),
⚈ probe the PRIMARY KEY BTree to get the rest of the fields.
Think of a PRIMARY KEY as a BTree whose payload is all the fields of the 'table'.
Think of a secondary index as a separate BTree whose payload is the fields of the PRIMARY KEY for that table.
Now, stop thinking about "tables" and only think about "BTrees".
A "Covering" INDEX is A secondary index whose leaves contain all the fields needed.
Details locating a record(s) in a BTree
⚈ Drill down the BTree to a 16KB block.
⚈ Locate the record in the block
⚈ If "range" scan, walk forward to next record(s)
⚈ cost=0 for non-leaf nodes of BTree. (~3 levels for 1M rows; ~5 levels for 1B rows)
⚈ cost=1 for leaf node.
⚈ cost=1 per 100 records in range scan (think: 100 records per block)
Reading a node = reading a 16KB block = 10ms.
Devolve into _only_ 2 patterns
Most of the cost of a SELECT boils down to two operations, applied as often as needed:
⚈ Given a value, dive into a BTree to find one row -- "point query"
⚈ Given a range of values, (1) dive into a BTree, then scan forward -- "range scan"
Point Query attributes
Cost = 1. We assume the non-leaf nodes are cached in the buffer_pool, but the leaf node requires a disk hit. This assumption tends to be valid for tables that are bigger than the buffer_pool, yet smaller than 100 times innodb_buffer_pool_size. (Exception: "hot" record = those that are fetched frequently.)
When the 'record' is fetched, you have all the fields in the record.
⚈ For a PRIMARY KEY BTree, this is all the fields in the table
⚈ For a secondary KEY BTree, this includes the fields of both the secondary KEY and of the PRIMARY KEY.
Examples of "point queries":
⚈ PRIMARY KEY(id) - WHERE id = 123
⚈ UNIQUE(id) - WHERE id = 123
⚈ UNIQUE(x) - WHERE x LIKE 'mno' -- no wild cards, so same as '='
An IN clause can usually be thought of as multiple point queries.
Range Scan attributes
Examples of "ranges":
⚈ INDEX(a) - WHERE a > 123
⚈ INDEX(a) - WHERE a BETWEEN 11 AND 22
⚈ INDEX(name) - WHERE name LIKE 'J%' -- but not WHERE name LIKE '%J'
⚈ INDEX(a,b) - WHERE a=123 AND b < 98
⚈ INDEX(a,b) - WHERE a=123 -- and no constraint on `b`
⚈ INDEX(a) - WHERE a = 123 -- Since `a` is not UNIQUE, it will scan until a > 123, thereby hitting an extra row
A range scan involves two steps:
⚈ a point query to find the first item in the range (cost = 1),
⚈ read consecutive records from that block, then link to subsequent block(s).
Assume, for simplicity, that there are 100 rows per block. This can be way off, but usually the error is minor in the grand scheme of things.
To simplify the cost, think of it as either
⚈ cost=1 for a "few" records, or
⚈ cost = N/100 if "lots" of records
NLJ - "Nested Loop Join"
This is a term that the manual talks about. It refers to
⚈ Find the JOIN key in one table, then
⚈ Reach into the other table.
A NLJ invoves 1 or 2 BTrees in the first table, then 1 or 2 BTrees in the second table.
In this document, you should ignore NLJ, focusing instead on the BTrees that it devolves into.
WHERE clause Examples
For this section let's look at SELECTs without GROUP BY or ORDER BY. Lets assume these tables:
CREATE TABLE Main (
PRIMARY KEY (id),
INDEX(foo_id) -- Think of this as a table with (foo_id, id)
CREATE TABLE Foo (
PRIMARY KEY (foo_id),
⚈ SELECT * WHERE id=123 -- 1 BTree, cost=1 -- "point query"
⚈ SELECT * WHERE foo_id=98 -- 2 BTrees, cost=2
⚈ SELECT id WHERE foo_id=98 -- 1 BTrees, cost=1 (Covering)
⚈ SELECT * JOIN USING foo_id WHERE bar='x' -- 3 BTrees, cost=3
⚈ SELECT id JOIN USING foo_id WHERE bar='x' -- 2 BTrees, cost=2 (Covering)
Now let's look up multiple rows. Let's assume there are N items in each BETWEEN or IN clause.
⚈ SELECT * WHERE id BETWEEN ... -- 1 BTree, cost = 1 + N/100 -- "table scan"
⚈ SELECT * WHERE id IN (...) -- 1 BTree, cost <= N -- possibly hitting lots of blocks; think of it is multiple 'point queries'
⚈ SELECT id WHERE foo_id=98 -- 1 BTree, cost=1 (Covering) -- "index scan"
⚈ SELECT * WHERE foo_id BETWEEN ... -- 2 BTrees, cost = (1 + N/100) + N -- possibly random lookups to reach for '*'
Notice how this can do a "range scan" on the INDEX BTree, but then has to random point queries into the PRIMARY KEY BTree.
⚈ SELECT * JOIN USING foo_id WHERE bar BETWEEN ... -- 3 BTrees, cost = (1 + N/100) + N + N
A GROUP BY may be able to do the "grouping" in one pass. Or it may need to gather the data into a temp table, sort that, then do the grouping.
These should be possible in a single pass:
⚈ INDEX (a) - WHERE 1 GROUP BY a
⚈ INDEX (a,b) - WHERE a=123 GROUP BY b
⚈ INDEX (a,b) - WHERE 1 GROUP BY a,b
(I say "should be possible", because the optimizer has a few glitches.)
An ORDER BY may be able to use an INDEX, thereby avoiding a separate 'filesort' pass.
These should be possible without a filesort:
⚈ INDEX (a) - WHERE 1 ORDER BY a
⚈ INDEX (a) - WHERE 1 ORDER BY a DESC
⚈ INDEX (a,b) - WHERE a=123 ORDER BY b
⚈ INDEX (a,b) - WHERE a=123 ORDER BY b DESC
⚈ INDEX (a,b) - WHERE 1 ORDER BY a,b
⚈ INDEX (a,b) - WHERE 1 ORDER BY a DESC, b DESC
These require a filesort
⚈ INDEX (a,b) - WHERE 1 ORDER BY a ASC, b DESC
⚈ INDEX (a,b) - WHERE 1 ORDER BY a,b
If you have both a WHERE and an ORDER BY, the optimizer has three choices:
⚈ If an INDEX perfectly matches both the WHERE and the ORDER BY (as above), use it.
⚈ Use an index for the WHERE, then do a filesort for the ORDER BY.
⚈ Fetch the rows based on the ORDER BY, and filter on the WHERE as the rows are fetched.
When given a choice between the second and third, the optimizer often does not have sufficient insight into the data to make a good decision.
"Using temporary, using filesort" can show up in the EXPLAIN. But it is a misnomer.
The "temporary" may be a RAM-only table, hence very efficient. Even if it spills to disk as a MyISAM table, it is not all that 'bad'.
A "filesort" in RAM is plenty fast (think C's qsort).
⚈ GROUP BY x ORDER BY x -- identical to GROUP BY x
⚈ GROUP BY x ORDER BY y -- requires 1 or 2 sorts.
If (and only if) an index handles _all_ of the WHERE, GROUP BY, and ORDER BY will the LIMIT be efficient. If "filesort" is mentioned in the EXPLAIN, then think of the query as being first performed without the LIMIT clause, then a LIMIT's worth of rows are delivered to the client.
⚈ INDEX(a) - WHERE 1 ORDER BY a LIMIT 5 -- reads only 5 rows
⚈ INDEX(a) - WHERE a>123 ORDER BY a LIMIT 5 -- reads only 5 rows
⚈ INDEX(a) - WHERE 1 ORDER BY bb LIMIT 5 -- reads and sorts entire table
OFFSET and LIMIT
LIMIT 500, 10 (or, equivalently LIMIT 10 OFFSET 500) must read 510 rows from the table (or the temp table after dealing with an ORDER BY).
Corollary: Pagination via OFFSET and LIMIT gets slower and slower. Instead, remember the id where you "left off" instead of using OFFSET.
Footnotes about MyISAM
⚈ MyISAM data (.MYD file) is not a BTree; it is accessed by an offset (think 'seek') that is either a byte offset or record number.
⚈ All indexes, including the PK, are in the .MYI file and are BTree-structured.
⚈ BTree blocks are 1KB.
⚈ The BTrees are cached in the key_buffer, configured of size key_buffer_size.
⚈ Data is cached by the OS, so leave some room for it.
⚈ The leaves of the indexes have a pointer into the .MYD file (not the PK).
⚈ Because of different leaf contents, don't expect PK fields to be automatically "covered" by a secondary key.
I coined "Count the disk hits" in 2009 or earlier. Original writing of this discussion -- June, 2013
Contact me by posting a question at MySQL Forums :: Performance
-- Rick James