MySQL: Building the best INDEX for a given SELECT
Brought to you by Rick James
You have a SELECT and you want to build the best INDEX for it.
This blog is a "cookbook" on how to do that task.
⚈ A short algorithm that works for many simpler SELECTs and helps in complex queries.
⚈ Examples of the algorithm, plus digressions into exceptions and variants.
⚈ Finally a long list of "other cases".
The hope is that a newbie can quickly get up to speed, and his/her INDEXes will no longer
smack of "newbie".
Many edge cases are explained, so even an expert may find something useful here.
Here's the way to approach creating an INDEX, given a SELECT.
Follow the steps below, gathering columns to put in the INDEX in order.
When the steps give out, you usually have the 'perfect' index.
1. Given a WHERE with a bunch of expressions connected by AND: Include the columns (if any), in any order, that are compared to a constant and not hidden in a function.
2. You get one more chance to add to the INDEX; do the first of these that applies:
⚈ 2a. One column used in a 'range' -- BETWEEN, >, LIKE w/o leading wildcard, etc.
⚈ 2b. All columns, in order, of the GROUP BY.
⚈ 2c. All columns, in order, of the ORDER BY.
This blog assumes you know the basic idea behind having an INDEX.
Here is a refresher on some of the key points.
Virtually all INDEXes in MySQL are structured as
BTrees allow very efficient for
⚈ Given a key, find the corresponding row(s);
⚈ "Range scans" -- That is, start at one value for the key and repeatedly find the "next" (or "previous") row.
A PRIMARY KEY is a UNIQUE KEY; a UNIQUE KEY is an INDEX. (KEY == INDEX.)
InnoDB clusters the PRIMARY KEY with the data.
Hence, given the value of the PK ("PRIMARY KEY"), after drilling down the BTree to find the index entry, you have
all the columns of the row when you get there.
A "secondary key" (any UNIQUE or INDEX other than the PK) in InnoDB first drills down the BTree
for the secondary index, where it finds a copy of the PK
Then it drills down the PK to find the row.
Every InnoDB table has a PRIMARY KEY. While there is a default if you do
not specify one, it is best to explicitly provide a PK.
For completeness: MyISAM works differently. All indexes (including the PK) are in separate BTrees.
The leaf node of such BTrees have a pointer (usually a byte offset) into the data file.
All discussion here assumes InnoDB tables, however most statements apply to other Engines.
A concise discussion of InnoDB's PK
First, some examples
Think of a list of names, sorted by last_name, then first_name.
You have undoubtedly seen such lists, and they often have other information such as address and phone number.
Suppose you wanted to look me up.
If you remember my full name ('James' and 'Rick'), it is easy to find my entry.
If you remembered only my last name ('James') and first initial ('R').
You would quickly zoom in on the Jameses and find the Rs in them. There, you might remember 'Rick' and ignore 'Ronald'.
But, suppose you remembered my first name ('Rick') and only my last initial ('J').
Now you are in trouble. You would be scanning all the Js -- Jones, Rick; Johnson, Rick; Jamison, Rick; etc, etc.
That's much less efficient.
Those equate to
INDEX(last_name, first_name) -- the order of the list.
WHERE last_name = 'James' AND first_name = 'Rick' -- best case
WHERE last_name = 'James' AND first_name LIKE 'R%' -- pretty good
WHERE last_name LIKE 'J%' AND first_name = 'Rick' -- pretty bad
Think about this example as I talk about "=" ('equality predicate') versus "range" in the Algorithm, below.
Algorithm, Step 1 (WHERE "column = const")
⚈ WHERE aaa = 123 AND ... : an INDEX starting with aaa is good.
⚈ WHERE aaa = 123 AND bbb = 456 AND ... : an INDEX starting with aaa and bbb is good. In this case, it does not matter whether aaa or bbb comes first in the INDEX.
⚈ xxx IS NULL : this acts like "= const" for this discussion.
⚈ WHERE t1.aa = 123 AND t2.bb = 456 -- You must only consider columns in the current table.
Note that the expression must be of the form of column_name = (constant).
The following do not work for step 1:
DATE(dt) = '...', LOWER(s) = '...', CAST(s ...) = '...', x='...' COLLATE... --
They "hide the column in a function call". The index cannot be used.
(If there are no "=" parts AND'd in the WHERE clause, move on to step 2 without any columns in your putative INDEX.)
Algorithm, Step 2
Find the first of 2a / 2b / 2c that applies; use it; then quit.
If none apply, then you are through gathering columns for the index.
In some cases it is optimal to do step 1 (all equals) plus step 2c (ORDER BY).
Algorithm, Step 2a (one range)
A "range" shows up as
⚈ aaa >= 123 -- any of <, <=, >=, >; but not <>, !=, IS NOT NULL
⚈ aaa BETWEEN 22 AND 44
⚈ sss LIKE 'blah%' -- but not sss LIKE '%blah'
⚈ xxx IS NOT NULL
Add the column in the range to your putative INDEX.
If there are more parts to the WHERE clause, you must stop now.
Complete examples (assume nothing else comes after the snippet)
⚈ WHERE aaa >= 123 AND bbb = 1 ⇒ INDEX(bbb, aaa) (WHERE order does not matter; INDEX order does)
⚈ WHERE aaa >= 123 ⇒ INDEX(aaa)
⚈ WHERE aaa >= 123 AND ccc > 'xyz' ⇒ INDEX(aaa) or INDEX(ccc) (only one range) (See ICP, below)
⚈ WHERE aaa >= 123 ORDER BY aaa ⇒ INDEX(aaa) -- Bonus: The ORDER BY will use the INDEX.
⚈ WHERE aaa >= 123 ORDER BY aaa ⇒ INDEX(aaa) DESC -- Same Bonus.
(There are some cases where multiple ranges on a single column will work.)
Algorithm, Step 2b (GROUP BY)
If there is a GROUP BY, all the columns of the GROUP BY should now be added, in the specified order, to the INDEX you are building.
(I do not know what happens if one of the columns is already in the INDEX.)
If you are GROUPing BY an expression (including function calls), you cannot use the GROUP BY; stop.
Complete examples (assume nothing else comes after the snippet)
⚈ WHERE aaa = 123 AND bbb = 1 GROUP BY ccc ⇒ INDEX(bbb, aaa, ccc) or INDEX(aaa, bbb, ccc) (='s first, in any order; then the GROUP BY)
⚈ WHERE aaa >= 123 GROUP BY xxx ⇒ INDEX(aaa) (You should have stopped with Step 2a)
⚈ GROUP BY x,y ⇒ INDEX(x,y) (if there is no WHERE)
⚈ WHERE aaa = 123 GROUP BY xxx,(a+b) ⇒ INDEX(aaa) -- expression in GROUP BY, so no use including even xxx.
If the GROUP BY and ORDER BY have (or could have) the same columns in the same order; adjust them to 'match'. That way, both GROUP BY and ORDER BY can be done at the same time. This is likely to obviate an extra sort.
Algorithm, Step 2c (ORDER BY)
If there is a ORDER BY, all the columns of the ORDER BY should now be added, in the specified order, to the INDEX you are building.
If there are multiple columns in the ORDER BY, and there is a mixture of ASC and DESC, do not add the ORDER BY columns; they won't help; stop.
(There is an exception in version 8.0 -- If you declare the index with a mixture of ASC and DESC, the matching ORDER BY can use the index.)
If it does not matter to the results, change all to ASC or DESC.
If you are ORDERing BY an expression (including function calls), you cannot use the ORDER BY; stop.
Complete examples (assume nothing else comes after the snippet)
⚈ WHERE aaa = 123 GROUP BY ccc ORDER BY ddd ⇒ INDEX(aaa, ccc) -- should have stopped with Step 2b
⚈ WHERE aaa = 123 GROUP BY ccc ORDER BY ccc ⇒ INDEX(aaa, ccc) -- the ccc will be used for both GROUP BY and ORDER BY
⚈ WHERE aaa = 123 ORDER BY xxx ASC, yyy DESC ⇒ INDEX(aaa) -- mixture of ASC and DESC.
The following are especially good.
Normally a LIMIT cannot be applied until after lots of rows are gathered and then sorted according to the ORDER BY.
But, if the INDEX gets all they way through the ORDER BY, only (OFFSET + LIMIT) rows need to be gathered.
So, in these cases, you win the lottery with your new index:
⚈ WHERE aaa = 123 GROUP BY ccc ORDER BY ccc LIMIT 10 ⇒ INDEX(aaa, ccc)
⚈ WHERE aaa = 123 ORDER BY ccc LIMIT 10 ⇒ INDEX(aaa, ccc)
⚈ ORDER BY ccc LIMIT 10 ⇒ INDEX(ccc)
⚈ WHERE ccc > 432 ORDER BY ccc LIMIT 10 ⇒ INDEX(ccc) -- This "range" is compatible with ORDER BY
⚈ WHERE ccc < 432 ORDER BY ccc LIMIT 10 ⇒ INDEX(ccc) -- also works
(It does not make much sense to have a LIMIT without an ORDER BY, so I do not discuss that case.)
You have collected a few columns; put them in an INDEX and ADD that to the table.
That will often produce a "good" index for the SELECT you have.
Below are some other suggestions that may be relevant.
An example of the Algorithm being 'wrong':
SELECT ... FROM t WHERE flag = true;
This would (according to the Algorithm) call for INDEX(flag).
However, indexing a column that has two (or a small number of) different values is almost always useless. This is called 'low cardinality'.
The Optimizer would prefer to do a table scan than to bounce between the index BTree and the data.
On the other hand, the Algorithm is 'right' with
SELECT ... FROM t WHERE flag = true AND date >= '2015-01-01';
That would call for a composite (compound) index starting with a flag: INDEX(flag, date).
Such an index is likely to be very beneficial.
And it is likely to be more beneficial than INDEX(date).
Even more striking is with a LIMIT:
SELECT ... FROM t WHERE flag = true AND date >= '2015-01-01' LIMIT 10;
INDEX(flag, date) would be able to stop after 10 rows; INDEX(date) would not.
If your resulting INDEX includes column(s) that are likely to be UPDATEd, note that
the UPDATE will have extra work to remove a 'row' from one place in the INDEX's BTree
and insert a 'row' back into the BTree. For example:
UPDATE t SET x = ... WHERE ...;
There are too many variables to say whether it is better to keep the index or to toss it.
(There are exceptions to some of these.)
⚈ In 5.6 / 10.0, you may not include a column that equates to bigger than 767 bytes: VARCHAR(255) CHARACTER SET utf8 or VARCHAR(191) CHARACTER SET utf8mb4.
⚈ In 5.7 / 10.2, you may not include a column that equates to bigger than 3072 bytes.
⚈ You can deal with big fields using "prefix" indexing; but see below.
⚈ You should not have more than 5 columns in an index. (This is just a Rule of Thumb; up to 16 is allowed.)
⚈ You should not have redundant indexes. (See below.)
Workarounds for 767 problem
Stop after equality test
Technically, it is more complex than testing for '=': The relevant part from the documentation: "The optimizer attempts to use additional key parts to determine the interval as long as the comparison operator is =, <=>, or IS NULL. If the operator is >, <, >=, <=, !=, <>, BETWEEN, or LIKE, the optimizer uses it but considers no more key parts." See
Flags and Low Cardinality
INDEX(flag) is almost never useful if flag has very few values.
More specifically, when you say WHERE flag = 1 and "1" occurs more than
20% of the time, such an index will be shunned. The Optimizer would prefer
to scan the table instead of bouncing back and forth between the index and the data for more than 20% of the rows.
("20%" is really somewhere between 10% and 30%, depending on the phase of the moon.)
Example of 20% cutoff
When you have a composite index that could be in any order, the cardinality of the individual columns does not matter in picking the order. The cardinality of the entire index is what matters.
Discussions about Cardinality:
Cardinality & Range
Cardinality & composite
Equality before range, with 'proof'
A covering index is an index that contains all the columns in the SELECT.
It is special in that the SELECT can be completed by looking only at the INDEX BTree.
(Since InnoDB's PRIMARY KEY is clustered with the data, "covering" is of no benefit when considering at the PRIMARY KEY.)
1. Gather the list of column(s) according to the "Algorithm", above.
2. Add to the end of the list the rest of the columns seen in the SELECT, in any order.
⚈ SELECT x FROM tbl WHERE y = 5; ⇒ INDEX(y,x) -- The algorithm said just INDEX(y)
⚈ SELECT x,z FROM tbl WHERE y = 5 AND q = 7; ⇒ INDEX(y,q,x,z) -- y and q in either order (Algorithm), then x and z in either order (covering).
⚈ SELECT x FROM tbl WHERE y > 5 AND q > 7; ⇒ INDEX(y,q,x) -- y or q first (that's as far as the Algorithm goes), then the other two fields afterwards.
The speedup you get might be minor, or it might be spectacular; it is hard to predict.
⚈ It is not wise to build an index with lots of columns. Let's cut it off at 5 (Rule of Thumb).
⚈ Prefix indexes cannot 'cover', so don't use them anywhere in a 'covering' index.
⚈ There are limits (3KB?) on how 'wide' an index can be, so "covering" may not be possible.
InnoDB includes the PK columns' values in each secondary key's BTree so that it can locate the entire row in the data's BTree. The PK columns are tacked on the end of the secondary key, but (I am pretty sure) they are not unnecessarily duplicated. (MyISAM does not do this; instead, it uses a pointer into the .MYD file.)
Ref manual definition
More on Covering
INDEX(a,b) can find anything that INDEX(a) could find. So you don't need both. Get rid of the shorter one. There are even cases where the Optimizer will pick the 'shorter' index, thereby leading to poorer performance.
If you have lots of SELECTs and they generate lots of INDEXes, this may cause a different problem.
Each index must be updated (sooner or later) for each INSERT. More indexes ⇒ slower INSERTs.
Limit the number of indexes on a table to about 6 (Rule of Thumb).
Notice in this cookbook how it says "in any order" in a few places.
If, for example, you have both of these (in different SELECTs):
⚈ WHERE a=1 AND b=2 begs for either INDEX(a,b) or INDEX(b,a)
⚈ WHERE a>1 AND b=2 begs only for INDEX(b,a)
Include only INDEX(b,a) since it handles both cases with only one INDEX.
Suppose you have a lot of indexes, including (a,b,c,dd) and (a,b,c,ee).
Those are getting rather long. Consider either picking one of them, or having simply (a,b,c).
Sometimes the selectivity of (a,b,c) is so good that tacking on 'dd' or 'ee' does make enough difference to matter.
Optimizer picks ORDER BY
The main cookbook skips over an important optimization that is sometimes used.
The optimizer will sometimes ignore the WHERE and, instead, use an INDEX that matches the ORDER BY.
This, of course, needs to be a perfect match -- all columns, in the same order. And ASC/DESC caveats handled.
This becomes especially beneficial if there is a LIMIT.
But there is a problem. There could be two situations, and the Optimizer is sometimes not smart enough to see which INDEX would be better:
⚈ If the WHERE does very little filtering, fetching the rows in ORDER BY order avoids a sort and has little wasted effort (because of 'little filtering'). Using the INDEX matching the ORDER BY is better in this case.
⚈ If the WHERE does a lot of filtering, the ORDER BY is wasting a lot of time fetching rows only to filter them out. Using an INDEX matching the WHERE clause is better.
What should you do? If you think the "little filtering" is likely, then create an index with the ORDER BY columns in order
and hope that the Optimizer uses it when it should.
⚈ WHERE a=1 OR a=2 -- This is turned into WHERE a IN (1,2) and optimized that way.
⚈ WHERE a=1 OR b=2 usually cannot be optimized.
⚈ WHERE x.a=1 OR y.b=2 This is even worse because of using two different tables.
A workaround is to use UNION. Each part of the UNION is optimized separately. For the second case:
( SELECT ... WHERE a=1 ) -- and have INDEX(a)
UNION DISTINCT -- "DISTINCT" is assuming you need to get rid of dups
( SELECT ... WHERE b=2 ) -- and have INDEX(b)
GROUP BY ... ORDER BY ... -- whatever you had at the end of the original query
Now the query can take good advantage of two different indexes.
Note: "Index merge" might kick in on the original query,
but it is not necessarily any faster than the UNION.
Sister blog on compound indexes, including 'Index Merge'
The third case (OR across 2 tables) is similar to the second.
If you originally had a LIMIT, UNION gets complicated.
If you started with ORDER BY z LIMIT 190, 10, then the UNION needs to be
( SELECT ... ORDER BY ... LIMIT 200 ) -- effectively OFFSET 0, LIMIT 190+10
UNION DISTINCT -- (or ALL)
( SELECT ... ORDER BY ... LIMIT 200 )
ORDER BY ...
LIMIT 190, 10 -- Same as originally
In old versions of MySQL, UNION always created a temp table.
Newer versions (MySQL 5.7.3 / MariaDB 10.1) can avoid the temp table in selected situations of UNION ALL.
stackoverflow Example of OR to UNION
Another discussion of UNION+LIMIT
TEXT / BLOB
You cannot directly index a TEXT or BLOB or large VARCHAR or large BINARY column.
However, you can use a "prefix" index: INDEX(foo(20)).
This says to index the first 20 characters of foo.
But... It is rarely useful.
Examples of a prefix index:
The index for me would contain 'Ja', 'Rick'. That's not useful for distinguishing between
'Jamison', 'Jackson', 'James', etc.
You should probably never do UNIQUE(foo(20)) because this applies a uniqueness constraint on the first 20 characters of the column,
not the whole column!
More on prefix indexing
DATE, DATETIME, etc. are tricky to compare against.
Some tempting, but inefficient, techniques:
date_col LIKE '2016-01%' -- must convert date_col to a string, so acts like a function
LEFT(date_col, 7) = '2016-01' -- hiding the column in function
DATE(date_col) = 2016 -- hiding the column in function
All must do a full scan.
(On the other hand, it can handy to use GROUP BY LEFT(date_col, 7) for monthly grouping, but that is not an INDEX issue.)
This is efficient, and can use an index:
date_col >= '2016-01-01'
AND date_col < '2016-01-01' + INTERVAL 3 MONTH
This case works because both right-hand values are converted to constants, then it is a "range".
I like the design pattern with INTERVAL because it avoids computing the last day of the month.
And it avoids tacking on '23:59:59', which is wrong if you have microsecond times.
(And other cases.)
Hiding in a function call is called not "sargable":
Wikipedia on Sargable
Perform EXPLAIN SELECT... (and EXPLAIN FORMAT=JSON SELECT... if you have 5.6.5).
Look at the Key that it chose, and the Key_len. From those you can deduce how many columns of the index
are being used for filtering. (JSON makes it easier to get the answer.)
From that you can decide whether it is using as much of the INDEX as you thought.
Caveat: Key_len only covers the WHERE part of the action; the non-JSON output won't easily say whether GROUP BY or ORDER BY was handled by the index.
Tip: If the index has just an INT and Key_len says 5, that means the extra byte implies that it was NULL. Perhaps it should have been NOT NULL?
IN (1,99,3) is sometimes optimized as efficiently as "=", but not always.
Older versions of MySQL did not optimize it as well as newer versions. (5.6 is possibly the main turning point.)
IN ( SELECT ... )
From version 4.1 through 5.5, IN ( SELECT ... ) was very poorly optimized.
The SELECT was effectively re-evaluated every time.
Often it can be transformed into a JOIN, which works much faster.
Heres is a pattern to follow:
AND x IN (
JOIN b USING(x)
The SELECT expressions will need "a." prefixing the column names.
Alas, there are cases where the pattern is hard to follow.
5.6 does some optimizing, but probably not as good as the JOIN.
If there is a JOIN or GROUP BY or ORDER BY LIMIT in the subquery, that complicates the JOIN in new format.
So, it might be better to use this pattern:
AND x IN ( SELECT x FROM ... );
JOIN ( SELECT x FROM ... ) b
Caveat: If you end up with two subqueries JOINed together, note that neither has any indexes, hence performance
can be very bad. (5.6 improves on it by dynamically creating indexes for subqueries.)
MariaDB and Oracle 5.7 have improved in relation to "NOT IN", "NOT EXISTS", and "LEFT JOIN..IS NULL";
here is an
old discussion on the topic
So, what I say there may not be the final word.
Further optimizations involve turning IN into EXISTS or vice versa. So, IN ( SELECT ... ) used to be terrible; now it is "maybe".
Index Condition Pushdown (ICP)
In EXPLAINs, it will be indicated by "Using index condition". Note: This is different than just "Using index".
MySQL has two layers of code -- the "Handler" that handles engine-independent operations, and the Engine layer. Originally, most optimization code was in the Handler layer. That included WHERE tests that were not optimized out by an INDEX. This implied that the Engine layer had to return rows to the Handler layer, only to have it someties toss them. Beginning with MySQL 5.6, InnoDB took on that task. This made some queries a little faster. (I doubt if a query is made more than twice as fast via ICP.)
An example is a WHERE with 2 ranges.
WHERE latitude BETWEEN ... AND ...
AND longitude BETWEEN ... AND ...
Each index can filter only on latitude. However the test for longitude is handled differently depending on ICP. Without ICP (eg, the first index), the rows would be bounced up to the Handler for secondary filtering on longitude. With ICP (the composite index), InnoDB can do that extra test enough faster to make it worth doing.
If you declare both indexes, the Optimizer seems to pick the 1-column index, thereby failing to get the ICP advantage, so it runs as slow as if ICP were turned off. Bottom line: For this example, have only the composite index.
Actually you should also have these for the particular application, since the Optimizer will use statistics to pick between them, to good effect.)
More discussion about Bounding boxes and "find nearest"
When you have a JOIN and a GROUP BY, you may have the situation where
the JOIN exploded more rows than the original query (due to 1:many or many:many), but you wanted
only one row from the original table, so you added the GROUP BY to implode back to
the desired set of rows.
This explode + implode, especially because of the large temp table, is costly.
It would be better to avoid them if possible.
Sometimes the following will work.
Using GROUP_CONCAT to isolate the explosion:
JOIN b USING(x)
GROUP BY a.id
( SELECT GROUP_CONCAT(b.y) FROM b WHERE b.x = a.x ) AS ys
First find the ids, then do the rest. This avoids the GROUP BY.
Many-to-Many Mapping table
Do it this way.
CREATE TABLE XtoY (
# No surrogate id for this table
x_id MEDIUMINT UNSIGNED NOT NULL, -- For JOINing to one table
y_id MEDIUMINT UNSIGNED NOT NULL, -- For JOINing to the other table
# Include other fields specific to the 'relation'
PRIMARY KEY(x_id, y_id), -- When starting with X
INDEX (y_id, x_id) -- When starting with Y
⚈ Do not have an AUTO_INCREMENT id for this table -- The PK given is the 'natural' PK; there is no good reason for a surrogate.
⚈ MEDIUMINT -- This is a reminder that all INTs should be made as small as is safe (smaller ⇒ faster). Of course the declaration here must match the definition in the table being linked to.
⚈ UNSIGNED -- Nearly all INTs may as well be declared non-negative
⚈ NOT NULL -- Well, that's true, isn't it?
⚈ InnoDB -- More effecient than MyISAM because of the way the PRIMARY KEY is clustered with the data in InnoDB.
⚈ INDEX(y_id, x_id) -- The PRIMARY KEY makes it efficient to go one direction; this index makes the other direction efficient. No need to say UNIQUE; that would be extra effort on INSERTs.
⚈ In the secondary index, saying just INDEX(y_id) would work because it would implicit include x_id. But I would rather make it more obvious that I am hoping for a 'covering' index. (Prior to 5.6.9, you need to include both ids to get the performance. cf use_index_extensions.)
To conditionally INSERT new links, use
Note that if you had an AUTO_INCREMENT in this table, IODKU would "burn" ids quite rapidly.
Another discussion of many:many
About 20 names for this type of table
Associative Entity, Association table, Bridge table, Cross-reference table, Crosswalk, Intermediary table, Intersection table, Join table, Junction table, Link table, Linking table, Many-to-many resolver, Map table, Mapping table, Pairing table, Relationship table, Transition table
If one of the two tables is just an id and string, consider getting rid of that table. In which case, the many:many table can be simplified; see
Philosophy of many:many
"This solution improved my postmeta query time from 32 seconds to 4.9 seconds. – osemec" -- see
Why are references to wp_postmeta so slow?
A simple example of the mapping table
Posts and categories (example)
WordPress Plugin to improve wp_postmeta's indexes
Example of converting to new schema
Other Primary Key patterns
Here are some common patterns. (assuming id is AUTO_INCREMENT`)
A uuid is very random, hence this may be optimal:
PRIMARY KEY (id),
Sometimes the artificial id is beneficial, but this can let you use the PK for "locality of reference":
PRIMARY KEY(user_id, id), -- cluster together all the users stuff
INDEX(id) -- sufficient to heep AUTO_INCREMENT happy
UNIQUE vs PRIMARY
Subqueries and UNIONs
Each subquery SELECT and each SELECT in a UNION can be considered separately for finding the optimal INDEX.
Exception: In a "correlated" ("dependent") subquery, the part of the WHERE that depends on the outside table
is not easily factored into the INDEX generation. (Cop out!)
Everything in this section should be qualified with "usually". There are many exceptions; I will cover the common things.
WHERE versus ON
Use ON for saying how the tables are "related". Use WHERE for "filtering", by specifying which rows to keep.
For human consumption, please put the conditions in WHERE or ON according to that rule.
The Optimizer sees clauses in WHERE and ON as equivalent except in the case of LEFT JOIN.
Nested Loop Join
The Optimizer almost always looks through one table first. Then, for each row that might be used in the 'first' table, it looks up the matching row in the 'next' table. And so on for each other table. This is called "NLJ" or "Nested Loop Join".
In rare cases it will do an "Index merge". This is where it scans two tables, fetching the Primary keys, then ANDs or ORs them to decide which rows to actually fetch.
Another rare case involves fetching an entire table into an in-RAM has. In some cases, this runs faster; in some case it runs slower; it is hard to predict. It is mostly controlled by the setting join_buffer_size, which should not be set bigger than 1% of RAM.
The 'first' table
The Optimizer is free to rearrange the order of tables as long as it will get the same resultset. You can force an order via STRAIGHT_JOIN, but this should be left as a last resort.
Novices (and some 3rd party packages) blindly use LEFT JOIN even when JOIN (aka INNER JOIN) is intended. The Optimizer usually recognizes when LEFT is improper and turns it into a plain JOIN. It is better for anyone reading the query to be told the intent.
Some Rules of Thumb on how the tables will be ordered:
⚈ STRAIGHT_JOIN forces the order. (Try not to use this -- for the same reasons that I argue against FORCE INDEX.)
⚈ A table or derived table that obviously delivers one row will be first.
⚈ Table(s) being filtered by a WHERE will come before this not being filtered.
⚈ A "derived" table may be performed early.
⚈ The 'righthand' table of a true LEFT JOIN must come later then the lefthand table.
⚈ When two tables are joined, but neither is being filtered (WHERE), the smaller one will be used first.
But this blog is about Indexing! Well, the optimal index for the 'first' table needs to start with the column(s) in the WHERE clause for that table. The optimal index for the non-first tables should start a combination of the column(s) mentioned in both the ON and the WHERE.
There are cases where you cannot predict which table with be 'first', so it is best to provide two optimal indexes -- for being 'first' or not 'first'.
If the query has ORDER BY, then those columns may be used by the index for the the 'first' table. For GROUP BY, see "Explode/Implode".
A "derived table" is a subquery in FROM or JOIN.
FROM a, b WHERE a.x = b.x is the old syntax; it works, but please do not use it.
FROM a JOIN b ON a.x = b.x is the new syntax; please use it.
The first step is to decide what order the optimizer will go through the tables.
If you cannot figure it out, then you may need to be pessimistic and create
two indexes for each table -- one assuming the first table will be used first,
one assiming that it will come later in the table order.
The optimizer usually starts with one table and extracts the data needed from it.
As it finds a useful (that is, matches the WHERE clause, if any) row, it reaches into the 'next' table.
This is called NLJ ("Nested Loop Join").
The process of filtering and reaching to the next table continues through the rest of the tables.
The optimizer usually picks the "first" table based on these hints:
⚈ STRAIGHT_JOIN forces the the table order.
⚈ The WHERE clause limits which rows needed (whether indexed or not). That is, if you are filtering on one table, but not others, it will be used first.
⚈ The table to the "left" in a LEFT JOIN usually comes before the "right" table. (By looking at the table definitions, the optimizer may decide that "LEFT" is irrelevant.)
⚈ The current INDEXes will encourage an order.
Running EXPLAIN tells you the table order that the Optimizer is very likely to use today.
After adding a new INDEX, the optimizer may pick a different table order.
You should anticipate the order changing, guess at what order makes the most sense, and build the INDEXes accordingly.
Then rerun EXPLAIN to see if the Optimizer's brain was on the same wavelength you were on.
You should build the INDEX for the "first" table based on any parts of the WHERE, GROUP BY, and ORDER BY
clauses that are relevant to it. If a GROUP/ORDER BY mentions a different table, you should ignore that clause.
The second (and subsequent) table will be reached into based on the ON clause.
(Instead of using commajoin, please write JOINs with the JOIN keyword and ON clause!)
In addition, there could be parts of the WHERE clause that are relevant.
GROUP/ORDER BY are not to be considered in writing the optimal INDEX for subsequent tables.
A JOIN example
Several rules for creating INDEXes for JOINs
FROM ( SELECT ... FROM a ... ) AS x
JOIN b ON ...
The subquery on a is called a "derived table". It will be computed first before considering the JOINed tables. So, Provide the optimial index for a, then assume it is the "first" table in the JOIN.
FROM ( SELECT ... FROM a ... ) AS x
JOIN ( SELECT ... FROM b ... ) AS y ON ...
If one of those derived tables delivers a single row (and the Optimizer can figure that out) and it will be computed first. But if both derived tables return multiple rows, there is a problem. There are no indexes on them. In older MySQL versions this led to repeated full table scans. In newer versions, the Optimizer goes to the extra effort (and time) to figure out what the optimal index is for the temp table where it stored the results of the subquery, and makes an index on the fly. In EXPLAIN you can see this via auto-key.
If a derived table has a GROUP BY or LIMIT, then it may be worth having.
If a derived table does not have a GROUP BY or LIMIT, it is usually worth trying to turn it into a JOIN at the outer level.
PARTITIONing is rarely a substitute for a good INDEX.
PARTITION BY RANGE is a technique that is sometimes useful when indexing fails to be good enough.
In a two-dimensional situation such as nearness in a geographical sense, one dimension can partially be handled by
partition pruning; then the other dimension can be handled by a regular index (preferrably the PRIMARY KEY).
This goes into more detail:
Find nearest 10 pizza parlors
The optimal INDEXes for a partitioned table are different. I elaborate here:
FULLTEXT is now implemented in InnoDB as well as MyISAM.
It provides a way to search for "words" in TEXT columns.
This is much faster (when it is applicable) than col LIKE '%word%'.
WHERE x = 1
AND MATCH (...) AGAINST (...)
always(?) uses the FULLTEXT index first. That is, the whole Algorithm is invalidated
when one of the ANDs is MATCH..AGAINST.
AGAINST('+mandatory +words' IN BOOLEAN MODE) -- for requiring word(s).
AGAINST('+model T +Ford' IN BOOLEAN MODE) -- Don't put a + on a "word" that will be ignored due to shortness.
WHERE MATCH (car) AGAINST ('+model T +Ford' IN BOOLEAN MODE)
AND car LIKE '%model T Ford%'
to avoid "model A Ford", but still get good performance. The FT search will come first and find both cars; then the LIKE need check only those few rows found.
A suggestion on FT's own id
Comparison of Fulltext with alternatives
Search multiple columns for a substring
A common anti-pattern is to allow the user to specify a "string" and then search for it in all the likely columns:
WHERE company LIKE '%string%'
OR product LIKE '%string%'
OR id LIKE '%string%' ...
This requires reading all the rows and testing all the specified columns for "string". Very slow. I'll discuss some alternatives:
Plan A: Use FULLTEXT.
FULLTEXT(company, product, id, ...)
MATCH(company, product, id, ...) AGAINST("+string" IN BOOLEAN MODE)
provides a much faster way to do the search. But it comes with some caveats:
⚈ FULLTEXT as some limitations (length, frequency, etc) on what "words" are allowed.
⚈ The + in front of each word says it is mandatory, but don't put it in front of a word that is too short.
⚈ If there are other tests in the WHERE, they will be handled after isolating the rows that match. (This is usually 'good'.)
Plan B: FULLTEXT on a combined column.
Here, I suggest an extra column that contains all the words from all the columns the user might need to be searching. Then have a single FULLTEXT index on that column.
⚈ Similar performance, etc, to Plan A.
⚈ Gives you a chance to "cleanse" the word list before blindly throwing it into the index.
Plan C: Plan B, but in a separate table.
⚈ The column (plus the PRIMARY KEY of the main table) might be the only columns in this table.
⚈ Not much advantage, except for constructing this "after the fact", when it could be quite costly to add a column and an FT index to your huge table.
Plan D: Hybrid between one of those plans, but extra tests.
WHERE MATCH(...) AGAINST(...)
AND something else to further whittle down the number of rows.
⚈ That AND might include a REGEXP test that could not effectively be handled by the word-oriented FT index.
Plan E: What about searching for something that is not a "word"?
⚈ Searching for punctuation may be possible with suitable tuning of innodb_ft_* variables.
⚈ Punctuation could be avoided in Plan B by "cleansing" the data you put into the extra column. For example: punctuation could be removed and the "word" closed up.
⚈ Your app should look at the user's query, then decide whether to (a) build and execute a FULLTEXT query, or (b) build and execute an old, slow, string of LIKEs, or (c) disallow the search.
Signs of a Newbie
⚈ No "compound" (aka "composite") indexes
⚈ No PRIMARY KEY
⚈ Redundant indexes (especially blatant is PRIMARY KEY(id), KEY(id))
⚈ Most or all columns individually indexes ("But I indexed everything")
⚈ "Commajoin" -- That is FROM a, b WHERE a.x=b.x instead of FROM a JOIN b ON a.x=b.x
Speeding up wp_postmeta
The schema changes mentioned here are automated in
WordPress Plugin to improve wp_postmeta's indexes
The published table (see Wikipedia) is
CREATE TABLE wp_postmeta (
meta_id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
post_id bigint(20) unsigned NOT NULL DEFAULT '0',
meta_key varchar(255) DEFAULT NULL,
PRIMARY KEY (meta_id),
KEY post_id (post_id),
KEY meta_key (meta_key)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Wikipedia - Wordpress schema
⚈ The AUTO_INCREMENT provides no benefit; in fact it slows down most queries (because of having to look in secondary index to find auto_inc id, then looking in data for actual id you need)
⚈ The AUTO_INCREMENT is extra clutter - both on disk and in cache.
⚈ Much better is PRIMARY KEY(post_id, meta_key) -- clustered, handles both parts that are usually in the WHERE.
⚈ BIGINT is overkill, but that can't be fixed without changing other tables.
⚈ VARCHAR(255) can be a problem in 5.6 with utf8mb4; see workarounds below.
⚈ When would meta_key or meta_value ever be NULL?
CREATE TABLE wp_postmeta (
post_id BIGINT UNSIGNED NOT NULL,
meta_key VARCHAR(255) NOT NULL,
meta_value LONGTEXT NOT NULL,
PRIMARY KEY(post_id, meta_key),
The typical usage:
JOIN wp_postmeta AS m ON p.id = m.post_id
WHERE m.meta_key = '...'
⚈ The composite PRIMARY KEY goes straight to the desired row, no digression through secondary index, nor search through multiple rows.
⚈ INDEX(meta_key) may or may not be useful, depending on what other queries you have.
⚈ InnoDB is required for the 'clustering'.
⚈ Going forward, use utf8mb4, not utf8. However, you should be consistent across all WP tables and in your connection parameters.
The error "max key length is 767", which can happen in MySQL 5.5 and 5.6 when
trying to use CHARACTER SET utf8mb4.
Do one of the following (each has a drawback) to avoid the error:
⚈ Upgrade to 5.7.7 for 3072 byte limit
⚈ Change 255 to 191 on the VARCHAR -- you lose any values longer than 191 characters (but you may not have any);
⚈ ALTER .. CONVERT TO utf8 -- you lose Emoji and some of Chinese;
⚈ Use a "prefix" index -- you lose some of the performance benefits;
⚈ Reconfigure (if staying with 5.6.3 - 5.7.6) -- 4 things to change: Barracuda + innodb_file_per_table + innodb_large_prefix + dynamic or compressed.
Note: In later versions, these are deprecated (5.7 and 10.2?) because they are no longer necessary: innodb_file_format, innodb_file_format_check, innodb_file_format_max and innodb_large_prefix. If you get an error to that effect, simply don't bother setting them. They are actually removed in 8.0 / 10.4(?)
If you need the ability to have multiple meta keys with the same key name for one post, then use this solution. It is nearly as good as the above suggestion.
CREATE TABLE wp_postmeta (
meta_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
post_id BIGINT UNSIGNED NOT NULL,
meta_key VARCHAR(255) NOT NULL,
meta_value LONGTEXT NOT NULL,
PRIMARY KEY(post_id, meta_key, meta_id), -- to allow dup meta_key for a post
INDEX(meta_id), -- to keep AUTO_INCREMENT happy
WordPress forum - on slow postmeta
The ALTER to achieve the desired changes:
To see whether you currently have any dup keys for any post:
SELECT post_id, meta_key, GROUP_CONCAT(meta_id)
GROUP BY post_id, meta_key
HAVING COUNT(*) > 1
This removes meta_id, which is probably useless and definitely a performance burden:
ALTER TABLE wp_postmeta
DROP PRIMARY KEY,
DROP COLUMN meta_id,
DROP INDEX post_id,
ADD PRIMARY KEY(post_id, meta_key); -- does not allow dup meta_key for a post
This is a compromise -- it keeps meta_id, but makes most of the other performance improvements:
ALTER TABLE wp_postmeta
DROP PRIMARY KEY,
DROP INDEX post_id,
ADD PRIMARY KEY(post_id, meta_key, meta_id), -- to allow dup meta_key for a post
ADD INDEX(meta_id); -- to keep AUTO_INCREMENT happy
Consider changing wp_commentmeta, wp_sitemeta, wp_termmeta, and wp_usermeta in similar ways.
Another WP tip: If you often have self-joins to test multiple meta attributes, decrease the value of optimizer_search_depth to, say, 4.
An example, with ALTER
The EXPLAIN command helps give you some insight into what is going on, but fails to direct you towrard betters indexing or query formulation. Anyway, here are the options:
1. EXPLAIN is quite limited. It does not handle LIMIT very well, nor does it necessarily say which step uses the filesort, or even if there are multiple filesorts. In general, the "Rows" column is useful, but in certain situations, it is useless. Simple rule: A big number is probably a bad sign.
2. EXPLAIN EXTENDED + SHOW WARNINGS; provides the rewritten version of the query. This does not do a lot. It does give clues of the distinction between ON and WHERE in JOINs. (In newer versions EXTENDED is not needed; you still get the WARNING.)
3. EXPLAIN FORMAT=JSON provide more details into the cost-based analysis and spells out various steps, including filesorts.
4. "Optimizer trace" goes a bit further. (It is rather tedious to read.) See next section.
This is available beginning with 5.6.3.
SELECT ... -- Your query
SELECT * FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
This last thing will produce a "table" with 4 columns. Here is some PHP code to handle that if you get an array of arrays:
foreach ($opt_traces as $opt_trace)
list($query, $trace, $mbbmms, $priv) = $opt_trace;
echo "Optimizer Trace:<pre>$trace</pre>\n"; // This will be JSON.
echo "$mbbmms bytes missing beyond max_mem_size<br>\n";
echo "Insufficient privileges for optimizer trace<br>\n";
For even a moderately complex SELECT, you may exceed the current setting of optimizer_trace_max_mem_size; Set it higher in the initial sets if needed.
The "Rows" of EXPLAIN gives an estimate of the number of rows touched. Sometimes this is a terrible estimate. There is a way to get the exact number of rows: SHOW SESSION STATUS LIKE 'Handler%';. In some situations, this is an excellent way to get a feel for a specific query, or to compare two similar queries/indexes/etc.
FLUSH STATUS; -- zero out most of SESSION STATUS (non existent on really old versions of MySQL)
SELECT ...; -- run your query
SHOW SESSION STATUS LIKE 'Handler%';
Sum up the Handler_read_% values to get what is perhaps an exact count of how many rows (index and/or table) are touched in peforming the SELECT. If there are non-zero values for Handler_write_%, then temp table(s) were created for this query.
(Note: Using the equivalent SELECT ... FROM information_schema.session_status is ineffective since it includes some of this select itself.)
Example of Handler Counts
MySQL 8.0.18 introduced EXPLAIN ANALYZE and EXPLAIN ANALYZE FORMAT=JSON as a way to get more info on a SELECT.
Here's a possible implementation of a "sequence".
CREATE TABLE seq (
id INT UNSIGNED AUTO_INCREMENT NOT NULL,
code CHAR(1) NOT NULL,
) ENGINE=InnoDB COMMENT 'An implementation of a SEQUENCE';
INSERT INTO seq (code) VALUE ('a'); -- initialize only
INSERT INTO seq (code) VALUE ('a')
ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id+1);
SELECT LAST_INSERT_ID(); -- to retrieve the new number:
⚈ This allows multiple sequences, based on code.
⚈ The nums will be monotonically increasing (never dup, never decrease).
⚈ It will provide consecutive numbers (no gaps).
⚈ It will not back-fill.
⚈ It will survive crashes; deletes from your table; etc.
⚈ If you get a new number inside a transaction, replication could generate the ids not in chronological order.
⚈ If you get a new number outside a transaction (autocommit), you could 'burn' the number (create a gap) if certain failures occur.
In MariaDB 10.0+, see
SELECT * FROM seq_1_to_5;
MariaDB's Sequence engine for generating numbers
SELECT '2019-01-01' + INTERVAL seq-1 DAY FROM seq_1_to_31;
| '2019-01-01' + INTERVAL seq-1 DAY |
| 2019-01-01 |
| 2019-01-02 |
| 2019-01-03 |
| 2019-01-04 |
| 2019-01-05 |
| 2019-01-06 |
⚈ "Index merge intersection" is perhaps always not as good as a composite index.
⚈ "Index merge union" is almost never used. You might be able to turn OR into UNION effectively.
⚈ Something often forgotten about (but does show up as an unexplained large "Rows"): COLLATIONs must match. This may show up when JOINing on a string column but failing to use the same CHARACTER SET and/or COLLATION for the columns in both tables.
⚈ A trick to make use of the PK clustering: PRIMARY KEY(foo, id), INDEX(id). This comes in handy when many queries say WHERE foo = ... but need id for uniqueness. A table too big for the buffer_pool benefits especially.
⚈ FORCE INDEX is handy for experimenting, but almost always a bad idea for production. "It may help today, but the data will change tomorrow and make things worse."
⚈ When are a pair of UNIQUE/non-UNIQUE indexes redundant? See
this stackoverflow answer
⚈ There are few cases where a table should have multiple UNIQUE (or PRIMARY) keys. A common, simple, one that is valid is a normalization table that maps between an auto_increment id and a name.
⚈ WHERE + ORDER BY + LIMIT example
⚈ Iterating through a composite index (Continuing where you left off)
⚈ WHERE varchar_col = 1234 (string compared to literal number) -- This is slow because each string is converted to a number to test. Either quote the number or change the column type.
Ref manual on type conversions
Initial posting: March, 2015; Refreshed: Feb, 2016; Add DATE: June, 2016;
WP example: May and Sep, 2017;
Explain, Opt trace, and Miscellany: Jan, 2019
ICP & derived tables & refresh: May, 2019
Sequence: Aug, 2019
Except as noted, the tips in this document apply to MySQL, MariaDB, and Percona.
Some of the optimizations depend on the way InnoDB 'clusters' the PRIMARY KEY with the data.
Since this is not the case for some other Engines and Vendors, some of the tips may not apply elsewhere.
Understanding composite index
Percona 2015 Tutorial Slides
Sheeri Cabral's Index talk
Sheeri's EXPLAIN talk
Bill Karwin on designing indexes
Some info in the manual:
ORDER BY Optimization
A short, but complicated,
Manual page on range accesses in composite indexes
Some discussion of JOIN
Detecting whether an index is needed
This blog is the consolidation of a Percona tutorial I gave in 2013, plus many years of experience in
fixing thousands of slow queries on hundreds of systems.
I appologize that this does not tell you how create INDEXes for all SELECTs. Some are just too complex.
Indexing 101: Optimizing MySQL queries on a single table
(Stephane Combaudon - Percona)
A complex query, well explained.
What does Using filesort mean in MySQL?
JOIN - with step-by-step explanation
Do not start index with a 'range'
Natural PK vs AUTO_INCREMENT debate
3-hour slides from Jynus
Which to pick - WHERE or ORDER BY
A moderately involved example
- Also see the blogs referred to.
with good pictures on how a BTree works, the order of columns in an index, etc.
Discuss step 2b
Temporary tables used in indexing
Why we can't have more than one inequality condition in MySQL indexing?
Examples of when OR, LIKE, etc is / isn't optimizable:
Sargability of various tests
Why does MySQL not always use index for select query?
Some duplicate index examples
Some more advanced topics; includes diagrams of how indexes are laid out:
Which to use? a range or an ORDER BY
Order of joins; indexing
Covering versus Leftmost
A specific 3-column case with LIMIT
[[https://stackoverflow.com/questions/24315151/does-order-of-fields-of-multi-column-index-in-mysql-matter][Order of fields]] (last_name, first_name discussion)
Indexing 101 and 102
Ref man on cluster vs secondary
Comparing Full table/index scans
Example of combining composite indexes
Discussion of the 20% RoT
Nice, basic, discussion of Indexing
Matching the ORDER BY to the WHERE lets the LIMIT be optimized:
Example of JOINs and discussion of Explode-Implode
Nice discussion of composite + covering example
Discussion of some of the subtle points above
Auto_inc vs natural PK
Examples of rows/block and indexes
A deteailed discussion of one case
[[https://www.percona.com/blog/mysql-8-0-functional-indexes/][Functional indexes (8.0.13) and Generated columns (5.7)]
Book DB - indexing, normalizing
Joins, Left Join, derived tables, advice
Examples of Leftmost (ordering columns in an INDEX)
-- Rick James
MySQL Documents by Rick James
HowTo Techniques for Optimizing Tough Tasks:
Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Chunking lengthy DELETE/UPDATE/etc.
Data Warehouse techniques:
Data Warehouse Overview
High speed ingestion
Schema and code design for large Sensor database
Entity-Attribute-Value (EAV) -- a common, poorly performing, design pattern; plus an alternative
5 methods for 'Find Nearest'
Lat/Lng search to Find the nearest 10 pizza parlors
Lat/Long representation choices
Z-Order 'find nearest'
Pagination, not with OFFSET, LIMIT
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
Rollup Unique User Counts
Alter of a Huge table -- Mostly obviated by 5.6
Efficient List of Latest 10 news articles
Build and execute a Pivot SELECT (showing rows as columns)
(Groupwise Max): Efficiently find largest row(s) for each group
Other Tips, Tuning, Debugging, Optimizations, etc...
Rick's RoTs (Rules of Thumb -- lots of tips)
Datatypes and building a good schema
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Trouble with UTF-8
If you want case folding, but accent sensitivity, please file a request at https://bugs.mysql.com .
other language tips
utf8mb4 Collations on 8.0
Converting from MyISAM to InnoDB -- includes differences between them
Compound INDEXes plus other insights into the mysteries of INDEXing
Cookbook for Creating Indexes
Many-to-many mapping table
MySQL Limits -- built-in hard limits
767-byte INDEX limit
Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Analyze MySQL Performance
Analyze VARIABLEs and GLOBAL STATUS
My slides from conferences
MiniFest 2021 - Rick James & Daniel Black - Answering on Stack Overflow(+comments) - MariaDB Frontlines
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
Contact me via LinkedIn; be sure to include a brief teaser in the Invite request:
Did my articles help you out? Like what you see? Consider donating: