PARTITION Maintenance in MySQL

Table of Contents

First, my Opinions on PARTITIONing
Use Cases for PARTITIONing
AUTO_INCREMENT in PARTITION
The Maintenance Situation
The Problem
High Level View of the Code
Why?
When to do the ALTERs?
Variants
Detailed code
Postlog
Brought to you by Rick James


First, my Opinions on PARTITIONing


Taken from
Rick's RoTs - Rules of Thumb for MySQL
    ⚈  #1: Don't use PARTITIONing until you know how and why it will help.
    ⚈  Don't use PARTITION unless you will have >1M rows
    ⚈  No more than 50 PARTITIONs on a table (open, show table status, etc, are impacted) (fixed in 5.6.6?)
    ⚈  PARTITION BY RANGE is the only useful method.
    ⚈  SUBPARTITIONs are not useful.
    ⚈  The partition field should not be the field first in any key.
    ⚈  It is OK to have an AUTO_INCREMENT as the first part of a compound key, or in a non-UNIQUE index.

It is so tempting to believe that PARTITIONing will solve performance problems. But it is so often wrong.

PARTITIONing splits up one table into several smaller tables. But table size is rarely a performance issue. Instead, I/O time and indexes are the issues.

An AUTO_INCREMENT column must be the first column in some index. (That lets the engine find the 'next' value when opening the table.) It does not have to be the only field, nor does it have to be PRIMARY or UNIQUE. If it is not UNIQUE, you _could_ INSERT a duplicate id if you _explicitly_ provide the number.

Use Cases for PARTITIONing


Use case #1. Perhaps the most common use case where PARTITIONing shines is the the dataset where "old" data is deleted from the table. RANGE PARTITIONing by day (or other unit of time) lets you do a nearly instantaneous DROP PARTITION plus REORGANIZE PARTITION instead of a much slower DELETE. Much of this blog is focused on this use case. This use case is also discussed in
Big DELETEs

Use case #2. INDEXes are inherently one-dimensional. Finding the nearest 10 pizza parlors on a map needs a 2D index. Partition pruning sort of gives a second dimension. See Latitude/Longitude Indexing

Use case #3. This is a bit complicated to explain. Given this combination:
    ⚈  A table's index is too big to be cached, but the index for one partition is cacheable
    ⚈  The index is randomly accessed
    ⚈  Data ingestion would normally be I/O bound due to updating the index
Partitioning can keep all the index "hot" in RAM, thereby avoiding a lot of I/O.

Use case #4. Using EXPORT/IMPORT partition for quickly archiving or importing data. (IMPORTing could be tricky because of the partition key.) See also Transportable Tablespaces for InnoDB Partitions
For now, stick to the section "But how to do this in 5.6?"

Note that almost always, these use cases involve RANGE partitioning, not the other forms.

AUTO_INCREMENT in PARTITION


    ⚈  For AUTO_INCREMENT to work (in any table), it must be the first field in some index. Period.
    ⚈  AUTO_INCREMENT need not be UNIQUE. What you lose: prevention of explicitly inserting a duplicate id. (This is rarely needed, anyway.)
Examples (where id is AUTO_INCREMENT):
    ⚈  PRIMARY KEY (...), INDEX(id)
    ⚈  PRIMARY KEY (...), UNIQUE(id, partition_key) -- not useful
    ⚈  INDEX(id), INDEX(...) (but no UNIQUE keys)
    ⚈  PRIMARY KEY(id), ... -- works only if id is the partition key (not very useful)

The Maintenance Situation


You have a large table that is growing on one end and being pruned on the other. Examples include news, logs, and other transient information. PARTITION BY RANGE is an excellent vehicle for such a table.
    ⚈  DROP PARTITION is much faster than DELETE. (This is the big reason for doing this.)
    ⚈  Queries often limit themselves to 'recent' data, thereby taking advantage of "partition pruning".

Depending on the type of data, and how long before it expires, you might have daily or weekly or hourly (etc) partitions.

The Problem


There is no simple SQL statement to "drop partitions older than 30 days" or "add a new partition for tomorrow". It would be tedious to do this by hand every day.

High Level View of the Code

ALTER TABLE tbl
    DROP PARTITION from20120314;
ALTER TABLE tbl
    REORGANIZE PARTITION future INTO
        from20120415 VALUES LESS THAN (TO_DAYS('2012-04-16')),
        future    VALUES LESS THAN MAXVALUE;
After which you have...
    CREATE TABLE tbl (
        dt DATETIME NOT NULL,  -- or DATE
        ...
        PRIMARY KEY (..., dt),
        UNIQUE KEY (..., dt),
        ...
    )
    PARTITION BY RANGE (TO_DAYS(dt)) (
        start     VALUES LESS THAN (0),
        from20120315 VALUES LESS THAN (TO_DAYS('2012-03-16')),
        from20120316 VALUES LESS THAN (TO_DAYS('2012-03-17')),
        ...
        from20120414 VALUES LESS THAN (TO_DAYS('2012-04-15')),
        from20120415 VALUES LESS THAN (TO_DAYS('2012-04-16')),
        future       VALUES LESS THAN MAXVALUE
    );

Why?


Perhaps you noticed some odd things in the example. Let me explain them.
    ⚈  Partition naming: Make them useful.
    ⚈  from20120415 ... 04-16: Note that the LESS THAN is the next day's date
    ⚈  The "start" partition: See paragraph below.
    ⚈  The "future" partition: This is normally empty, but it can catch overflows; more later.
    ⚈  The range key (dt) must be included in any PRIMARY or UNIQUE key.
    ⚈  The range key (dt) should be last in any keys it is in -- You have already "pruned" with it; it is almost useless in the index.
    ⚈  DATETIME, etc -- I picked this datatype because it is typical for a time series.
    ⚈  There is an extra day (03-16 thru 04-16): The latest day is only partially full.

Why the bogus "start" partition? If an invalid datetime (Feb 31) were to be used, the datetime would turn into NULL. NULLs are put into the first partition. Since any SELECT could have an invalid date (yeah, this stretching things), the partition pruner always includes the first partition in the resulting set of partitions to search. So, if the SELECT must scan the first partition, it would be slightly more efficient if that partition were empty. Hence the bogus "start" partition.
Longer discussion
5.5 eliminates the bogus check, but only if you switch to a new syntax:
    PARTITION BY RANGE COLUMNS(dt) (
    PARTITION day_20100226 VALUES LESS THAN ('2010-02-27'), ...

More on the "future" partition. Sooner or later the cron/EVENT to add tomorrow's partition will fail to run. The worst that could happen is for tomorrow's data to be lost. The easiest way to prevent that is to have a partition ready to catch it, even if this partition is normally always empty.

Having the "future" partition makes the ADD PARTITION script a little more complex. Instead it needs to take tomorrow's data from "future" and put it into a new partition. This is done with the REORAGNIZE command shown. Normally nothing need be moved, and the ALTER takes virtually zero time.

When to do the ALTERs?

    ⚈  DROP if the oldest partition is "too old".
    ⚈  Add 'tomorrow' near the end of today, but don't try to add it twice.
    ⚈  Do not count partitions -- there are two extra ones. Use the partition names or PARTITION_DESCRIPTION.
    ⚈  DROP/Add only once in the script. Rerun the script if you need more.
    ⚈  Run the script too often. For daily partitions, run the script twice a day, or even hourly. Why? Automatic repair.

Variants


As I have said many times, in many places, BY RANGE is perhaps the only useful variant. And a time series is the most common use for PARTITIONing.

    ⚈  (as discussed here) DATETIME/DATE with TO_DAYS()
    ⚈  DATETIME/DATE with TO_DAYS(), but with 7-day intervals
    ⚈  TIMESTAMP with TO_DAYS(). (version 5.1.43 or later)
    ⚈  PARTITION BY RANGE(TIMESTAMP) (version 5.5.15 / 5.6.3)
    ⚈  PARTITION BY RANGE(TO_SECONDS()) (5.6.0)
    ⚈  INT UNSIGNED with constants computed as unix timestamps.
    ⚈  INT UNSIGNED with constants for some non-time-based series.
    ⚈  MEDIUMINT UNSIGNED containing an "hour id": FLOOR(FROM_UNIXTIME(timestamp) / 3600)
    ⚈  Months, Quarters, etc: Concoct a notation that works.

How many partitions?
    ⚈  Under, say, 5 partitions -- you get very little of the benefits.
    ⚈  Over, say, 50 partitions, and you hit inefficiencies elsewhere.
    ⚈  Certain operations (SHOW TABLE STATUS, opening the table, etc) open every partition.
    ⚈  MyISAM, before version 5.6.6, would lock all partitions before pruning!
    ⚈  Partition pruning does not happen on INSERTs (until Version 5.6.7), so INSERT needs to open all the partitions.
    ⚈  8192 partitions is a hard limit (1024 before 5.6.7).
    ⚈  Before "native partitions", each partition consumed a chunk of memory.

Detailed code


Reference implementation, with demo of daily partitions

Postlog


PARTITIONing requires at least MySQL 5.1

More on PARTITIONing
LinkedIn discussion

Original writing -- Oct, 2012; Use cases added Oct, 2014

Future (as envisioned in 2014):
    ⚈  ~5.7.6 will see "native partitioning for InnoDB".
    ⚈  FOREIGN KEY support, perhaps in a later 5.7.xx.
    ⚈  "GLOBAL INDEX" -- this would avoid the need for putting the partition key in every unique index, but make DROP PARTITION costly. This will be farther into the future.

Native partitioning will give:
    ⚈  This will improve performance slightly by combining 2 "handlers" into 1.
    ⚈  Decreased memory usage, especially when using a large number of partitions.


Contact me by posting a question at MySQL Forums :: Partitions
-- Rick James

Rick's MySQL Documents

MySQL Documents by Rick James

Tips, Debugging, HowTos, Optimizations, etc...

Rick's RoTs (Rules of Thumb -- lots of tips)
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Converting from MyISAM to InnoDB -- includes differences between them
Big DELETEs - how to optimize
Compound INDEXes plus other insights into the mysteries of INDEXing
Partition Maintenance (DROP+REORG) for time series
Entity-Attribute-Value -- a common, poorly performing, design patter; plus an alternative
Find the nearest 10 pizza parlors (efficient searching on Latitude + Longitude)
Alter of a Huge table
Latest 10 news articles -- how to optimize the schema and code for such
Pagination, not with OFFSET, LIMIT
Data Warehouse techniques (esp., Summary Tables)
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance
MySQL Limits
Galera Limitations (with Percona XtraDB Cluster / MariaDB)
Rollup Unique User Counts
Best of MySQL Forum