Preface
This document
discusses techniques for improving performance for DW-like tables in MySQL and MariaDB.
⚈ How to load large tables.
⚈ Normalization.
⚈ Developing 'Summary tables' to make 'reports' efficient.
⚈ Purging old data.
Details on Summary Tables are covered in the companion document:
Summary Tables
Terminology
This list mirrors "Data Warehouse" terminology.
⚈ Fact table -- The one huge table with the 'raw' data.
⚈ Summary table -- a redundant table of summarized data that can be used for efficiency
⚈ Dimension -- columns that identify aspects of the dataset (region, country, user, SKU, zipcode, ...)
⚈ Normalization table (dimension table) -- mapping between strings and ids; used for space and speed.
⚈ Normalization -- The process of building the mapping ('New York City' ⇔ 123)
Fact Table
Techniques that should be applied to the huge Fact table.
⚈ id INT/BIGINT UNSIGNED NOT NULL AUTO_INCREMENT
⚈ PRIMARY KEY (id)
⚈ Probably no other INDEXes
⚈ Accessed only via id
⚈ Most VARCHARs are "normalized"; ids are stored instead
⚈ ENGINE = InnoDB
⚈ All "reports" use a summary tables, not the Fact table
⚈ Summary tables may be populated from ranges of id other techniques described below
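Here is a minimal sketch of a Fact table following those bullets; the sensor-style columns are hypothetical:
    CREATE TABLE Fact (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        dt         DATETIME NOT NULL,              -- when the event happened
        sensor_id  MEDIUMINT UNSIGNED NOT NULL,    -- a normalized id, not the original VARCHAR
        reading    FLOAT NOT NULL,
        PRIMARY KEY (id)                           -- probably no other indexes
    ) ENGINE = InnoDB;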
There are exceptions where the Fact table must be accessed to
retrieve multiple rows.
However, you should minimize the number of INDEXes on the table
because they are likely to be costly on INSERT.
Why keep the Fact table?
Once you have built the Summary table(s), there is not much need for
the Fact table. One option that you should seriously consider
is to not have a Fact table. Or, at least, you could purge old data from it sooner
than you purge the Summary tables.
Maybe even keep the Summary tables forever.
Case 1: You need to find the raw data involved in some event.
But how will you find those row(s)?
This is where a secondary index may be required.
If a secondary index is bigger than can be cached in RAM, and if
the column(s) being indexed is random, then each row inserted may cause a disk
hit to update the index. This limits insert speed to something like 100 rows per second
(on HDD disks). Multiple random indexes slow down insertion further.
RAID striping and/or SSDs speed up insertion. Write caching helps, but only for bursts.
Case 2: You need some event, but you did not plan ahead with the
optimal INDEX.
Well, if the data is PARTITIONed on date, then even if you have only a rough idea
of when the event occurred, "partition pruning" will keep the
query from being too terribly slow.
Case 3: Over time, the application is likely to need new 'reports',
which may lead to a new Summary table. At this point, it would be handy
to scan through the old data to populate the new table.
Case 4: You find a flaw in the summarization, and need to rebuild
an existing Summary table.
Cases 3 and 4 both need the "raw" data. But they don't necessarily
need the data sitting in a database table. It could be in
the pre-database format (such as log files). So, consider not building
the Fact table, but simply keeping the raw data, compressed, on some file system.
Batching the load of the Fact table
When talking about billions of rows in the Fact table, it is essentially
mandatory that you "batch" the inserts. There are two main ways:
⚈ INSERT INTO Fact (.,.,.) VALUES (.,.,.), (.,.,.), ...; -- "Batch insert"
⚈ LOAD DATA ...;
A third way is to INSERT or LOAD into a Staging table, then
⚈ INSERT INTO Fact SELECT * FROM Staging;
This INSERT..SELECT allows you to do other things, such as normalization. More later.
Batched INSERT Statement
Chunk size should usually be 100-1000 rows.
⚈ Batches of 100-1000 rows will run about 10 times as fast as single-row inserts.
⚈ Beyond 100, you may be interfering with replication and SELECTs.
⚈ Beyond 1000, you are into diminishing returns -- virtually no further performance gains.
⚈ Don't go past, say, 1MB for the constructed INSERT statement. This deals with packet sizes, etc. (1MB is unlikely to be hit for a Fact table.)
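For illustration, a batched INSERT looks like this (the table and column names are hypothetical); in practice, extend it to 100-1000 rows per statement:
    INSERT INTO Fact (dt, sensor_id, reading) VALUES
        ('2020-02-01 00:00:01', 17, 62.5),
        ('2020-02-01 00:00:01', 18, 60.1),
        ('2020-02-01 00:00:02', 17, 62.6);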
Decide whether your application should lean toward the 100 or the 1000.
If your data is coming in continually, and you are adding a batching layer,
let's do some math.
Compute your ingestion rate -- R rows per second.
⚈ If R < 10 (= 1M/day = 300M/year) -- single-row INSERTs would probably work fine (that is, batching is optional)
⚈ If R < 100 (3B records per year) -- secondary indexes on Fact table may be ok
⚈ If R < 1000 (100M records/day) -- avoid secondary indexes on Fact table.
⚈ If R > 1000 -- Batching may not work.
Decide how long (S seconds) you can stall loading the data in order
to collect a batch of rows.
⚈ If S < 0.1s -- May not be able to keep up
If batching seems viable, then design the batching layer to gather
for S seconds or 100-1000 rows, whichever comes first.
(Note: Similar math applies to rapid UPDATEs of a table.)
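A worked example (the numbers are illustrative):
    R = 50 rows/second arriving; tolerable delay S = 5 seconds.
    Rows gathered per flush: R * S = 250 rows -- within the 100-1000 sweet spot.
    So flush every 5 seconds, or whenever 250 rows have accumulated, whichever comes first.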
More on
High Speed Ingestion
Normalization (Dimension) Table
Normalization is important in Data Warehouse applications because it significantly
cuts down on the disk footprint, thereby improving performance.
There are other reasons for normalizing, but space
is the important one for DW.
Here is a typical pattern for a Dimension table:
CREATE TABLE Emails (
email_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT, -- don't make bigger than needed
email VARCHAR(...) NOT NULL,
PRIMARY KEY (email), -- for looking up one way
INDEX(email_id) -- for looking up the other way (UNIQUE is not needed)
) ENGINE = InnoDB; -- to get clustering
Notes:
⚈ MEDIUMINT is 3 bytes with UNSIGNED range of 0..16M; pick SMALLINT, INT, etc, based on a conservative estimate of how many ids you will eventually have.
⚈ datatype sizes
⚈ There may be more than one VARCHAR in the table.
⚈ InnoDB is better than MyISAM because of the way the two keys are structured.
⚈ The secondary key is effectively (email_id, email), hence 'covering' for certain queries.
⚈ It is OK to not specify an AUTO_INCREMENT to be UNIQUE.
⚈ Do not normalize "continuous" values such as dates, floats, and integers.
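For illustration, the two lookups against the Emails table above (the email value and id are hypothetical):
    SELECT email_id FROM Emails WHERE email = 'user@example.com';  -- string -> id (uses the PRIMARY KEY)
    SELECT email FROM Emails WHERE email_id = 12345;               -- id -> string (uses INDEX(email_id); it is 'covering')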
Batched Normalization
I bring this up as a separate topic because of some of the subtle issues that can happen.
You may be tempted to do
INSERT IGNORE INTO Foos (foo)
    SELECT DISTINCT foo FROM Staging;   -- not wise
It has the problem of burning AUTO_INCREMENT ids.
This is because MySQL pre-allocates ids before getting to IGNORE.
That could rapidly increase the AUTO_INCREMENT values beyond what you expected.
Better is this...
INSERT IGNORE INTO Foos (foo)
    SELECT DISTINCT Staging.foo
        FROM Staging
        LEFT JOIN Foos ON Foos.foo = Staging.foo
        WHERE Foos.foo_id IS NULL;
Notes:
⚈ The LEFT JOIN .. IS NULL finds the foos that are not yet in Foos.
⚈ This INSERT..SELECT must not be done inside the transaction with the rest of the processing. Otherwise, you add to deadlock risks, leading to burned ids from the rollback.
⚈ IGNORE is used in case you are doing the INSERT from multiple processes simultaneously.
Once that INSERT is done, this will find all the foo_ids it needs:
INSERT INTO Fact (..., foo_id, ...)
SELECT ..., Foos.foo_id, ...
FROM Staging
JOIN Foos ON Foos.foo = Staging.foo;
An advantage of "Batched Normalization" is that you can
summarize directly from the Staging table. Two approaches:
Case 1: PRIMARY KEY (dy, foo) and summarization is in lock step with, say, changes in dy.
⚈ This approach can have troubles if new data arrives after you have summarized the day's data.
INSERT INTO Summary (dy, foo, ct, blah_total)
SELECT DATE(dt) as dy,
foo,
COUNT(*) as ct,
SUM(blah) as blah_total
FROM Staging
GROUP BY 1, 2;
Case 2: (dy, foo) is a non-UNIQUE INDEX.
⚈ Same code as Case 1.
⚈ By having the index be non-UNIQUE, delayed data simply shows up as extra rows.
⚈ You need to take care to avoid summarizing the data twice. (The id on the Fact table may be a good tool for that.)
Case 3: PRIMARY KEY (dy, foo) and summarization can happen anytime.
INSERT INTO Summary (dy, foo, ct, blah_total)
    SELECT DATE(dt) as dy, foo,
           COUNT(*) as ct, SUM(blah) as blah_total
        FROM Staging
        GROUP BY 1, 2
    ON DUPLICATE KEY UPDATE
        ct = ct + VALUES(ct),
        blah_total = blah_total + VALUES(blah_total);
Too many choices?
This document lists a number of ways to do things.
Your situation may lead to one approach being more/less acceptable.
But, if you are thinking "Just tell me what to do!", then here:
⚈ Batch load the raw data into a temporary table (Staging).
⚈ Normalize from Staging -- use code in Case 3.
⚈ INSERT .. SELECT to move the data from Staging into the Fact table
⚈ Summarize from Staging to Summary table(s) via IODKU (Insert ... On Duplicate Key Update).
⚈ Drop the Staging
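Putting those steps together, here is a minimal sketch of one load cycle, reusing the Staging/Foos/blah naming from the normalization section above (the file name and column names are hypothetical, and error handling is omitted):
    -- Load the raw rows into the staging table:
    LOAD DATA LOCAL INFILE 'batch_0001.csv'
        INTO TABLE Staging
        FIELDS TERMINATED BY ',';

    -- Normalize (see "Batched Normalization" above):
    INSERT IGNORE INTO Foos (foo)
        SELECT DISTINCT Staging.foo
            FROM Staging
            LEFT JOIN Foos ON Foos.foo = Staging.foo
            WHERE Foos.foo_id IS NULL;

    -- Move into the Fact table, swapping strings for ids:
    INSERT INTO Fact (dt, foo_id, blah)
        SELECT Staging.dt, Foos.foo_id, Staging.blah
            FROM Staging
            JOIN Foos ON Foos.foo = Staging.foo;

    -- Summarize from Staging via IODKU (see Case 3 above):
    INSERT INTO Summary (dy, foo, ct, blah_total)
        SELECT DATE(dt) as dy, foo, COUNT(*) as ct, SUM(blah) as blah_total
            FROM Staging
            GROUP BY 1, 2
        ON DUPLICATE KEY UPDATE
            ct = ct + VALUES(ct),
            blah_total = blah_total + VALUES(blah_total);

    DROP TABLE Staging;   -- or TRUNCATE it and reuse it for the next batch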
Those techniques should perform well and scale well in most cases.
As you develop your situation, you may discover why I described
alternative solutions.
Purging old data
Typically the Fact table is PARTITIONed BY RANGE (10-60 ranges of days/weeks/etc)
and needs periodic purging (DROP PARTITION).
This link discusses a safe/clean way to design the partitioning
and do the DROPs:
Purging PARTITIONs
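As a rough sketch, assuming the Fact table is PARTITION BY RANGE (TO_DAYS(dt)) with weekly partitions plus a 'future' catch-all (the partition names and dates are hypothetical):
    ALTER TABLE Fact
        REORGANIZE PARTITION future INTO (
            PARTITION p20200309 VALUES LESS THAN (TO_DAYS('2020-03-16')),
            PARTITION future    VALUES LESS THAN MAXVALUE );   -- add next week's partition
    ALTER TABLE Fact DROP PARTITION p20191209;                  -- purge the oldest week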
Master / Replica
For "read scaling", backup, and failover, use Master-Replica replication or something fancier.
Do ingestion only on a single active Master; let it replicate to the Replica(s).
Generate Reports on the Replica(s).
Galera (including Percona XtraDB Cluster and MariaDB) and MySQL's InnoDB Cluster (Group Replication) provide
a way to do "write scaling", but it may add more complexity than benefits.
Probably the
"Extreme Design"
is the outline for how to ingest on multiple Galera nodes.
Special care needs to be taken on when to use (or not use) transactions,
and how to handle a 'deadlock' on COMMIT (which is key to Galera's performance).
Sharding
"Sharding" is the splitting of data across multiple servers.
(In contrast, Replication and Clustering have the same data on all servers,
requiring all data to be written to all servers.)
With the non-sharding techniques described here, terabyte(s) of data
can be handled by a single machine. Tens of terabytes probably requires
sharding.
Sharding is beyond the scope of this document.
(I'll save this fun topic for another day.)
How Fast? How Big?
With the techniques described here, you may be able to achieve the following
performance numbers.
I say "may" because every Data Warehouse situation is different, and
you may require performance-hurting deviations from what I describe here.
I give multiple options for some aspects;
these may cover some of your deviations.
One big performance killer is UUID/GUID keys. Since they are
very 'random', updates to tables keyed by such are limited to 1 row = 1 disk hit.
Plain disks (HDDs) can handle only 100 hits/second. RAID and/or SSD can increase
that to something like 1000 hits/sec. Huge amounts of RAM (for caching
the random index) are a costly solution.
It is possible to turn "type-1" UUIDs into roughly-chronological keys,
thereby mitigating the performance problems if the UUIDs are written/read
with some chronological clustering. See
UUID discussion
Hardware, etc:
⚈ Single SATA drive: 100 IOPs (Input/Output operations per second)
⚈ RAID with N physical drives -- 100*N IOPs (roughly)
⚈ SSD -- 5 times as fast as rotating media (in this context)
⚈ Batch INSERT -- 100-1000 rows is 10 times as fast as INSERTing 1 row at a time (see above)
⚈ Purge "old" data -- Do not use DELETE or TRUNCATE, design so you can use DROP PARTITION (see above)
⚈ Think of each INDEX (except the PRIMARY KEY on InnoDB) as a separate table
⚈ Consider access patterns of each table/index: random vs at-the-end vs something in between
"Count the disk hits" -- back-of-envelope performance analysis
⚈ Random accesses to a table/index -- count each as a disk hit.
⚈ At-the-end accesses (INSERT chronologically or with AUTO_INCREMENT; range SELECT) -- count as zero hits.
⚈ In between (hot/popular ids, etc) -- count as something in between
⚈ For INSERTs, do the analysis on each index; add them up.
⚈ For SELECTs, do the analysis on the one index used, plus the table. (Use of 2 indexes is rare.)
Insert cost, based on datatype of first column in an index:
⚈ AUTO_INCREMENT -- essentially 0 IOPs
⚈ DATETIME, TIMESTAMP -- essentially 0 for 'current' times
⚈ UUID/GUID -- 1 per insert (terrible)
⚈ Others -- depends on their patterns
SELECT cost gets a little tricky:
⚈ Range on PRIMARY KEY -- think of it as getting 100 rows per disk hit.
⚈ IN on PRIMARY KEY -- 1 disk hit per item in IN
⚈ "=" -- 1 hit (for 1 row)
⚈ Secondary key -- First compute the hits for the index, then...
⚈ Think of each row as needing 1 disk hit.
⚈ However, if the rows are likely to be 'near' each other (based on the PRIMARY KEY), then it could be < 1 disk hit/row.
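A brief worked example, using the ~100 IOPs figure for a single HDD (the numbers are illustrative):
    INSERT: AUTO_INCREMENT PRIMARY KEY (~0 hits) + 1 random secondary index (~1 hit) = ~1 hit/row
            => roughly 100 rows inserted per second on one HDD.
    SELECT: 50 rows via a random secondary key = ~1 hit (index) + ~50 hits (rows) = ~51 hits
            => roughly half a second on one HDD.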
More on
Count the Disk Hits
How Fast?
Look at your data; compute raw rows per second.
There are about 30M seconds in a year; 86,400 seconds per day.
Inserting 30 rows per second becomes a billion rows per year.
10 rows per second is about all you can expect from an
ordinary machine (after allowing for various overheads).
If you have less than that, you don't have many worries,
but still you should probably create Summary tables.
If more than 10/sec, then batching, etc, becomes vital.
Even on spiffy hardware, 100/sec is about all you can
expect without utilizing the techniques here.
Not so Fast?
Let's say your insert rate is only one-tenth of your disk IOPs (eg, 10 rows/sec vs 100 IOPs).
Also, let's say your data is not "bursty"; that is, the data comes in somewhat
smoothly throughout the day.
Note that 10 rows/sec (300M/year) implies maybe 30GB for ( data + indexes + normalization tables +
summary tables ) for 1 year. I would call this "not so big".
Still, the normalization and summarization are important.
Normalization keeps the data from being, say, twice as big.
Summarization speeds up the reports by orders of magnitude.
Let's design and analyse a "simple ingestion scheme" for 10 rows/second,
without 'batching'.
# Normalize:
$foo_id = SELECT foo_id FROM Foos WHERE foo = '$foo';
if no $foo_id, then
    INSERT IGNORE INTO Foos (foo) VALUES ('$foo');
    $foo_id = SELECT foo_id FROM Foos WHERE foo = '$foo';
# Inserts:
BEGIN;
INSERT INTO Fact ...;
INSERT INTO Summary ... ON DUPLICATE KEY UPDATE ...;
COMMIT;
# (plus code to deal with errors on INSERTs or COMMIT)
Depending on the number and randomness of your indexes, etc,
10 Fact rows may (or may not) take less than 100 IOPs.
Also, note that as the data grows over time, random indexes
will become less and less likely to be cached.
That is, even if it runs fine with 1 year's worth of data,
it may be in trouble with 2 years' worth.
For those reasons, I started this discussion with a wide margin
(10 rows versus 100 IOPs).
References
sec. 3.3.2: Dimensional Model and "Star schema"
Summary Tables
Postlog
Started in 2009;
posted June, 2013;
minor refresh Feb, 2020
-- Rick James