Data Warehouse in MySQL

Table of Contents

Preface

Terminology

Fact Table

Why keep the Fact table?

Batching the load of the Fact table

Batched INSERT Statement

Normalization (Dimension) Table

Batched Normalization

Too many choices?

Purging old data

Master / Replica

Sharding

How Fast? How Big?

How Fast?

Not so Fast?

References

Postlog

Brought to you by Rick James

Preface

This document discusses techniques for improving performance for DW-like tables in MySQL and MariaDB.

    ⚈  How to load large tables.
    ⚈  Normalization.
    ⚈  Developing 'Summary tables' to make 'reports' efficient.
    ⚈  Purging old data.

Details on Summary Tables are covered in the companion document:
Summary Tables

Terminology

This list mirrors "Data Warehouse" terminology.
    ⚈  Fact table -- The one huge table with the 'raw' data.
    ⚈  Summary table -- a redundant table of summarized data that can be used for efficiency
    ⚈  Dimension -- columns that identify aspects of the dataset (region, country, user, SKU, zipcode, ...)
    ⚈  Normalization table (dimension table) -- mapping between strings and ids; used for space and speed.
    ⚈  Normalization -- The process of building the mapping ('New York City' ⇔ 123)

Fact Table

Techniques that should be applied to the huge Fact table.

    ⚈  id INT/BIGINT UNSIGNED NOT NULL AUTO_INCREMENT
    ⚈  PRIMARY KEY (id)
    ⚈  Probably no other INDEXes
    ⚈  Accessed only via id
    ⚈  Most VARCHARs are "normalized"; ids are stored instead
    ⚈  ENGINE = InnoDB
    ⚈  All "reports" use Summary tables, not the Fact table
    ⚈  Summary tables may be populated from ranges of id, or by other techniques described below

There are exceptions where the Fact table must be accessed to retrieve multiple rows. However, you should minimize the number of INDEXes on the table because they are likely to be costly on INSERT.
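
Putting those bullets together, here is a minimal sketch of such a Fact table; the non-id columns (dt, foo_id, blah) are hypothetical placeholders for your own data:
    CREATE TABLE Fact (
        id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        dt DATETIME NOT NULL,                  -- when the event happened
        foo_id MEDIUMINT UNSIGNED NOT NULL,    -- normalized id (see Dimension tables below)
        blah FLOAT NOT NULL,                   -- some measurement
        PRIMARY KEY (id)                       -- probably the only index
    ) ENGINE = InnoDB;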

Why keep the Fact table?

Once you have built the Summary table(s), there is not much need for the Fact table. One option that you should seriously consider is to not have a Fact table. Or, at least, you could purge old data from it sooner than you purge the Summary tables. Maybe even keep the Summary tables forever.

Case 1: You need to find the raw data involved in some event. But how will you find those row(s)? This is where a secondary index may be required.

If a secondary index is bigger than can be cached in RAM, and if the column(s) being indexed are random, then each row inserted may cause a disk hit to update the index. This limits insert speed to something like 100 rows per second (on HDDs). Multiple random indexes slow down insertion further. RAID striping and/or SSDs speed up insertion. Write caching helps, but only for bursts.

Case 2: You need to find some event, but you did not plan ahead with the optimal INDEX. If the data is PARTITIONed on date, then, as long as you have a clue of when the event occurred, "partition pruning" will keep the query from being terribly slow.

Case 3: Over time, the application is likely to need new 'reports', which may lead to a new Summary table. At this point, it would be handy to scan through the old data to populate the new table.

Case 4: You find a flaw in the summarization, and need to rebuild an existing Summary table.

Cases 3 and 4 both need the "raw" data. But they don't necessarily need the data sitting in a database table. It could be in the pre-database format (such as log files). So, consider not building the Fact table, but simply keeping the raw data, compressed, on some file system.

Batching the load of the Fact table

When talking about billions of rows in the Fact table, it is essentially mandatory that you "batch" the inserts. There are two main ways:

    ⚈  INSERT INTO Fact (.,.,.) VALUES (.,.,.), (.,.,.), ...; -- "Batch insert"
    ⚈  LOAD DATA ...;

A third way is to INSERT or LOAD into a Staging table, then
    ⚈  INSERT INTO Fact SELECT * FROM Staging;
This INSERT..SELECT allows you to do other things, such as normalization. More later.
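
For example, here is a minimal sketch of loading a file into the Staging table; the file name and the dt/foo/blah columns are hypothetical (they match the examples used later in this document):
    LOAD DATA LOCAL INFILE 'batch.csv'      -- LOCAL requires local_infile to be enabled
        INTO TABLE Staging
        FIELDS TERMINATED BY ','
        (dt, foo, blah);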

Batched INSERT Statement

Chunk size should usually be 100-1000 rows.
    ⚈  100-1000 rows per INSERT will run about 10 times as fast as single-row inserts.
    ⚈  Beyond 100, you may be interfering with replication and SELECTs.
    ⚈  Beyond 1000, you are into diminishing returns -- virtually no further performance gains.
    ⚈  Don't go past, say, 1MB for the constructed INSERT statement. This deals with packet sizes, etc. (1MB is unlikely to be hit for a Fact table.)
Decide whether your application should lean toward the 100 or the 1000.
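
For concreteness, a batched INSERT looks like this, using the hypothetical dt/foo_id/blah columns from the Fact sketch above; in practice the VALUES list would run to 100-1000 rows:
    INSERT INTO Fact (dt, foo_id, blah)
        VALUES
            ('2020-02-01 00:00:01', 123, 1.1),
            ('2020-02-01 00:00:02', 456, 2.2),
            ('2020-02-01 00:00:03', 789, 3.3);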

If your data is coming in continually, and you are adding a batching layer, let's do some math. Compute your ingestion rate -- R rows per second.
    ⚈  If R < 10 (= 1M/day = 300M/year) -- single-row INSERTs would probably work fine (that is, batching is optional)
    ⚈  If R < 100 (3B records per year) -- secondary indexes on Fact table may be ok
    ⚈  If R < 1000 (100M records/day) -- avoid secondary indexes on Fact table.
    ⚈  If R > 1000 -- Batching may not work.
Decide how long (S seconds) you can stall loading the data in order to collect a batch of rows.
    ⚈  If S < 0.1s -- May not be able to keep up

If batching seems viable, then design the batching layer to gather for S seconds or 100-1000 rows, whichever comes first.

(Note: Similar math applies to rapid UPDATEs of a table.)

More on
High Speed Ingestion

Normalization (Dimension) Table

Normalization is important in Data Warehouse applications because it significantly cuts down on the disk footprint, thereby improving performance. There are other reasons for normalizing, but space is the important one for DW.

Here is a typical pattern for a Dimension table:
    CREATE TABLE Emails (
        email_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- don't make bigger than needed
        email VARCHAR(...) NOT NULL,
        PRIMARY KEY (email),  -- for looking up one way
        INDEX(email_id)  -- for looking up the other way (UNIQUE is not needed)
    ) ENGINE = InnoDB;  -- to get clustering
Notes:
    ⚈  MEDIUMINT is 3 bytes with UNSIGNED range of 0..16M; pick SMALLINT, INT, etc, based on a conservative estimate of how many ids you will eventually have.
    ⚈  See: datatype sizes
    ⚈  There may be more than one VARCHAR in the table.
    ⚈  InnoDB is better than MyISAM here because of the way the two keys are structured.
    ⚈  The secondary key is effectively (email_id, email), hence 'covering' for certain queries.
    ⚈  It is OK to not specify an AUTO_INCREMENT to be UNIQUE.
    ⚈  Do not normalize "continuous" values such as dates, floats, and integers.
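
For example, the two lookups this table supports (the email value and id are hypothetical):
    SELECT email_id FROM Emails WHERE email = 'user@example.com';  -- string -> id (uses the PRIMARY KEY)
    SELECT email FROM Emails WHERE email_id = 12345;               -- id -> string (uses the secondary, 'covering', index)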

Batched Normalization

I bring this up as a separate topic because of some of the subtle issues that can happen.

You may be tempted to do
    INSERT IGNORE INTO Foos (foo)
        SELECT DISTINCT foo FROM Staging;  -- not wise
It has the problem of burning AUTO_INCREMENT ids. This is because MySQL pre-allocates ids before getting to IGNORE. That could rapidly increase the AUTO_INCREMENT values beyond what you expected.

Better is this...
    INSERT IGNORE INTO Foos (foo)
        SELECT DISTINCT Staging.foo
            FROM Staging
            LEFT JOIN Foos ON Foos.foo = Staging.foo
            WHERE Foos.foo_id IS NULL;
Notes:
    ⚈  The LEFT JOIN .. IS NULL finds the foos that are not yet in Foos.
    ⚈  This INSERT..SELECT must not be done inside the transaction with the rest of the processing. Otherwise, you add to deadlock risks, leading to burned ids from the rollback.
    ⚈  IGNORE is used in case you are doing the INSERT from multiple processes simultaneously.

Once that INSERT is done, this will find all the foo_ids it needs:
    INSERT INTO Fact (..., foo_id, ...)
        SELECT ..., Foos.foo_id, ...
            FROM Staging
            JOIN Foos ON Foos.foo = Staging.foo;

An advantage of "Batched Normalization" is that you can summarize directly from the Staging table. Two approaches:

Case 1: PRIMARY KEY (dy, foo) and summarization is in lock step with, say, changes in dy.
    ⚈  This approach can have troubles if new data arrives after you have summarized the day's data.
    INSERT INTO Summary (dy, foo, ct, blah_total)
        SELECT  DATE(dt) as dy,
                foo,
                COUNT(*) as ct,
                SUM(blah) as blah_total
            FROM Staging
            GROUP BY 1, 2;
Case 2: (dy, foo) is a non-UNIQUE INDEX.
    ⚈  Same code as Case 1.
    ⚈  By having the index be non-UNIQUE, delayed data simply shows up as extra rows.
    ⚈  You need to take care to avoid summarizing the data twice. (The id on the Fact table may be a good tool for that; see the sketch below.)
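
Here is one way that might be done -- a sketch under stated assumptions: the SummaryProgress bookkeeping table is hypothetical, and it summarizes from the Fact table by id range rather than from Staging:
    -- Hypothetical one-row bookkeeping table:
    --   CREATE TABLE SummaryProgress (last_id BIGINT UNSIGNED NOT NULL);
    SET @low  = (SELECT last_id FROM SummaryProgress);
    SET @high = (SELECT MAX(id) FROM Fact);
    INSERT INTO Summary (dy, foo, ct, blah_total)
        SELECT  DATE(dt) as dy, foo_id as foo,
                COUNT(*) as ct, SUM(blah) as blah_total
            FROM Fact
            WHERE id > @low AND id <= @high
            GROUP BY 1, 2;
    UPDATE SummaryProgress SET last_id = @high;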

Case 3: PRIMARY KEY (dy, foo) and summarization can happen anytime.
    INSERT INTO Summary (dy, foo, ct, blah_total)
        SELECT  DATE(dt) as dy, foo,
                COUNT(*) as ct, SUM(blah) as blah_total
            FROM Staging
            GROUP BY 1, 2
        ON DUPLICATE KEY UPDATE
            ct = ct + VALUES(ct),
            blah_total = blah_total + VALUES(blah_total);

Too many choices?

This document lists a number of ways to do things. Your situation may lead to one approach being more/less acceptable. But, if you are thinking "Just tell me what to do!", then here:

    ⚈  Batch load the raw data into a temporary table (Staging).
    ⚈  Normalize from Staging -- use code in Case 3.
    ⚈  INSERT .. SELECT to move the data from Staging into the Fact table
    ⚈  Summarize from Staging to Summary table(s) via IODKU (Insert ... On Duplicate Key Update).
    ⚈  Drop the Staging table.

Those techniques should perform well and scale well in most cases. As you develop your situation, you may discover why I described alternative solutions.
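
To make the recipe concrete, here is a sketch of the whole pipeline, assuming the hypothetical dt/foo/blah columns used in the examples above; adjust names, datatypes, and the LOAD DATA options to your data:
    -- 1. Batch load the raw data into a staging table
    CREATE TABLE Staging (
        dt   DATETIME NOT NULL,
        foo  VARCHAR(100) NOT NULL,
        blah FLOAT NOT NULL
    ) ENGINE = InnoDB;
    LOAD DATA LOCAL INFILE 'batch.csv'
        INTO TABLE Staging
        FIELDS TERMINATED BY ',';

    -- 2. Normalize any new foo values (see "Batched Normalization" above)
    INSERT IGNORE INTO Foos (foo)
        SELECT DISTINCT Staging.foo
            FROM Staging
            LEFT JOIN Foos ON Foos.foo = Staging.foo
            WHERE Foos.foo_id IS NULL;

    -- 3. Move the batch into the Fact table
    INSERT INTO Fact (dt, foo_id, blah)
        SELECT Staging.dt, Foos.foo_id, Staging.blah
            FROM Staging
            JOIN Foos ON Foos.foo = Staging.foo;

    -- 4. Summarize via IODKU (Case 3)
    INSERT INTO Summary (dy, foo, ct, blah_total)
        SELECT  DATE(dt) as dy, foo,
                COUNT(*) as ct, SUM(blah) as blah_total
            FROM Staging
            GROUP BY 1, 2
        ON DUPLICATE KEY UPDATE
            ct = ct + VALUES(ct),
            blah_total = blah_total + VALUES(blah_total);

    -- 5. Done with this batch
    DROP TABLE Staging;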

Purging old data

Typically the Fact table is PARTITIONed BY RANGE (10-60 ranges of days/weeks/etc) and needs purging (DROP PARTITION) periodically. This link discusses a safe/clean way to design the partitioning and do the DROPs:
Purging PARTITIONs
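
A hedged sketch of what that looks like; the partition names and dates are hypothetical, and note that MySQL requires the partitioning column to be part of every UNIQUE key, so the Fact table's PRIMARY KEY would need to become, say, (id, dt):
    ALTER TABLE Fact
        PARTITION BY RANGE (TO_DAYS(dt)) (
            PARTITION p20200201 VALUES LESS THAN (TO_DAYS('2020-02-02')),
            PARTITION p20200202 VALUES LESS THAN (TO_DAYS('2020-02-03')),
            PARTITION future    VALUES LESS THAN (MAXVALUE)
        );

    -- Later, purging the oldest day is a cheap metadata operation, unlike DELETE:
    ALTER TABLE Fact DROP PARTITION p20200201;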

Master / Replica


For "read scaling", backup, and failover, use Master-Replica replication or something fancier. Do ingestion only on a single active Master; let it replicate to the Replica(s). Generate Reports on the Replica(s).

Galera (including Percona XtraDB Cluster and MariaDB) and MySQL's InnoDB Cluster (Group Replication) provide a way to do "write scaling", but it may add more complexity than benefit. Probably the
"Extreme Design"
is the outline for how to ingest on multiple Galera nodes. Special care needs to be taken on when to use (or not use) transactions, and how to handle a 'deadlock' on COMMIT (which is key to Galera's performance).

Sharding

"Sharding" is the splitting of data across multiple servers. (In contrast, Replication and Clustering have the same data on all servers, requiring all data to be written to all servers.)

With the non-sharding techniques described here, terabyte(s) of data can be handled by a single machine. Tens of terabytes probably requires sharding.

Sharding is beyond the scope of this document. (I'll save this fun topic for another day.)

How Fast? How Big?

With the techniques described here, you may be able to achieve the following performance numbers. I say "may" because every Data Warehouse situation is different, and you may require performance-hurting deviations from what I describe here. I give multiple options for some aspects; these may cover some of your deviations.

One big performance killer is UUID/GUID keys. Since they are very 'random', updates to tables keyed by such are limited to 1 row = 1 disk hit. Plain disks (HDDs) can handle only 100 hits/second. RAID and/or SSD can increase that to something like 1000 hits/sec. Huge amounts of RAM (for caching the random index) are a costly solution. It is possible to turn "type-1" UUIDs into roughly-chronological keys, thereby mitigating the performance problems if the UUIDs are written/read with some chronological clustering. See
UUID discussion

Hardware, etc:
    ⚈  Single SATA drive: 100 IOPs (Input/Output operations per second)
    ⚈  RAID with N physical drives -- 100*N IOPs (roughly)
    ⚈  SSD -- 5 times as fast as rotating media (in this context)
    ⚈  Batch INSERT -- 100-1000 rows is 10 times as fast as INSERTing 1 row at a time (see above)
    ⚈  Purge "old" data -- Do not use DELETE or TRUNCATE, design so you can use DROP PARTITION (see above)
    ⚈  Think of each INDEX (except the PRIMARY KEY on InnoDB) as a separate table
    ⚈  Consider access patterns of each table/index: random vs at-the-end vs something in between

"Count the disk hits" -- back-of-envelope performance analysis
    ⚈  Random accesses to a table/index -- count each as a disk hit.
    ⚈  At-the-end accesses (INSERT chronologically or with AUTO_INCREMENT; range SELECT) -- count as zero hits.
    ⚈  In between (hot/popular ids, etc) -- count as something in between
    ⚈  For INSERTs, do the analysis on each index; add them up.
    ⚈  For SELECTs, do the analysis on the one index used, plus the table. (Use of 2 indexes is rare.)

Insert cost, based on datatype of first column in an index:
    ⚈  AUTO_INCREMENT -- essentially 0 IOPs
    ⚈  DATETIME, TIMESTAMP -- essentially 0 for 'current' times
    ⚈  UUID/GUID -- 1 per insert (terrible)
    ⚈  Others -- depends on their patterns

SELECT cost gets a little tricky (a worked example follows this list):
    ⚈  Range on PRIMARY KEY -- think of it as getting 100 rows per disk hit.
    ⚈  IN on PRIMARY KEY -- 1 disk hit per item in IN
    ⚈  "=" -- 1 hit (for 1 row)
    ⚈  Secondary key -- First compute the hits for the index, then...
    ⚈  Think of each row as needing 1 disk hit.
    ⚈  However, if the rows are likely to be 'near' each other (based on the PRIMARY KEY), then it could be < 1 disk hit/row.
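
For example, a back-of-envelope estimate for a typical report query against a Summary table whose PRIMARY KEY is (dy, foo), as in Case 1 above; the dates are hypothetical:
    SELECT foo, SUM(ct) AS total_ct, SUM(blah_total) AS total_blah
        FROM Summary
        WHERE dy >= '2020-01-01' AND dy < '2020-02-01'
        GROUP BY foo;
    -- Range scan on the PRIMARY KEY: roughly 100 rows per disk hit,
    -- so a month's worth of Summary rows costs only a handful of hits.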

More on Count the Disk Hits

How Fast?


Look at your data; compute raw rows per second. There are about 30M seconds in a year; 86,400 seconds per day. Inserting 30 rows per second becomes a billion rows per year.

10 rows per second is about all you can expect from an ordinary machine (after allowing for various overheads). If you have less than that, you don't have many worries, but still you should probably create Summary tables. If more than 10/sec, then batching, etc, becomes vital. Even on spiffy hardware, 100/sec is about all you can expect without utilizing the techniques here.

Not so Fast?

Let's say your insert rate is only one-tenth of your disk IOPs (e.g., 10 rows/sec vs 100 IOPs). Also, let's say your data is not "bursty"; that is, the data comes in somewhat smoothly throughout the day.

Note that 10 rows/sec (300M/year) implies maybe 30GB for ( data + indexes + normalization tables + summary tables ) for 1 year. I would call this "not so big".

Still, the normalization and summarization are important. Normalization keeps the data from being, say, twice as big. Summarization speeds up the reports by orders of magnitude.

Let's design and analyse a "simple ingestion scheme" for 10 rows/second, without 'batching'.
    # Normalize:
    $foo_id = SELECT foo_id FROM Foos WHERE foo = $foo;
    if no $foo_id, then
        INSERT IGNORE INTO Foos ...
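        # then re-SELECT (or use LAST_INSERT_ID()) to get the newly assigned $foo_id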

    # Inserts:
    BEGIN;
        INSERT INTO Fact ...;
        INSERT INTO Summary ... ON DUPLICATE KEY UPDATE ...;
    COMMIT;
    # (plus code to deal with errors on INSERTs or COMMIT)

Depending on the number and randomness of your indexes, etc, 10 Fact rows may (or may not) take less than 100 IOPs.

Also, note that as the data grows over time, random indexes will become less and less likely to be cached. That is, even if it runs fine with 1 year's worth of data, it may be in trouble with 2 years' worth.

For those reasons, I started this discussion with a wide margin (10 rows versus 100 IOPs).

References

sec. 3.3.2: Dimensional Model and "Star schema"
Summary Tables

Postlog

Started in 2009;   posted June, 2013;   minor refresh Feb, 2020
-- Rick James
