Using a Staging Table for Efficient MySQL Data Warehouse Ingestion

CREATE TABLE Hosts (
    host_id  MEDIUMINT UNSIGNED  NOT NULL AUTO_INCREMENT,
    host_name VARCHAR(99) NOT NULL,
    PRIMARY KEY (host_id),      -- for mapping one direction
    INDEX(host_name, host_id)   -- for mapping the other direction
) ENGINE=InnoDB;                -- InnoDB works best for Many:Many mapping table

Here's how you can use Staging as an efficient way achieve the swap from name to id.

Staging has two fields (for this normalization example):

    host_name VARCHAR(99) NOT NULL,     -- Comes from the insertion proces
    host_id  MEDIUMINT UNSIGNED  NULL,  -- NULL to start with; see code below

Meawhile, the Fact table has:

    host_id  MEDIUMINT UNSIGNED NOT NULL,

SQL #1 (of 2):

    # This should not be in the main transaction, and it should be done with autocommit = ON
    # In fact, it could lead to strange errors if this were part
    #    of the main transaction and it ROLLBACKed.
    INSERT IGNORE INTO Hosts (host_name)
        SELECT DISTINCT s.host_name
            FROM Staging AS s
            LEFT JOIN Hosts AS n  ON n.host_name = s.host_name
            WHERE n.host_id IS NULL;

By isolating this as its own transaction, we get it finished in a hurry, thereby minimizing blocking. By saying IGNORE, we don't care if other threads are 'simultaneously' inserting the same host_names.

There is a subtle reason for the LEFT JOIN. If, instead, it were INSERT IGNORE..SELECT DISTINCT, then the INSERT would preallocate auto_increment ids for as many rows as the SELECT provides. This is very likely to "burn" a lot of ids, thereby leading to overflowing MEDIUMINT unnecessarily. The LEFT JOIN leads to finding just the new ids that are needed (except for the rare possibility of a 'simultaneous' insert by another thread). More rationale: Mapping table (in Index Cookbook)

SQL #2:

    # Also not in the main transaction, and it should be with autocommit = ON
    # This multi-table UPDATE sets the ids in Staging:
    UPDATE   Hosts AS n
        JOIN Staging AS s  ON s.host_name = n.host_name
        SET s.host_id = n.host_id

This gets the IDs, whether already existing, set by another thread, or set by SQL #1.

If the size of Staging changes depending on the busy versus idle times of the day, this pair of SQL statements has another comforting feature. The more rows in Staging, the more efficient the SQL runs, thereby helping compensate for the "busy" times.

The companion Data Warehouse blog
folds SQL #2 into the INSERT INTO Fact. But you may need host_id for further normalization steps and/or Summarization steps, so this explicit UPDATE shown here is often better.

more on burning ids

Flip-Flop Staging

The simple way to stage is to ingest for a while, then batch-process what is in Staging. But that leads to new records piling up waiting to be staged. To avoid that issue, have 2 processes:
    ⚈  one process (or set of processes) for INSERTing into Staging;
    ⚈  one process (or set of processes) to do the batch processing (normalization, summarization).

To keep the processes from stepping on each other, we have a pair of staging tables:
    ⚈  Staging is being INSERTed into;
    ⚈  StageProcess is one being processed for normalization, summarization, and moving to the Fact table.
A separate process does the processing, then swaps the tables:

    DROP   TABLE StageProcess;
    CREATE TABLE StageProcess LIKE Staging;
    RENAME TABLE Staging TO tmp, StageProcess TO Staging, tmp TO StageProcess;

This may not seem like the shortest way to do it, but has these features:
⚈ The DROP + CREATE might be faster than TRUNCATE, which is the desired effect.
⚈ The RENAME is atomic, so the INSERT process(es) never find that Staging is missing.

A variant on the 2-table Flip-Flop is to have a separate Staging table for each Insertion process. The Processing process would run around to each Staging in turn.

A variant on that would be to have a separate Processing process for each Insertion process.

The choice depends on which is faster (Insertion or Processing). There are tradeoffs; a single Processing thread avoids some locks, but lacks some parallelism.

ENGINE Choice

Fact table -- InnoDB, if for no other reason than that a system crash would not need a REPAIR TABLE. (REPAIRing a billion-row MyISAM table can take hours or days.)

Normalization tables -- InnoDB, primarily because it can be done efficiently with 2 indexes, whereas, MyISAM would need 4 to achieve the same efficiency.

Staging -- Lots of options here.
    ⚈  If you have multiple Inserters and a single Staging table, InnoDB is desirable due to row-level, not table-level, locking.
    ⚈  MEMORY may be the fastest and it avoids I/O. This is good for a single staging table.
    ⚈  For multiple Inserters, a separate Staging table for each Inserter is desired.
    ⚈  For multiple Inserters into a single Staging table, InnoDB may be faster. (MEMORY does table-level locking.)
    ⚈  Be aware that the size of a MEMORY table must be declared (max_heap_table_size); some day you will overflow that and have a nasty bug to fix.
    ⚈  With one non-InnoDB Staging table per Inserter, using an explicit LOCK TABLE avoids repeated implicit locks on each INSERT.
    ⚈  But, if you are doing LOCK TABLE and the Processing thread is separate, an UNLOCK is necessary periodically to let the RENAME grab the table.
    ⚈  "Batch INSERTs" (100-1000 rows per SQL) eliminates much of the issues of the above bullet items.

Confused? Lost? There are enough variations in applications that make it impractical to predict what is best. Or, simply good enough. Your ingestion rate may be low enough that you don't hit the brick walls that I am helping you avoid.

Should you do "CREATE TEMPORARY TABLE"? Probably not. Consider Staging as part of the data flow, not to be DROPped.

Summarization

This is mostly covered here: Summary Tables blog
Summarize from the Staging table instead of the Fact table.

Replication Issues

Row Based Replication (RBR) is probably the best option.

The following allows you to keep more of the Ingestion process in the Master, thereby not bogging down the Replica(s) with writes to the Staging table.
    ⚈  RBR
    ⚈  Staging is in a separate database
    ⚈  That database is not replicated (binlog-ignore-db on Master)
    ⚈  In the Processing steps, USE that database, reach into the main db via syntax like "MainDb.Hosts". (Otherwise, the binlog-ignore-db does the wrong thing.)

That way
    ⚈  Writes to Staging are not replicated.
    ⚈  Normalization sends only the few updates to the normalization tables.
    ⚈  Summarization sends only the updates to the summary tables.
    ⚈  Flip-flop does not replicate the DROP, CREATE or RENAME.

Sharding

You could possibly spread the data you are trying ingest across multiple machines in a predictable way (sharding on hash, range, etc). Running "reports" on a sharded Fact table is a challenge unto itself. On the other hand, Summary Tables rarely get too big to manage on a single machine.

For now, Sharding is beyond the scope of this blog.

Push Me vs Pull Me

I have implicitly assumed the data is being pushed into the database. If, instead, you are "pulling" data from some source(s), then there are some different considerations.

Case 1: An hourly upload; run via cron

    1.  Grab the upload, parse it
    2.  Put it into the Staging table
    3.  Normalize -- each SQL in its own transaction (autocommit)
    4.  BEGIN
    5.  Summarize
    6.  Copy from Staging to Fact.
    7.  COMMIT

If you need parallelism in Summarization, you will have to sacrafice the transactional integity of steps 4-7.

Caution: If these steps add up to more than an hour, you are in deep dodo.

Case 2: You are polling for the data

It is probably reasonable to have multiple processes doing this, so it will be detailed about locking.

    0.  Create a Staging table for this polling processor.
Loop:
    1.  With some locked mechanism, decide which 'thing' to poll.
    2.  Poll for the data, pull it in, parse it. (Potentially polling and parsing are significantly costly)
    3.  Put it into the process-specific Staging table
    4.  Normalize -- each SQL in its own transaction (autocommit)
    5.  BEGIN
    6.  Summarize
    7.  Copy from Staging to Fact.
    8.  COMMIT
    9.  Declare that you are finished with this 'thing' (see step 1)
EndLoop.

iblog_file_size should be larger than the change in the STATUS "Innodb_os_log_written" across the BEGIN...COMMIT transaction (for either Case).

References

companion Data Warehouse blog
companion Summary Table blog
a forum thread that prodded me into writing this blog
StackExchange discussion
Some insight into locking
[[https://dba.stackexchange.com/questions/279536/how-to-optimize-mysql-database-of-250-million-rows-for-bulk-inserts-and-selects][Loading Sensor data]] (both initially and continually) A StackOverflow

Postlog

posted Nov, 2014; refreshed Aug, 2015

-- Rick James

MySQL Documents by Rick James

HowTo Techniques for Optimizing Tough Tasks:

Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)

Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Chunking lengthy DELETE/UPDATE/etc.

Data Warehouse techniques:
Data Warehouse Overview Summary Tables High speed ingestion Bulk Normalization

Schema and code design for large Sensor database

Entity-Attribute-Value (EAV) -- a common, poorly performing, design pattern; plus an alternative

5 methods for 'Find Nearest'

Lat/Lng search to Find the nearest 10 pizza parlors
Lat/Long representation choices

Z-Order 'find nearest'

Pagination, not with OFFSET, LIMIT

Techniques on efficiently finding a random row (On beyond ORDER BY RAND())

GUID/UUID Performance (type 1 only)

IP Range Table Performance -- or other disjoint ranges

Rollup Unique User Counts

Alter of a Huge table -- Mostly obviated by 5.6

Efficient List of Latest 10 news articles

Build and execute a Pivot SELECT (showing rows as columns)

(Groupwise Max): Efficiently find largest row(s) for each group

Other Tips, Tuning, Debugging, Optimizations, etc...

Rick's RoTs (Rules of Thumb -- lots of tips)

Datatypes and building a good schema

Memory Allocation (caching, etc)

Character Set and Collation problem solver
Trouble with UTF-8 If you want case folding, but accent sensitivity, please file a request at https://bugs.mysql.com .
Python tips, PHP tips, other language tips
utf8 Collations utf8mb4 Collations on 8.0

Converting from MyISAM to InnoDB -- includes differences between them

Compound INDEXes plus other insights into the mysteries of INDEXing

Cookbook for Creating Indexes
Many-to-many mapping table Handler counts wp_postmeta UNION+OFFSET

MySQL Limits -- built-in hard limits
767-byte INDEX limit

Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)

5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users

Analyze MySQL Performance
Analyze VARIABLEs and GLOBAL STATUS Analyze SlowLog

My slides from conferences
MiniFest 2021 - Rick James & Daniel Black - Answering on Stack Overflow(+comments) - MariaDB Frontlines
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB

Contact me via LinkedIn; be sure to include a brief teaser in the Invite request:

Did my articles help you out? Like what you see? Consider donating:

☕️ Buy me a Banana Latte and bagel ($10) There is no obligation but it would put a utf8mb4 smiley 🙂 on my face, instead of the Mojibake "ðŸ™‚"

Using a Staging Table for Efficient MySQL Data Warehouse Ingestion

Table of Contents

The Problem

Overview of Solution

Injection Speed

Normalization