The Problem
You are ingesting lots of data into a MySQL or MariaDB table.
Performance is bottlenecked by the INSERTs.
This will be couched in terms of Data Warehousing, with a huge Fact table and
Summary (aggregation) tables.
Overview of Solution
⚈ Have a separate staging table.
⚈ Inserts go into Staging.
⚈ Normalization and Summarization reads Staging, not Fact.
⚈ After normalizing, the data is copied from Staging to Fact.
Staging is one (or more) tables in which the data lives only long enough to be handed
off to Normalization, Summary, and the Fact tables.
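For concreteness, a minimal Staging table might look like the following sketch; the ts and bytes columns are hypothetical stand-ins for whatever your ingest stream actually carries:
    CREATE TABLE Staging (
        host_name VARCHAR(99) NOT NULL,   -- raw value from the ingest stream
        ts TIMESTAMP NOT NULL,            -- hypothetical measure columns
        bytes INT UNSIGNED NOT NULL,
        host_id MEDIUMINT UNSIGNED NULL   -- filled in by normalization (see below)
    ) ENGINE=InnoDB;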
Since we are probably talking about a billion-row table, shrinking the width of the Fact table
by normalizing (as discussed below) pays off handsomely.
Changing an INT (4 bytes) to a MEDIUMINT (3 bytes) saves one byte per row, roughly a GB per billion rows.
Replacing a string with an id (normalizing) saves many GB.
This helps disk space and cacheability, hence speed.
Injection Speed
Some variations:
⚈ Big dump of data once an hour, versus continual stream of records.
⚈ The input stream could be single-threaded or multi-threaded.
⚈ You might have 3rd party software tying your hands.
Generally the fastest injection rate can be achieved by "staging" the INSERTs in some way,
then batch processing the staged records.
This blog discusses various techniques for staging and batch processing.
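For example, one common staging technique is batching many rows into a single multi-row INSERT into Staging, which is far cheaper than one statement per row. A sketch, reusing the hypothetical columns from above:
    # One statement, one round trip, one transaction -- much cheaper than three single-row INSERTs:
    INSERT INTO Staging (host_name, ts, bytes) VALUES
        ('alpha.example.com', '2024-01-01 00:00:01', 123),
        ('beta.example.com',  '2024-01-01 00:00:02', 456),
        ('alpha.example.com', '2024-01-01 00:00:03', 789);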
Normalization
Let's say your Input has a VARCHAR host_name column, but you need to
turn that into a smaller MEDIUMINT host_id in the Fact table.
The "Normalization" table, as I call it, looks something like
CREATE TABLE Hosts (
    host_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    host_name VARCHAR(99) NOT NULL,
    PRIMARY KEY (host_id),      -- for mapping one direction
    INDEX(host_name, host_id)   -- for mapping the other direction
) ENGINE=InnoDB;    -- InnoDB works best for a Many:Many mapping table
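As the comments note, the two indexes serve the two mapping directions; a quick illustration (hypothetical values):
    SELECT host_id FROM Hosts WHERE host_name = 'alpha.example.com';  -- uses INDEX(host_name, host_id), which is "covering"
    SELECT host_name FROM Hosts WHERE host_id = 123;                  -- uses the PRIMARY KEY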
Here's how you can use Staging as an efficient way to achieve the swap from name to id.
Staging has two fields (for this normalization example):
    host_name VARCHAR(99) NOT NULL,    -- comes from the insertion process
    host_id MEDIUMINT UNSIGNED NULL,   -- NULL to start with; see code below
Meanwhile, the Fact table has:
    host_id MEDIUMINT UNSIGNED NOT NULL,
SQL #1 (of 2):
    # This should not be in the main transaction, and it should be done with autocommit = ON.
    # In fact, it could lead to strange errors if this were part
    # of the main transaction and it ROLLBACKed.
    INSERT IGNORE INTO Hosts (host_name)
        SELECT DISTINCT s.host_name
            FROM Staging AS s
            LEFT JOIN Hosts AS n  ON n.host_name = s.host_name
            WHERE n.host_id IS NULL;
By isolating this as its own transaction, we get it finished in a hurry, thereby minimizing blocking.
By saying IGNORE, we don't care if other threads are 'simultaneously' inserting
the same host_names.
There is a subtle reason for the LEFT JOIN.
If, instead, it were a plain INSERT IGNORE .. SELECT DISTINCT, the INSERT would preallocate
auto_increment ids for as many rows as the SELECT provides. This is very likely to "burn"
a lot of ids, needlessly hastening overflow of the MEDIUMINT range.
The LEFT JOIN leads to finding just the new ids that are needed
(except for the rare possibility of a 'simultaneous' insert by another thread).
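For contrast, here is the simpler form that the LEFT JOIN replaces, shown only to illustrate the id-burning problem:
    # Anti-pattern: IGNORE discards rows whose host_name already exists in Hosts,
    # but an auto_increment id has already been allocated (burned) for each one:
    INSERT IGNORE INTO Hosts (host_name)
        SELECT DISTINCT host_name FROM Staging;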
More rationale:
Mapping table (in Index Cookbook)
SQL #2:
    # Also not in the main transaction, and it should be done with autocommit = ON.
    # This multi-table UPDATE sets the ids in Staging:
    UPDATE  Hosts AS n
        JOIN  Staging AS s  ON s.host_name = n.host_name
        SET  s.host_id = n.host_id;
This gets the IDs, whether already existing, set by another thread, or set by SQL #1.
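That completes the id swap; the batch can now be copied into the Fact table, as promised in the Overview. A minimal sketch, again using the hypothetical measure columns from above:
    # Every Staging row now has its host_id; append the batch to Fact:
    INSERT INTO Fact (host_id, ts, bytes)
        SELECT host_id, ts, bytes
            FROM Staging;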
Flip-Flop Staging
To batch-process without blocking ongoing ingestion, swap in a fresh Staging table:
    DROP TABLE StageProcess;
    CREATE TABLE StageProcess LIKE Staging;
    RENAME TABLE Staging TO tmp,
                 StageProcess TO Staging,
                 tmp TO StageProcess;
This may not seem like the shortest way to do it, but it has a key feature: the RENAME is atomic,
so ingestion never sees Staging missing, while SQL #1, SQL #2, and the copy into Fact run
against StageProcess without interfering with new inserts.
MySQL Documents by Rick James
HowTo Techniques for Optimizing Tough Tasks:
Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Chunking lengthy DELETE/UPDATE/etc.
Data Warehouse techniques:
Data Warehouse Overview
Summary Tables
High speed ingestion
Bulk Normalization
Schema and code design for large Sensor database
Entity-Attribute-Value (EAV) -- a common, poorly performing, design pattern; plus an alternative
Lat/Lng search to Find the nearest 10 pizza parlors
Lat/Long representation choices
Pagination, not with OFFSET, LIMIT
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
Alter of a Huge table -- Mostly obviated by 5.6
Efficient List of Latest 10 news articles
Build and execute a Pivot SELECT (showing rows as columns)
(Groupwise Max): Efficiently find largest row(s) for each group
Other Tips, Tuning, Debugging, Optimizations, etc...
Rick's RoTs (Rules of Thumb -- lots of tips)
Datatypes and building a good schema
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Trouble with UTF-8
If you want case folding, but accent sensitivity, please file a request at https://bugs.mysql.com.
Python tips,
PHP tips,
other language tips
utf8 Collations
utf8mb4 Collations on 8.0
Converting from MyISAM to InnoDB -- includes differences between them
Compound INDEXes plus other insights into the mysteries of INDEXing
Cookbook for Creating Indexes
Many-to-many mapping table
Handler counts
wp_postmeta
UNION+OFFSET
MySQL Limits -- built-in hard limits
767-byte INDEX limit
Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Analyze MySQL Performance
Analyze VARIABLEs and GLOBAL STATUS
Analyze SlowLog
My slides from conferences
MiniFest 2021 - Rick James & Daniel Black - Answering on Stack Overflow (+comments) - MariaDB Frontlines
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
Contact me via LinkedIn; be sure to include a brief teaser in the Invite request.