The Problem
You have lots of "sensor" data arriving at a regular pace. You need to store it compactly. You need to retrieve data for graphing, etc.
The problems might include the massive amount of disk space needed, sluggish queries, purging old data, summarizing data, etc.
I will discuss general cases, plus mention a few specific for weather, stocks, vehicles, etc. When I say "sensor", think "vehicle spedometer", "stock ticker", "weather gauge", etc.
This discussion applies to MySQL and MariaDB.
Sample schemas
This first schema assumes each metric is a separate row:
A generic schema...
CREATE TABLE Readings ( -- Example of multiple rows for each site
thing_id SMALLINT UNSIGNED NOT NULL, -- vehicle or other collection of sensors
dt TIMESTAMP NOT NULL, -- `TIMESTAMP` avoids Daylight Savings shifts
sensor_id SMALLINT UNSIGNED NOT NULL,
value FLOAT NOT NULL, -- `FLOAT` (4 bytes) .. `DECIMAL(6,4)` (3 bytes), etc
PRIMARY KEY (thing_id, dt, sensor_id) -- (need to see queries to pick optimal order)
INDEX(dt),
INDEX(sensor_id, dt) -- if you need to look at all of one type of sensor
) ENGINE=InnoDB
Some specific schemata...
Using 4 columns instead of 4 rows will cut the disk footprint by a factor of 2 or 3. So, it is often better to use the following, assuming one input delivers multiple metrics for one thing at one time, and they are put into a single row:
CREATE TABLE WeatherReadings (
location_id SMALLINT UNSIGNED NOT NULL, -- 2 bytes (max of 65K); JOIN to another table
dt TIMESTAMP NOT NULL,
temp DECIMAL(4,1) NOT NULL -- 3 bytes good range for Fahrenheit, acceptable precision
-- DECIMAL(3,1) or (4,2) for centigrade is only 2 bytes
humidity DECIMAL(2,0) or TINYINT -- 1 byte; `DECIMAL`(3,1)` -- 2 bytes
rain DECIMAL (4,2) -- 2 bytes; sufficient range for 1 hour, but not for one day
wind_speed ??
wind_bearing SMALLINT NOT NULL -- 2 bytes 0-359
...
PRIMARY KEY (location_id, dt),
INDEX(dt)
) ENGINE=InnoDB
CREATE TABLE VehicleReadings (
vehicle_id SMALLINT UNSIGNED NOT NULL, -- 2 bytes (max of 65K)
dt TIMESTAMP NOT NULL,
engine_on TINYINT NOT NULL, -- 1 byte; 0=OFF,1=ON
latitude FLOAT NOT NULL -- 4 bytes
longitude FLOAT NOT NULL -- 4 bytes
-- Note: that lat/lng has a resolution of 1.7 m / 5.6 ft (or better) anywhere in world
speed ... SIGNED NOT NULL, -- 4 bytes FLOAT, 3 bytes DECIMAL(4,1) for mph or km/h
direction SMALLINT NOT NULL -- 2 bytes 0-359
flags SET('wipers_on', 'lights_on', ...) -- 1 byte per 8; not cheap to search
obd_errors TEXT NOT NULL, -- usually empty
...
PRIMARY KEY (vehicle_id, dt),
INDEX(dt),
FULLTEXT(obd_errors) -- might be useful
) ENGINE=InnoDB
CREATE TABLE StockQuotes (
ticker VARCHAR(5) NOT NULL, -- <= 6 bytes
dt TIMESTAMP NOT NULL, -- or just DATE (3 bytes) for daily quotes
quote ?? -- Old quotes where in fractions; most new quotes are on 100ths,
-- but some penny stocks need more digits
open, close, high, low -- useful on daily (in place of quote); also in summary
volume ?? -- 'indexes' may exceed size of 'INT'.
PRIMARY KEY (ticker, dt),
INDEX(dt)
) ENGINE=InnoDB CHARACTER SET ascii -- at least for the tickers
FLOAT(10,8) is folly. FLOAT has only 24 bits of precision. Just say FLOAT or switch to a suitable DECIMAL(...). DOUBLE has a lot more precision, but takes a lot more space (8 bytes) and is rarely useful for anything that a sensor can measure.
SUM will convert to DOUBLE, so SUM(rain) will not overflow even if individual readings are constrained.
A summary table would need a datatype bigger than DECIMAL (4,2) for storing, say, annual rainfall. So, use DECIMAL(6,2) or FLOAT in summary tables.
The choice of 'units' will affect the specification of DECIMAL and may impact the number of bytes. Perhaps it is best to settle on the obvious metric units and convert as you store. FLOAT has sufficient range and precision for any sensor I can think of.
Monetary values should _not_ use FLOAT; instead, use DECIMAL with sufficient width. A simple rule of thumb: DECIMAL(m,n) occupies about m/2 bytes (fixed length).
A 1-byte difference in one column adds up to a 1GB difference for a billion rows; double that if it is in an index.
Lookup / Normalization
The sensor might be better called sensor_id, with another table mapping sensor_id to the name, location, type, brand, etc, of the particular sensor.
Indexing
When data arrives, it probably includes multiple sensor readings for a single timestamp. If you use the one-reading-per-row schema and have PRIMARY KEY (sensor, dt), then the inserts will go into several different spots in the table. (Keep in mind that the data is in a BTree ordered by the PK.) Even if you have a thousand sensors, this is not a burden on the system. There will be a thousand "hot spots"; inserts will be efficient.
If there are queries that look up all the readings for all sensors for a day, then add INDEX(dt) to facilitate such. Note that partition pruning (qv) will help (somewhat) with this, even if you don't have an INDEX.
If there is any chance for two readings for the same sensor in the same second, then decide
⚈ That is an error.
⚈ Take the first one by using INSERT IGNORE.
⚈ Take the last one by using INSERT ... ON DUPLICATE KEY UPDATE ... ("IODKU"/"upsert")
⚈ Don't use REPLACE, it probably takes longer since it is DELETE + INSERT. That is because it always tries to Delete.
⚈ Use fractional seconds as a partial remedy. (DATETIME(6) and TIMESTAMP(6) gives you microseconds but takes 8 bytes. DATETIME(2) and TIMESTAMP(2) takes only 1 extra byte over the base datatype.)
The Fact table should have very few indexes. Most other needs can be achieved by extra indexes on Summary tables.
Writing a date (or datetime) range:
SELECT *
FROM Readings
WHERE dt >= '2021-02-28'
AND dt < '2021-02-28' + INTERVAL 7 DAY
Note how that range works correctly for DATE, DATETIME, or TIMESTAMP, and handles leap days easily.
With DST (Daylight Stutter Time), once a year, DATETIME loses an hour; once a year, it repeats an hour. Hence, my recommendation to use TIMESTAMP instead.
Summarization
Build another table, a "Summary Table", to contain summary info for the minute or hour or day (depending on the frequency of data arrival and reporting needs).
⚈ Every second -- summarize by minute
⚈ up to 1 minute -- by hour
⚈ less frequent readings -- by day
The goal is to fold at least 10 'raw' rows into one 'summarized' row. Daily summarization can be rolled up into weekly summaries, etc. However, this is usually not needed.
Your "Fact" table (the Readings mentioned above) is likely to grow to a billion rows; this has challenges. Much of what this blog says is aimed at mitigating that.
I will assume that readings are received quite regularly. This lets you say AVG(value) when summarizing the hourly temperatures into a daily summary.
If the readings are sporadic, then it is probably more appropriate to keep sum_value from SUM(value) and COUNT(*) in ct, then compute, in the SELECT, the average as SUM(sum_value) / SUM(ct), just as will be done when rolling up to broader periods.
CREATE TABLE Summary (
sensor_id SMALLINT UNSIGNED NOT NULL, -- match entry in `Readings`
sensor_name VARCHAR(...) NOT NULL, -- match lookup value
dy DATE NOT NULL, -- assuming daily summarization
ct, -- how many readings were taken
sum_value FLOAT NOT NULL, -- or avg_value; see discussion
hi FLOAT NOT NULL, -- from MAX(value)
lo FLOAT NOT NULL,
PRIMARY KEY (sensor, dy), -- no "id"
INDEX(dy) -- if you might need "all sensors over a date range"
) ENGINE=InnoDB
Notes:
⚈ sensor_name -- it is OK to "denormalize" the Summary table. (Purists will argue with me.)
⚈ dy DATE if summarizing to a "day", else DATETIME or TIMESTAMP if using a smaller unit
⚈ To, for example, round to the hour: LEFT(MIN(dt), 13) or other calculations. (More discussion below.)
⚈ hi and lo are MAX(value) and MIN(value) in case you want to see, for example, the temperature range for a day. (Part of a "candlestick" in stock quotes.)
⚈ median and "time of hi" are more complex to compute. MySQL 8.0 and MariaDB 10.4 have a MEDIAN() function, but otherwise, I recommend writing some client code to compute such things.
⚈ If using DECIMAL, be sure to allow more space than in the Readings table so that it does not overflow.
⚈ DOUBLE (8 bytes) is overkill for most "sensor" data. FLOAT (4 bytes; about 7 significant digits) is usually more than sufficient, even for, say, rainfall for a year.
⚈ Do not use (m,n) on FLOAT or DOUBLE. (MySQL 8.0.17 deprecates it.)
⚈ Contrary to the discussion above, FLOAT (with 4 bytes) is likely to be better than DECIMAL, especially for values that are SUMmed.
(More discussion of Summary tables below.)
Size
Sensor data tends to be huge. So, it is best to minimize datatype sizes when initially designing the schema. This may avoid rapidly running out of disk space. And it will help avoid the pain of later doing ALTER TABLE on a huge table.
Do you really need data that is ten years old? You can probably live with the summary data. That is, you should seriously consider purging "old" raw data (in the "Fact" table), but keep the Summary Table(s) 'forever'. After all, the latter will be about one-tenth the size of the raw data.
And if you started with the PARTITIONing mentioned, you are ready for an efficient way to purge old raw data.
PARTITIONing for Purging old data
If you will hold all the data forever, use a single table. If you will purge old data, use PARTITION BY RANGE.
Partitioning is not likely to provide any performance advantage for SELECTs.
Manually splitting the data into tables by the year is even more clumsy and ill-advised (compared to Partitioning).
To install PARTITIONing after you have a billion rows will probably lock the table for hour(s). You would need some way to stall (or lose) INSERTs while doing it.
A better way to deal with the mass of data is to summarize all the data and keep only "recent" raw data. At that point, PARTITION BY RANGE(...) is the way to go. That allows DROP PARTITION to do the heavy lifting of DELETE without taking much time. That is the main way that Partitioning can help.
There is rarely a need to PARTITION a summary table. It is usually practical to keep that data 'forever'.
Note: When changing to/from PARTITIONing, you must rethink all the indexes, including the PRIMARY KEY. Given that the table is PARTITION BY RANGE(TO_DAYS(dy)), this is likely to be still valid: PRIMARY KEY(sensor_id, dy), except that dy now is the partition by value (start of day/hour/week/whatever). There is probably no need for INDEX(dy) since partitioning provides that (mostly).
The main benefit of Partitioning in this context is being able to rapidly delete 'old' data via DROP PARTITION instead of DELETE, plus REORGANIZE PARTITION to set up for the 'next' day/month/whatever.
More:
Partitioning
Y2K / Y2038
TIMESTAMPs will run out of steam in 2038. DATETIMEs have a problem at 2am-3am once a year when DST changes.
y2038 -- Well, that's 14 years off. Think about the computer you had 14 years ago. Is it still alive and being used? Instead, you should plan to replace all hardware and software every 5 years or so, or else it becomes nearly impossible to upgrade the server + network + OS + database + app, etc. Y2038 will soon be a hot topic and 'solutions' will be debated. Probably there will be a 'good' solution by your next 5-year upgrade of hardware+software+this_blog.
Zeros
Some readings (eg, rainfall) are often zero. It is likely to be best to not store any zero values.
Some readings depend on other things. For example, "distance traveled" for a vehicle is (hopefully) meaningless when the engine is off. Since you will be recording several readings every few seconds, it might be best not to store most of them if the engine is off.
Note that AVG(rainfall) or AVG(speed) will differ if zeros are missing versus present. This could be a strong reason for collecting SUM(rainfall) instead of AVG. And think through what SUM(ct) should be.
Multiple Sensors
Plan A: One reading per row. (This is what I am focusing on in this document.) This gets in trouble when the values are not all numeric, but instead "On/Off", "transmission in Drive", etc. In this case, leave out the rows rain=0. Then use LEFT JOIN and COALESCE (or IFNULL) to retrieve "0" from a missing row.
Plan B: Multiple columns (eg, temp, rain, wind_speed, wind_direction) in a single table. In this case, store "0", not "NULL" when there is no rain.
You should decide what makes sense in your situation.
Plan B saves a lot of space but does not have much flexibility. Adding a new sensor is costly (very slow ALTER), though not too bad in MySQL 8.0's ALTER TABLE.
Summarize Immediately?
Build and maintain a "Summary Table" for most reports, graphs, etc. This would typically be a table with one or more 'dimensions' plus a 'date', plus counts and subtotals for each date. It allows for such reports to be presented much faster.
_Summary blog_
For summarizing on-the-fly, use IODKU (upsert), to keep the Summary Table(s) up to date. For really rapid ingestion, there may be an issue of keeping up. (This is not likely to be a problem for Sensor data. More discussion about ingestion speed:
High speed ingestion
INSERT INTO Summary (sensor, dy, value, ct, sum_value, lo, hi) VALUES ( ?, -- sensor_id ?, -- sensor -- denormalized from table Sensors DATE(?), -- or another mapping to the summary table's resolution ?, -- value 1, -- for ct ?, -- value (for sum_value) ?, -- value (for open) ?, -- value (for close) ?, -- value (for lo) ?, -- value (for hi) ON DUPLICATE KEY UPDATE ct = VALUES(ct) + 1, sum_value = sum_value + VALUES(value), -- no action for open; the 'first' value is "open" close = VALUES(value), lo = LEAST(lo, VALUES(lo)), hi = GREATEST(hi, VALUES(hi)) ;
ON DUPLICATE KEY UPDATE ct = ct + 1, sum_value = sum_value + NEW.value, close = NEW.value, lo = LEAST(lo, NEW.lo), hi = GREATEST(hi, NEW.hi) ;
Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Chunking lengthy DELETE/UPDATE/etc.
Data Warehouse techniques:
Data Warehouse Overview
Summary Tables
High speed ingestion
Bulk Normalization
Schema and code design for large Sensor database
Entity-Attribute-Value (EAV) -- a common, poorly performing, design pattern; plus an alternative
Lat/Lng search to Find the nearest 10 pizza parlors
Lat/Long representation choices
Pagination, not with OFFSET, LIMIT
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
Alter of a Huge table -- Mostly obviated by 5.6
Efficient List of Latest 10 news articles
Build and execute a Pivot SELECT (showing rows as columns)
(Groupwise Max): Efficiently find largest row(s) for each group
Other Tips, Tuning, Debugging, Optimizations, etc...
Rick's RoTs (Rules of Thumb -- lots of tips)
Datatypes and building a good schema
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Trouble with UTF-8
If you want case folding, but accent sensitivity, please file a request at https://bugs.mysql.com .
Python tips,
PHP tips,
other language tips
utf8 Collations
utf8mb4 Collations on 8.0
Converting from MyISAM to InnoDB -- includes differences between them
Compound INDEXes plus other insights into the mysteries of INDEXing
Cookbook for Creating Indexes
Many-to-many mapping table
Handler counts
wp_postmeta
UNION+OFFSET
MySQL Limits -- built-in hard limits
767-byte INDEX limit
Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Analyze MySQL Performance
Analyze VARIABLEs and GLOBAL STATUS
Analyze SlowLog
My slides from conferences
MiniFest 2021 - Rick James & Daniel Black - Answering on Stack Overflow(+comments) - MariaDB Frontlines
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
Contact me via LinkedIn; be sure to include a brief teaser in the Invite request: