MySQL database design for Sensor Data

Table of Contents

The Problem

Sample schemas

Lookup / Normalization

Indexing

Writing a date (or datetime) range:

Summarization

Size

PARTITIONing for Purging old data

Y2K / Y2038

Zeros

Multiple Sensors

Summarize Immediately?

Summarize Nightly?

Multiple summary tables?

Other statistics

Graphing

max_connections

Postlog

References

Brought to you by Rick James


The Problem


You have lots of "sensor" data arriving at a regular pace. You need to store it compactly. You need to retrieve data for graphing, etc.

The problems might include the massive amount of disk space needed, sluggish queries, purging old data, summarizing data, etc.

I will discuss general cases, plus mention a few specifics for weather, stocks, vehicles, etc. When I say "sensor", think "vehicle speedometer", "stock ticker", "weather gauge", etc.

This discussion applies to MySQL and MariaDB.

Sample schemas


This first schema assumes each metric is a separate row:
CREATE TABLE Readings (  -- Example of multiple rows for each site
    thing_id SMALLINT UNSIGNED NOT NULL,
    dt TIMESTAMP NOT NULL,  -- `TIMESTAMP` avoids confusion on nights when shifting to/from daylight saving time
    sensor_id SMALLINT UNSIGNED NOT NULL,
    value FLOAT NOT NULL,   -- `FLOAT` (4 bytes) .. `DECIMAL(6,4)` (3 bytes), etc
    PRIMARY KEY (thing_id, dt, sensor_id),  -- (need to see queries to pick optimal order)
    INDEX(dt),
    INDEX(sensor_id, dt)    -- if you need to look at all of one type of sensor
) ENGINE=InnoDB;

Using 4 columns instead of 4 rows will cut the disk footprint by a factor of 2 or 3. So, it is often better to use the following, assuming one input delivers multiple metrics for one thing at one time, and they are put into a single row:
CREATE TABLE WeatherReadings (
    location_id SMALLINT UNSIGNED NOT NULL, -- 2 bytes (max of 65K); JOIN to another table
    dt TIMESTAMP NOT NULL,
    temp DECIMAL(4,1) NOT NULL, -- 3 bytes; good range for Fahrenheit, acceptable precision
    --   DECIMAL(3,1) or (4,2) for centigrade is only 2 bytes
    humidity DECIMAL(2,0) or TINYINT, -- 1 byte; DECIMAL(3,1) -- 2 bytes
    rain DECIMAL(4,2),  -- 2 bytes; sufficient range for 1 hour, but not for one day
    wind_speed  ??,
    wind_bearing SMALLINT NOT NULL, -- 2 bytes; 0-359
    ...
    PRIMARY KEY (location_id, dt),
    INDEX(dt)
) ENGINE=InnoDB;
CREATE TABLE VehicleReadings (
    vehicle_id SMALLINT UNSIGNED NOT NULL, -- 2 bytes (max of 65K)
    dt TIMESTAMP NOT NULL,
    engine_on TINYINT NOT NULL, -- 1 byte; 0=OFF, 1=ON
    latitude  FLOAT NOT NULL,   -- 4 bytes
    longitude FLOAT NOT NULL,   -- 4 bytes
    -- Note: FLOAT lat/lng has a resolution of 1.7 m / 5.6 ft (or better) anywhere in the world
    speed ... SIGNED NOT NULL,  -- 4 bytes FLOAT, or 3 bytes DECIMAL(4,1), for mph or km/h
    direction SMALLINT NOT NULL, -- 2 bytes; 0-359
    flags SET('wipers_on', 'lights_on', ...),  -- 1 byte per 8 flags; not cheap to search
    obd_errors VARCHAR(999) NOT NULL,  -- usually empty
    ...
    PRIMARY KEY (vehicle_id, dt),
    INDEX(dt),
    FULLTEXT(obd_errors)        -- might be useful
) ENGINE=InnoDB;
CREATE TABLE StockQuotes (
    ticker VARCHAR(5) NOT NULL,     -- <= 6 bytes
    dt TIMESTAMP NOT NULL,          -- or just DATE (3 bytes) for daily quotes
    quote ??, -- Old quotes were in fractions; most new quotes are in 100ths, but some penny stocks need more
    open, close, high, low,  -- useful for daily quotes (in place of quote); also in a summary table
    volume ??, -- volume on 'indexes' may exceed the size of INT
    PRIMARY KEY (ticker, dt),
    INDEX(dt)
) ENGINE=InnoDB CHARACTER SET ascii;  -- at least for the tickers

FLOAT(10,8) is folly. FLOAT has only 24 bits of precision. Just say FLOAT or switch to a suitable DECIMAL(...). DOUBLE has a lot more precision, but takes a lot more space (8 bytes) and is rarely useful for anything that a sensor can measure.

SUM will convert to DOUBLE, so SUM(rain) will not overflow even if individual readings are constrained. A summary table would need a datatype bigger than DECIMAL(4,2) for storing, say, annual rainfall.
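For example, a minimal sketch of the sizing issue (the column sizes are assumptions):

-- Hourly rain readings fit in DECIMAL(4,2) (max 99.99),
-- but a year's total can easily exceed that:
SELECT SUM(rain)        -- SUM() will not overflow while computing
    FROM WeatherReadings
    WHERE location_id = ?
      AND dt >= '2023-01-01'
      AND dt <  '2023-01-01' + INTERVAL 1 YEAR;
-- A summary table storing that total needs, say, DECIMAL(6,2) (max 9999.99).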

The choice of 'units' will affect the specification of DECIMAL and may impact the number of bytes. Perhaps it is best to settle on the obvious metric units and convert as you store. FLOAT has sufficient range and precision for any sensor I can think of.

A 1-byte difference adds up to a 1GB difference for a billion rows; double that if it is in an index.

Lookup / Normalization


The sensor might be better called sensor_id, with another table mapping sensor_id to the name, location, type, brand, etc, of the particular sensor.
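A minimal sketch of such a lookup table (the column names and sizes are assumptions):

CREATE TABLE Sensors (
    sensor_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(64) NOT NULL,
    location VARCHAR(64) NOT NULL,
    type VARCHAR(32) NOT NULL,    -- eg, 'temp', 'rain', 'wind'
    brand VARCHAR(32) NOT NULL,
    PRIMARY KEY (sensor_id),
    UNIQUE(name)
) ENGINE=InnoDB;

A report then JOINs the Fact table to this table via sensor_id whenever it needs the name, etc.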

Indexing


When data arrives, it probably includes multiple sensor readings for a single timestamp. If you use the one-reading-per-row schema and have PRIMARY KEY(sensor_id, dt), then the inserts will go into several different spots in the table. (Keep in mind that the data is in a BTree ordered by the PK.) Even if you have a thousand sensors, this is not a burden on the system. There will be a thousand "hot spots"; inserts will still be efficient.

If there are queries that look up all the readings for all sensors for a day, then add INDEX(dt) to facilitate such. Note that partition pruning (qv) will help (somewhat) with this, even if you don't have an INDEX.

If there is any chance for two readings for the same sensor in the same second, then decide

    ⚈  That is an error.
    ⚈  Take the first one by using INSERT IGNORE. (Sketched below.)
    ⚈  Take the last one by using INSERT ... ON DUPLICATE KEY UPDATE ... ("IODKU"/"upsert"). (Sketched below.)
    ⚈  Don't use REPLACE; it is effectively DELETE + INSERT, and it always attempts the Delete, so it probably takes longer.
    ⚈  Use fractional seconds as a partial remedy. (DATETIME(6) and TIMESTAMP(6) give you microseconds but take 8 bytes; DATETIME(2) and TIMESTAMP(2) take only 1 extra byte over the base datatype.)
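Minimal sketches of the two "keep" options, assuming the one-reading-per-row Readings table above:

INSERT IGNORE INTO Readings (thing_id, dt, sensor_id, value)
    VALUES (?, ?, ?, ?);    -- the first reading in a second wins

INSERT INTO Readings (thing_id, dt, sensor_id, value)
    VALUES (?, ?, ?, ?)
    ON DUPLICATE KEY UPDATE value = VALUES(value);    -- the last reading wins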

The Fact table should have very few indexes. Most other needs can be achieved by extra indexes on Summary tables.

Writing a date (or datetime) range:

SELECT *
    FROM Readings
    WHERE dt >= '2021-02-28'
      AND dt  < '2021-02-28' + INTERVAL 1 DAY

Note how that range works correctly for DATE, DATETIME, or TIMESTAMP, and handles leap days easily.

With DST (Daylight Saving Time), once a year the clock loses an hour, and once a year it repeats an hour; DATETIME inherits both quirks. Hence, my recommendation to use TIMESTAMP instead.

Summarization


Build another table, a "Summary Table", to contain summary info for the minute or hour or day (depending on the frequency of data arrival and reporting needs).

    ⚈  Every second -- summarize by minute
    ⚈  up to 1 minute -- by hour
    ⚈  less frequent readings -- by day

The goal is to fold at least 10 'raw' rows into one 'summarized' row. Daily summarization can be rolled up into weekly summaries, etc. However, this is usually not needed.

Your "Fact" table (the Readings mentioned above) is likely to grow to a billion rows; this has challenges. Much of what this blog says is aimed at mitigating that.

I will assume that readings are received quite regularly. This lets you say AVG(value) when summarizing the hourly temperatures into a daily summary.

If the readings are sporadic, then it is probably more appropriate to keep SUM(value) in sum_value and COUNT(*) in ct; then, in the SELECT, compute the average as SUM(sum_value) / SUM(ct), just as will be done when rolling up to broader periods.
CREATE TABLE Summary (
    sensor_id SMALLINT UNSIGNED NOT NULL,  -- match entry in `Readings`
    sensor_name VARCHAR(...) NOT NULL,  -- match lookup value
    dy DATE NOT NULL,          -- assuming daily summarization
    ct INT UNSIGNED NOT NULL,  -- how many readings were taken
    sum_value FLOAT NOT NULL,  -- or avg_value; see discussion
    hi FLOAT NOT NULL,         -- from MAX(value)
    lo FLOAT NOT NULL,         -- from MIN(value)
    PRIMARY KEY (sensor_id, dy),  -- no "id"
    INDEX(dy)                  -- if you might need "all sensors over a date range"
) ENGINE=InnoDB;

Notes:

    ⚈  sensor_name -- it is OK to "denormalize" the Summary table. (Purists will argue with me.)
    ⚈  dy DATE if summarizing to a "day", else DATETIME or TIMESTAMP if using a smaller unit
    ⚈  To, for example, round to the hour: LEFT(MIN(dt), 13) or other calculations. (More discussion below.)
    ⚈  hi and lo are MAX(value) and MIN(value) in case you want to see, for example, the temperature range for a day. (Part of a "candlestick" in stock quotes.)
    ⚈  median and "time of hi" are more complex to compute. MariaDB (10.3+) has a MEDIAN() function; MySQL does not, though 8.0's window functions can help. Otherwise, I recommend writing some client code to compute such things.
    ⚈  If using DECIMAL, be sure to allow more space so that it does not overflow.
    ⚈  DOUBLE (8 bytes) is overkill for most "sensor" data. FLOAT (4 bytes; about 7 significant digits) is usually more than sufficient, even for, say, rainfall for a year.
    ⚈  Do not use (m,n) on FLOAT or DOUBLE. (MySQL 8.0.17 deprecates it.)
    ⚈  Contrary to the discussion above, FLOAT (with 4 bytes) is likely to be better than DECIMAL, especially for values that are SUMmed.

(More discussion of Summary tables below.)

Size


Sensor data tends to be huge. So, it is best to minimize datatype sizes when initially designing the schema. This may avoid rapidly running out of disk space. And it will help avoid the pain of later doing ALTER TABLE on a huge table.

Do you really need data that is ten years old? You can probably live with the summary data. That is, you should seriously consider purging "old" raw data (in the "Fact" table), but keep the Summary Table(s) 'forever'. After all, the latter will be about one-tenth the size of the raw data.

And if you started with the PARTITIONing mentioned, you are ready for an efficient way to purge old raw data.

PARTITIONing for Purging old data


If you will hold all the data forever, use a single table. If you will purge old data, use PARTITION BY RANGE.

Partitioning is not likely to provide any performance advantage for SELECTs.

Manually splitting the data into tables by the year is even more clumsy and ill-advised (compared to Partitioning).

To install PARTITIONing after you have a billion rows will probably lock the table for hour(s). You would need some way to stall (or lose) INSERTs while doing it.

A better way to deal with the mass of data is to summarize all the data and keep only "recent" raw data. At that point, PARTITION BY RANGE(...) is the way to go. That allows DROP PARTITION to do the heavy lifting of DELETE without taking much time. That is the main way that Partitioning can help.
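A minimal sketch, assuming daily partitions on the raw table (the partition names and dates are illustrative; for a DATE or DATETIME column, use TO_DAYS(dt) instead of UNIX_TIMESTAMP(dt)):

ALTER TABLE Readings
    PARTITION BY RANGE (UNIX_TIMESTAMP(dt)) (
        PARTITION p20230701 VALUES LESS THAN (UNIX_TIMESTAMP('2023-07-02')),
        PARTITION p20230702 VALUES LESS THAN (UNIX_TIMESTAMP('2023-07-03')),
        PARTITION pFuture   VALUES LESS THAN (MAXVALUE)
    );

-- Nightly purge: nearly instantaneous, unlike a big DELETE
ALTER TABLE Readings DROP PARTITION p20230701;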

There is rarely a need to PARTITION a summary table. It is usually practical to keep that data 'forever'.

Note: When changing to/from PARTITIONing, you must rethink all the indexes, including the PRIMARY KEY. Given that the table is PARTITION BY RANGE(TO_DAYS(dy)), PRIMARY KEY(sensor_id, dy) is likely to still be valid, except that dy is now the partition key (start of day/hour/week/whatever). There is probably no need for INDEX(dy), since partitioning provides that (mostly).

More:
Partitioning

Y2K / Y2038


TIMESTAMPs will run out of steam in 2038. DATETIMEs have a problem at 2am-3am once a year when DST changes. Y2038 -- well, that's 15 years off. Think about the computer you had 15 years ago; is it still alive and being used? Plan to replace all hardware and software every 5 years or so; otherwise it becomes nearly impossible to upgrade the server + network + OS + database + app, etc.

Zeros


Some readings (eg, rainfall) are often zero. It is likely to be best to not store any zero values.

Some readings depend on other things. For example, "distance traveled" for a vehicle is (hopefully) meaningless when the engine is off. Since you will be recording several readings every few seconds, it might be best not to store most of them if the engine is off.

Note that AVG(rainfall) or AVG(speed) will differ if zeros are missing versus present. This could be a strong reason for collecting SUM(rainfall) instead of AVG. And think through what SUM(ct) should be.
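A sketch of the pitfall, assuming hourly rain readings in the one-row-per-reading schema, with rain=0 rows omitted:

-- Averages only the stored (nonzero) readings -- biased high:
SELECT AVG(value) FROM Readings
    WHERE sensor_id = ?
      AND dt >= '2023-07-01' AND dt < '2023-07-01' + INTERVAL 1 DAY;

-- True hourly average: divide the SUM by all 24 slots, present or not:
SELECT SUM(value) / 24 FROM Readings
    WHERE sensor_id = ?
      AND dt >= '2023-07-01' AND dt < '2023-07-01' + INTERVAL 1 DAY;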

Multiple Sensors


Plan A: One reading per row. (This is what I am focusing on in this document.) This gets in trouble when the values are not all numeric, but instead "On/Off", "transmission in Drive", etc. With this plan, leave out the rain=0 rows; then use LEFT JOIN and COALESCE (or IFNULL) to turn the missing rows back into "0".

Plan B: Multiple columns (eg, temp, rain, wind_speed, wind_direction) in a single table. In this case, store "0", not NULL, when there is no rain.

You should decide what makes sense in your situation. Plan B saves a lot of space but does not have much flexibility. Adding a new sensor requires a costly (very slow) ALTER TABLE, though MySQL 8.0's instant ADD COLUMN makes that much less painful.
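A sketch of the Plan A read-back; the Hours helper table (one row per hour) is hypothetical:

SELECT h.hr,
       COALESCE(r.value, 0) AS rain
    FROM Hours AS h
    LEFT JOIN Readings AS r
           ON r.sensor_id = ?    -- the rain sensor
          AND r.dt = h.hr
    WHERE h.hr >= '2023-07-01'
      AND h.hr <  '2023-07-01' + INTERVAL 1 DAY;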

Summarize Immediately?


By using IODKU (upsert), you can keep the Summary Table(s) up to date continually. For really rapid ingestion, there may be an issue of keeping up; this is not likely to be a problem for sensor data. More discussion about ingestion speed:
High speed ingestion

The example below assumes the Summary table also has open and close columns (as for stock candlesticks):
INSERT INTO Summary (sensor_id, dy, ct, sum_value, open, close, lo, hi)
    VALUES (
        ?,       -- sensor_id
        DATE(?), -- or another mapping to the summary table's resolution
        1,       -- for ct
        ?,       -- value  (for sum_value)
        ?,       -- value  (for open)
        ?,       -- value  (for close)
        ?,       -- value  (for lo)
        ?        -- value  (for hi)
    )
    ON DUPLICATE KEY UPDATE
        ct = ct + 1,
        sum_value = sum_value + VALUES(sum_value),
        -- no action for open; the 'first' value is "open"
        close = VALUES(close),
        lo = LEAST(lo, VALUES(lo)),
        hi = GREATEST(hi, VALUES(hi));

MySQL 8.0 (as of 8.0.19) is moving toward a new syntax, a row alias that replaces the VALUES() function:

INSERT INTO Summary (sensor_id, dy, ct, sum_value, open, close, lo, hi)
    VALUES (...) AS new
    ON DUPLICATE KEY UPDATE
        ct = ct + 1,
        sum_value = sum_value + new.sum_value,
        -- no action for open; the 'first' value is "open"
        close = new.close,
        lo = LEAST(lo, new.lo),
        hi = GREATEST(hi, new.hi);

open and close assume that the readings arrive in chronological order.

Convert datetime to hour: FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(dt) / 3600) * 3600)

For sensor data, consider min, max, avg.

For vehicle monitoring, consider throwing a bunch of exception flags into a SET or INT of a suitable size. (By that, I mean SMALLINT, not INT(5). The '(5)' is meaningless and deprecated.)

Summarize Nightly?


This is more efficient, but the current day's data will not yet be in the summary table. It also gets tricky if the data is delayed for a day and/or you need columns like open and close.
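A minimal sketch of the nightly rollup, assuming the Readings, Sensors, and Summary tables sketched above, run just after midnight for "yesterday":

INSERT INTO Summary (sensor_id, sensor_name, dy, ct, sum_value, hi, lo)
    SELECT r.sensor_id,
           s.name,
           DATE(r.dt),
           COUNT(*),
           SUM(r.value),
           MAX(r.value),
           MIN(r.value)
        FROM Readings AS r
        JOIN Sensors AS s USING (sensor_id)   -- the lookup table sketched earlier
        WHERE r.dt >= CURDATE() - INTERVAL 1 DAY
          AND r.dt <  CURDATE()
        GROUP BY r.sensor_id, DATE(r.dt);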

Multiple summary tables?


For summarizing different things, you may need multiple tables. For summarizing different time spans, usually one summary table is enough. If not, roll up, say, the hourly summary table to get the weekly table. (There are tradeoffs.)

Other statistics


MEDIAN and other Percentiles -- Upgrade to MySQL 8.0 or MariaDB 10.3. There may be enough functions to graph stock "candlesticks".

Standard deviation can also be rolled up, if you keep COUNT(*), SUM(value), and SUM(value*value) in the summary table.
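Using the identity Var(x) = E[x²] - (E[x])², a sketch of that rollup; sum_sq is an assumed extra Summary column holding SUM(value*value):

SELECT SQRT( SUM(sum_sq)/SUM(ct) - POW(SUM(sum_value)/SUM(ct), 2) ) AS std_dev
    FROM Summary
    WHERE sensor_id = ?
      AND dy >= ? AND dy < ?;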

Graphing


Graphing a million points of data will choke the typical graphing product. But graphing, say, hourly summary info over the span of a month is quite reasonable.

In many situations, just the "last" or "average" value for each unit of time is "good enough" for graphing. In other situations, a "candlestick" graph is called for. This involves (at least) the MAX and MIN.

Note that MAX(max) and MIN(min) are useful when summarizing a summary. But AVG(avg) is not mathematically correct; instead, use SUM(sum)/SUM(ct). On the other hand, what should you do if a few readings are missing from a day's temperature readings?

Some numbers: There are 1440 minutes in a day and about 30M seconds in a year. When designing a graph/chart for a device with, say, 1000 pixels of width, one reading per minute fits nicely into a day's chart. But 30M points for a year would be gross overkill; think summary table.

max_connections


If the sensor data is coming reasonably constantly and takes very little time to process, then throttle the connections. For example, set max_connections=100 and back_log=1000. If a burst occurs, this should delay connections rather than clog up the server.
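A sketch of the corresponding my.cnf settings (the numbers are the examples above, not universal recommendations; back_log takes effect only at server startup):

[mysqld]
max_connections = 100    # cap on simultaneous connections; extras wait in line
back_log = 1000          # TCP listen queue that absorbs connection bursts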

Consider what would happen if you let 10000 people into a grocery store at the same time. Traffic would be so clogged up that people might take an hour to buy just one item.

Postlog


Pre-released in 2022; cleaned up July, 2023

References


My discussion of Summary Tables
My discussion of Sensor data collected from vehicles
Summarization
Summary Tables
An example

Zillow lookalike (Real Estate)

50 sensors every 5 seconds

250M rows; 1500 sensors

Financial -- History+Current

Auto_inc vs natural PK

Vehicle monitoring (500/second)
