Best Practices for Datatypes in a MySQL Schema

Table of Contents

InnoDB

INT Datatypes

FLOAT and DOUBLE Datatypes

DECIMAL Datatypes

String Datatypes

VARCHAR vs CHAR vs TEXT

DATE and TIME

ENUM Datatype

ENUM vs TINYINT vs VARCHAR

SET Datatype

Other Datatypes

NULL vs. NOT NULL

Empty vs. NULL

DEFAULT

GENERATED columns

PRIMARY KEY

Natural versus Surrogate PRIMARY KEY

Many-to-many Mapping

Secondary Keys

FOREIGN KEYs

One-to-one Mapping / Hierarchy

PARTITIONing

SHOW CREATE TABLE is more descriptive than DESCRIBE

What to INDEX

Normalize

Computing row size

Postlog

Brought to you by Rick James


UNDER CONSTRUCTION

InnoDB

This blog mostly assumes you are using InnoDB, which is the preferred Engine.

When initially creating a schema, especially one for a large table, be sure to minimize the size of each column. Changing a BIGINT to INT will save 4 gigabytes in a billion-row table. (Or more, if it is used in an index.)

The goal of this blog is to help you pick the best datatypes, and to give you tips related to the choices.

INT Datatypes

Use the smallest type practical. Use UNSIGNED where appropriate, which is most of the time.
    Datatype bytes  SIGNED         UNSIGNED
    -------- -----  -----------    ----------
    BIGINT       8  ~+/-9e18       0..~18e18
    INT          4  ~+/-2 billion  0..~4 billion
    MEDIUMINT    3  ~+/-8 million  0..~16 million
    SMALLINT     2  -32768..+32767 0..65536
    TINYINT      1  -128..+127     0..255

    ⚈  INT(5) is 4 bytes; the '5' means nothing except...
    ⚈  zipcode INT(5) ZEROFILL would give you leading zeros for US zipcodes.

FLOAT and DOUBLE Datatypes


    ⚈  FLOAT has about 7 significant digits and a range of about 10^-38 to 10^+38.
    ⚈  DOUBLE has about 16 significant digits and a range of about 10^-308 to 10^+308.
    ⚈  Do not use (m,n) on FLOAT or DOUBLE; it only leads to an extra rounding between binary and decimal. If you are tempted to use such, then switch to the DECIMAL datatype.
    ⚈  When displaying, consider using ROUND() or FORMAT() for limiting the number of decimal places. FORMAT also adds "thousands separators" and optionally understands 'locale' differences such as . versus ,.

DECIMAL Datatypes


    ⚈  Use DECIMAL for monitary values.
    ⚈  DECIMAL(m,n) occupies approximately m/2 bytes.
    ⚈  DECIMAL is always SIGNED.
    ⚈  There is a limit of (m-n) digits to the left of the decimal point. Hence DECIMAL(8,2) allows up to a million dollars or Euros, to the cent.

String Datatypes


    ⚈  Use CHAR only when the string really is a fixed length.
    ⚈  VARCHAR is preferred over TINYTEXT/TEXT. There is no practical use of TINYTEXT instead of an equivalent VARCHAR.
    ⚈  LONGTEXT is likely to run into various non-string limits (packet_size, etc) long before you can get 4 billion bytes.
    ⚈  The number in CHAR and VARCHAR is 'characters', not 'bytes'. The number in variations of TEXT is 'bytes'.
    ⚈  CHARACTER SET utf8 takes up to 3 bytes per character for VARCHAR and TEXT.
    ⚈  For some ROW_FORMATs in InnoDB, CHAR(10) CHARACTER SET utf8 will occupy only 10 bytes. But don't depend on such.
    ⚈  Don't blindly say VARCHAR(255)`; use a reasonable limit. (Else tmp tables, index limits, etc may have issues.)
    ⚈  See "Other Datatypes" below for country_code, IP addresses, etc.
    ⚈  Do not put strictly numeric values into VARCHAR; WHERE clauses may suffer. In particular WHERE varchar_col = 1234 will convert each row's varchar_col to numeric to perform the comparison.
    ⚈  Use CHARACTER SET ascii (or latin1) for hex, various codes, etc, where no accents are ever possible.
    ⚈  Use CHARACTER SET utf8mb4 for general text; this provides full international support. In particular, utf8 is incomplete for Emoji and Chinese.
Character storage

VARCHAR vs CHAR vs TEXT


    ⚈  In almost all cases, VARCHAR is preferrable to CHAR.
    ⚈  TEXT (of various sizes) is necessary when you get beyond VARCHAR's limit.
    ⚈  Use CHAR for truly fixed-length columns.
    ⚈  Almost always fixed-length columns are ascii, so use CHARACTER SET ascii (or latin1).
    ⚈  If you will be storing text in multiple languages, be sure to have CHARACTER SET utf8mb4. (utf8 is a compromise in older versions.)

VARCHAR vs TEXT

Examples:

    ⚈  country_code CHAR(2)
    ⚈  postal_code CHAR(6) -- this works for Britain, but probably not for the world in general
    ⚈  uuid, guid, md5, sha256, etc -- because they are hex, plus -
    ⚈  ipv4, ipv6 -- decimal or hex, plus limited puncutuation.
    ⚈  phone numbers -- In US, you can use CHAR, but for international, use VARCHAR.

DATE and TIME


    ⚈  Date_format(Now(), '%Y-%m-%d') is the same as CURRENT_DATE().
    ⚈  Don't split a DATETIME into two columns (a DATE and a TIME) -- it is usually easier to pick apart a DATETIME than to manipulate two columns
    ⚈  DATETIME is like a picture of the clock on your wall (in your timezone); TIMESTAMP is an instant in time.
    ⚈  TIMESTAMP is stored in UTC, and converted from local time when storing, and converted to localtime when fetching.
    ⚈  If you need more control over timezones, you are on your own.
    ⚈  (n) can be added to DATETIME/TIME/TIMESTAMP to get "fractional seconds" as low as microseconds (6).
    ⚈  Do not use any time-oriented column as a UNIQUE (or PRIMARY) key unless you have business logic to confirm that two rows with identical values cannot coexist.
    ⚈  DATE and DATETIME literals should be strings. You do not need to cast a string to a DATE (etc). That is, do not do CAST('2016-07-14' AS DATE) or DATE('2016-07-14'). More discussion:
stackoverflow
    ⚈  When expressing a date range, don't worry about leap years, number of days in a month, etc; instead express it thus:
    AND `Date` >= '2016-02-24'
    AND `Date`  < '2016-02-24' + INTERVAL 1 WEEK

ENUM Datatype


There are religious battles waging over whether to ever use ENUM. Here are some tips if you choose to use EHUM:

    ⚈  ENUMs are stored internally as a 1- or 2-byte number.
    ⚈  Advantage: You use strings instead of numbers; this makes code more readable.
    ⚈  Up to 255 options: an ENUM is only one byte.
    ⚈  Typically it is unwise to have an ENUM "nullable". Instead, make the first option something like 'unknown' or 'N/A'.
    ⚈  Don't have more than a very few options in an ENUM.
    ⚈  If you do need to add an option to an ENUM, the ALTER is essentially free if you add the new option on the end, and don't exceed 255 options.
    ⚈  Due to low cardinality, it is rarely useful to index an ENUM by itself. However, it may be quite useful to use it in a 'composite' (multi-column) index.
    ⚈  If I insert noon in a DATETIME, it will say noon for you wherever you are. But inserting noon into a TIMESTAMP will show up differently on the other side of the world. Which effect do you want?
    ⚈  Example of an ENUM with a lot of optiona: rating ENUM('unrated', 'G', 'PG', 'PG-13', 'R', 'NC-17') DEFAULT 'unrated'

ENUM vs TINYINT vs VARCHAR


Suppose you have
CREATE TABLE ( ...
    col_enum ENUM('yes', 'no', 'maybe'),
    col_tiny TINYINT,     -- 0='yes', 1='no', 2='maybe'
    col_char VARCHAR(5),  -- 'yes', 'no', 'maybe'
    ,,,
What are the pros/cons of those 2 choices?

    ⚈  col_char is the bulkiest;
    ⚈  col_tiny gives no clue of "meaning";
    ⚈  col_enum cannot be extended to now values without doing an ALTER

You pick what you like. There is no solid winner.

SET Datatype


    ⚈  Rarely used.
    ⚈  Think of it as a multi-valued ENUM.
    ⚈  A SET cannot have more than 64 options.
    ⚈  It occupies CEIL(n/8) bytes, where n is the number of options.
    ⚈  It may be easier to use some size of INT instead of a SET, especially if you do boolean arithmetic.
    ⚈  It is not useful to index a SET column.
    ⚈  Example (refering to movies): special_features SET('Trailers', 'Commentaries', 'Deleted Scenes', 'Behind the Scenes')

Other Datatypes


    ⚈  Latitude/Longitude: DOUBLE is grossly excessive. See this for other options:
Representations
    ⚈  IPv4 is no bigger than 123.123.123.123 -- VARCHAR(15) CHARACTER SET ascii, which can be converted to BINARY(4) via inet_aton(). The former can be compared only for [in]equality; the latter can be used for ranges.
    ⚈  IPv6 is no bigger than ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff -- VARCHAR(39) CHARACTER SET ascii, which can be converted to BINARY(16) via inet6_aton(). The former can be compared only for [in]equality; the latter can be used for ranges.
    ⚈  Money: See DECIMAL, but be aware that different countries need different number of decimal places; 4 might be the current max. As for leading digits -- that is rather large, and growing daily.
    ⚈  Zipcode -- US zip-5 could be stored in CHAR(5) CHARACTER SET ascii or MEDIUMINT(5) UNSIGNED ZEROFILL. (Note the very rare usage of ZEROFILL.) Zip-9, Postal Codes, etc probably deserve VARCHAR(n) CHARACTER SET ascii with a suitable n.
    ⚈  GUID/UUID -- These need CHAR(36) CHARACTER SET ascii. Better yet, convert to BINARY(16) with functions available in MySQL 8.0 or found in UUID functions
    ⚈  MD5/SHA1/etc -- These are like GUIDs/UUIDs, but need different sizes and can be converted via HEX() and UNHEX().
    ⚈  JSON -- MySQL and MariaDB vary on this.

NULL vs. NOT NULL


    ⚈  Unfortunately, NULL is the default.
    ⚈  If NULLable takes extra space, it is probably minuscule compared to the rest of the space taken. So, I don't make the argument on space.
    ⚈  A PRIMARY KEY cannot contain any nullable columns.
    ⚈  A UNIQUE key can contain a nullable column.
    ⚈  Use NOT NULL unless there you need an 'out of band' NULL value.

Some uses for NULL:
    ⚈  Don't have the value yet.
    ⚈  "Not applicable"
    ⚈  Value (or reference to other table) has been removed
    ⚈  "Unknown"

Empty vs. NULL


one discussion

DEFAULT


(except AUTO_INCREMENT, TIMESTAMP...)

GENERATED columns


If PERSIST, then they take space.

PRIMARY KEY


    ⚈  Always explicitly provide one. While not an absolute requirement, it signals that you have thought about what the PK should be.
    ⚈  Each secondary key (in InnoDB) contains a copy of the PK; hence,...
    ⚈  A bulky PK leads to bulky secondary indexes, thereby leading to more disk space.
    ⚈  The implied rule above is "keep the PK small". In reality, the that rule only applies if you have two or more secondary keys. At one secondary key, the space is pretty much a wash.
    ⚈  A PK can either be "natural" or "surrogate"...

Natural versus Surrogate PRIMARY KEY

    ⚈  A "natural" PK is a column (or set of columns) that are 'naturally' unique.
    ⚈  A "surrogate" PK is (usually) INT UNSIGNED AUTO_INCREMENT. (The INT can be any of its sizes.) (UNSIGNED gives you double the range for free.)
    ⚈  I find that more than half of tables do have column(s) that could be the "natural" PK.
    ⚈  Some platforms automatically provide a surrogate PK for all tables. This can hurt performance in some cases. And it can waste disk space.

Many-to-many Mapping


    ⚈  An important pattern where "natural" is better than "surrogate" is in a many-to-many mapping table, which has 2 main columns, an those columns together are unique. Hence, that makes a good PK. The opposite ordering makes a good composite index for going the other way. See
many:many

Secondary Keys


    ⚈  Each secondary key is a separate B+Tree containing the column(s) of the secondary key, plus the column(s) of the PK.
    ⚈  I believe dup columns are removed.
    ⚈  Fetching a record using a secondary key involves first drilling down its BTree, then drilling down the BTree that contains the data (and is ordered by the PK).
    ⚈  If all the columns in a SELECT are found in the secondary key (as augmented by the PK), then the key is said to be "covering". EXLAIN show it by saying Using index (not 'Using index condition'). This gives a slight (sometimes 2x) performance boost.
    ⚈  When taking advantage of 'covering', it is best to spell out the columns you need. This lets the reader know that you planned for such.
    ⚈  When you have INDEX(a), INDEX(a,b), toss the former as redundant. (And, in at least one case, it trips up the Optimizer.)

FOREIGN KEYs


    ⚈  FKs are good for maintaining the integrity of links between tables.
    ⚈  FKs cost a small extra effort to do the integrity check.
    ⚈  FKs need an index.

If you have trouble establishing FKs (error 150, 23000, 23001, etc), do one of these:

    ⚈  Disable FKs, CREATE TABLEs, re-enable FKs
    ⚈  Apply the FKs via ALTER after doing the relevant CREATE TABLEs
    ⚈  (I forget the 3rd.)

One-to-one Mapping / Hierarchy


    ⚈  A "1:many" mapping is easily implemented by having the id for the "one" in the row of the "many".
    ⚈  A table representing a hierarchical tree has columns id and parent_id, plus PRIMARY KEY(id), INDEX(parent_id). The "root" of the tree usually has parent_id=0. See Recursive CTEs for traversing such trees.

PARTITIONing


    ⚈  Partitioning is rarely useful.
    ⚈  When switching between partitioned and not partitioned, all the indexes of the table should be reviewed are likely to be changed
    ⚈  More discussion:
Partitioning

SHOW CREATE TABLE is more descriptive than DESCRIBE


    ⚈  "MUL" does not tell you whether indexes are single-column or composite.
    ⚈  FOREIGN KEYs are ignored.
    ⚈  ENGINE is unavailable
    ⚈  etc, etc.
    ⚈  So, never use DESCRIBE TABLE, only use SHOW CREATE TABLE !

What to INDEX


    ⚈  "But I indexed every column" == Novice who does not understand indexing.
    ⚈  With rare exceptions, you need to develop the SELECTs (and UPDATEs and DELETEs) to see what indexes are needed for a table.
    ⚈  In many cases, "composite" indexes are the key to performance.
    ⚈  For a lot more discussion see:
Index Cookbook

Normalize


    ⚈  Because the textbook says to -- Novice.
    ⚈  To centralize a value that could change -- The goal of the textbook.
    ⚈  To save space, hence improve performance -- The goal of a DBA.
    ⚈  But, do not normalize 'continuous' values (numbers, floats, dates, etc)

Computing row size


This is tricky. There are many hidden fields in a record in an InnoDB table and block and record. This will give you an approximation.

    1.  For each numeric column, add the number of bytes given above.
    2.  For each string column, add an estimate of the average length of the string. (This will go awry for big text/blob columns that end up "off-record")
    3.  Multiple by 3. This very crudely accounts for the overhead, padding, waste, indexes, etc.

Another way is to

    1.  CREATE TABLE ...
    2.  Populate it will many rows. How many rows? This gets tricky because the table will grow is spurts.
    3.  SHOW TABLE STATUS and look at Avg_row_length.

Postlog

man page on data size

Original posting: Aug, 2019 (after many years of assimilating ideas)


MySQL Documents by Rick James

HowTo Techniques for Optimizing Tough Tasks:

Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)

Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
    Chunking lengthy DELETE/UPDATE/etc.

Data Warehouse techniques:
    Overview   Summary Tables   High speed ingestion  

Entity-Attribute-Value -- a common, poorly performing, design pattern (EAV); plus an alternative

5 methods for 'Find Nearest'

Find the nearest 10 pizza parlors -- efficient searching on Latitude + Longitude (another PARITION use)
    Lat/Long representation choices

Z-Order 'find nearest'(under construction)

Pagination, not with OFFSET, LIMIT

Techniques on efficiently finding a random row (On beyond ORDER BY RAND())

GUID/UUID Performance (type 1 only)

IP Range Table Performance -- or other disjoint ranges

Rollup Unique User Counts

Alter of a Huge table -- Mostly obviated by 5.6

Latest 10 news articles -- how to optimize the schema and code for such

Build and execute a "Pivot" SELECT (showing rows as columns)

Find largest row for each group ("groupwise max")

Other Tips, Tuning, Debugging, Optimizations, etc...

Rick's RoTs (Rules of Thumb -- lots of tips)

Datatypes and building a good schema

Memory Allocation (caching, etc)

Character Set and Collation problem solver
    Trouble with UTF-8   If you want case folding, but accent sensitivity, please file a request at http://bugs.mysql.com .
    Python tips,   PHP tips,   other language tips
    utf8 Collations   utf8mb4 Collations on 8.0

Converting from MyISAM to InnoDB -- includes differences between them

Compound INDEXes plus other insights into the mysteries of INDEXing

Cookbook for Creating Indexes
    Many-to-many mapping table   wp_postmeta   UNION+OFFSET

MySQL Limits -- built-in hard limits
    767-byte INDEX limit

Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)

5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users

Analyze MySQL Performance
    Analyze VARIABLEs and GLOBAL STATUS     Analyze SlowLog

My slides from conferences
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
(older ones upon request)

Contact me via LinkedIn; be sure to include a brief teaser in the Invite request:   View Rick James's profile on LinkedIn

Did my articles help you out? Like what you see? Consider donating:

☕️ Buy me a Banana Latte ($4) There is no obligation but it would put a utf8mb4 smiley 🙂 on my face, instead of the Mojibake "🙂"