Best Practices for Datatypes in a MySQL Schema

Table of Contents

InnoDB

INT Datatypes

FLOAT and DOUBLE Datatypes

DECIMAL Datatypes

String Datatypes

VARCHAR vs CHAR vs TEXT

DATE and TIME

ENUM Datatype

ENUM vs TINYINT vs VARCHAR

SET Datatype

Other Datatypes

NULL vs. NOT NULL

Empty vs. NULL

DEFAULT

GENERATED columns

Functional Indexes

PRIMARY KEY

Natural versus Surrogate PRIMARY KEY

Many-to-many Mapping

Secondary Keys

FOREIGN KEYs

One-to-one Mapping / Hierarchy

PARTITIONing

SHOW CREATE TABLE is more descriptive than DESCRIBE

What to INDEX

Loading Data

Normalize

Computing row size

Single/double quotes, backtics, etc

PHP functions

Naming

Images

Final note

Postlog

Brought to you by Rick James


UNDER CONSTRUCTION (but usable)

InnoDB

This blog mostly assumes you are using InnoDB, which is the preferred Engine.

When initially creating a schema, especially one for a large table, be sure to minimize the size of each column. Changing a BIGINT to INT will save 4 gigabytes in a billion-row table. (Or more, if it is used in an index.)

The goal of this blog is to help you pick the best datatypes, and to give you tips related to the choices.

INT Datatypes

Use the smallest type practical. Use UNSIGNED where appropriate, which is most of the time.
    Datatype bytes  SIGNED         UNSIGNED
    -------- -----  -----------    ----------
    BIGINT       8  ~+/-9e18       0..~18e18
    INT          4  ~+/-2 billion  0..~4 billion
    MEDIUMINT    3  ~+/-8 million  0..~16 million
    SMALLINT     2  -32768..+32767 0..65536
    TINYINT      1  -128..+127     0..255

    ⚈  MEDIUMINT(5) is 3 bytes; the '5' means nothing except...
    ⚈  zipcode MEDIUMINT(5) ZEROFILL would give you leading zeros for US zipcodes. (Ditto for FIPS)
    ⚈  Note: MySQL 8.0.19 has removed the spurious length field except when using ZEROFILL.

FLOAT and DOUBLE Datatypes


    ⚈  FLOAT has about 7 significant digits and a range of about 10^-38 to 10^+38.
    ⚈  DOUBLE has about 16 significant digits and a range of about 10^-308 to 10^+308.
    ⚈  Do not use (m,n) on FLOAT or DOUBLE; it only leads to an extra rounding between binary and decimal. If you are tempted to use such, then switch to the DECIMAL datatype.
    ⚈  When displaying, consider using ROUND() or FORMAT() for limiting the number of decimal places. FORMAT also adds "thousands separators" and optionally understands 'locale' differences such as . versus ,.
    ⚈  Do not use FLOAT or DOUBLE for money; see DECIMAL.

DECIMAL Datatypes


    ⚈  Use DECIMAL for monetary values.
    ⚈  DECIMAL(m,n) occupies approximately m/2 bytes.
    ⚈  DECIMAL is always SIGNED.
    ⚈  There is a limit of (m-n) digits to the left of the decimal point. Hence DECIMAL(8,2) allows up to a million dollars or Euros, to the cent.

String Datatypes


    ⚈  VARCHAR is preferred over TINYTEXT/TEXT. There is no practical use of TINYTEXT instead of an equivalent VARCHAR.
    ⚈  LONGTEXT is likely to run into various non-string limits (packet_size, etc) long before you can get 4 billion bytes.
    ⚈  CHARACTER SET utf8 takes up to 3 bytes per character for VARCHAR and TEXT. And utf8mb4 takes up to 4 bytes per character.
    ⚈  For some ROW_FORMATs in InnoDB, CHAR(10) CHARACTER SET utf8 will occupy only 10 bytes. But don't depend on such.
    ⚈  Don't blindly say VARCHAR(255)`; use a reasonable limit. (Else tmp tables, index limits, etc may have issues.)
    ⚈  See "Other Datatypes" below for country_code, IP addresses, etc.
    ⚈  Do not put strictly numeric values into VARCHAR; WHERE clauses may suffer. In particular WHERE varchar_col = 1234 will convert each row's varchar_col to numeric to perform the comparison (slow!)
    ⚈  Use CHARACTER SET ascii (or latin1) for hex, various codes, etc, where no accents are ever possible. (Collate _general_ci in most cases)
    ⚈  Use CHARACTER SET ascii COLLATE ascii_bin for Base64 since different cases mean different things. (BINARY would also work.)
    ⚈  Use CHARACTER SET utf8mb4 for general text; this provides full international support. In particular, utf8 is incomplete for Emoji and Chinese.
Character storage
VARHAR(255)
Suggested character set for non utf8 columns in mysql

VARCHAR vs CHAR vs TEXT


    ⚈  In almost all cases, VARCHAR is preferrable to CHAR.
    ⚈  TEXT (of various sizes) is necessary when you get beyond VARCHAR's limit.
    ⚈  Use CHAR for truly fixed-length columns.
    ⚈  Almost always fixed-length columns are ascii, so use CHARACTER SET ascii (or latin1).
    ⚈  If you will be storing text in multiple languages, be sure to have CHARACTER SET utf8mb4. (utf8 is a compromise in older versions.)
    ⚈  The number in CHAR and VARCHAR is 'characters', not 'bytes'. The number in variations of TEXT is 'bytes'.

VARCHAR vs TEXT

Examples:

    ⚈  country_code CHAR(2) -- If you want access to the full country name, have a Countries table with (at least) country_code (the PK) and name.
    ⚈  postal_code CHAR(6) -- this works for Britain, but probably not for the world in general
    ⚈  uuid, guid, md5, sha256, etc -- because they are hex, plus -
    ⚈  ipv4, ipv6 -- decimal or hex, plus limited puncutuation.
    ⚈  phone numbers -- In US, you can use CHAR, but for international, use VARCHAR.

DATE and TIME


    ⚈  Date_format(Now(), '%Y-%m-%d') is the same as CURRENT_DATE().
    ⚈  Don't split a DATETIME into two columns (a DATE and a TIME) -- it is usually easier to pick apart a DATETIME than to manipulate two columns
    ⚈  DATETIME is like a picture of the clock on your wall (in your timezone); TIMESTAMP is an instant in time.
    ⚈  TIMESTAMP is stored in UTC, and converted from local time when storing, and converted to localtime when fetching.
    ⚈  If you need more control over timezones, you are on your own.
    ⚈  (n) can be added to DATETIME/TIME/TIMESTAMP to get "fractional seconds" as low as microseconds (6).
    ⚈  Do not use any time-oriented column as a UNIQUE (or PRIMARY) key unless you have business logic to confirm that two rows with identical values cannot coexist. A possible example: A sensor that regurlary reports a reading every 10 seconds.
    ⚈  If I insert noon in a DATETIME, it will say noon for you wherever you are. But inserting noon into a TIMESTAMP will show up differently on the other side of the world. Which effect do you want?
    ⚈  DATE and DATETIME literals should be strings. You do not need to cast a string to a DATE (etc). That is, do not do CAST('2016-07-14' AS DATE) or DATE('2016-07-14'). More discussion:
stackoverflow
    ⚈  When expressing a date range, don't worry about leap years, number of days in a month, etc; instead express it thus:
    AND `Date` >= '2016-02-24'
    AND `Date`  < '2016-02-24' + INTERVAL 1 WEEK
   * Do not store dates or datetimes in `VARCHAR`.  Instead, convert to the suitable MySQL datatype before storing.  (There are rare exceptions.)

ENUM Datatype


There are religious battles waging over whether to ever use ENUM. Here are some tips if you choose to use EHUM:

    ⚈  ENUMs are stored internally as a 1- or 2-byte number.
    ⚈  Advantage: You use strings instead of numbers; this makes code more readable.
    ⚈  Up to 255 options: an ENUM is only one byte. Else: 2 bytes.
    ⚈  Typically it is unwise to have an ENUM "nullable". Instead, make the first option something like 'unknown' or 'N/A'.
    ⚈  Don't have more than a very few options in an ENUM.
    ⚈  If you do need to add an option to an ENUM, the ALTER is essentially free if you add the new option on the end, and don't exceed 255 options.
    ⚈  Due to low cardinality, it is rarely useful to index an ENUM by itself. However, it may be quite useful to use it in a 'composite' (multi-column) index.
    ⚈  Example of an ENUM with a lot of optiona: rating ENUM('unrated', 'G', 'PG', 'PG-13', 'R', 'NC-17') DEFAULT 'unrated'

ENUM vs TINYINT vs VARCHAR


Suppose you have
CREATE TABLE ( ...
    col_enum ENUM('yes', 'no', 'maybe'),
    col_tiny TINYINT,     -- 0='yes', 1='no', 2='maybe'
    col_char VARCHAR(5),  -- 'yes', 'no', 'maybe'
    ,,,
What are the pros/cons of those 2 choices?

    ⚈  col_char is the bulkiest;
    ⚈  col_tiny gives no clue of "meaning";
    ⚈  col_enum cannot be extended to more values without doing an ALTER

You pick what you like. There is no solid winner.

SET Datatype


    ⚈  Rarely used.
    ⚈  Think of it as a multi-valued ENUM.
    ⚈  A SET cannot have more than 64 options.
    ⚈  It occupies CEIL(n/8) bytes, where n is the number of options.
    ⚈  It may be easier to use some size of INT instead of a SET, especially if you do boolean arithmetic.
    ⚈  It is not useful to index a SET column.
    ⚈  Example (refering to movies): available_features SET('Trailers', 'Commentaries', 'Deleted Scenes', 'Behind the Scenes')
    ⚈  See the function FIND_IN_SET().

Other Datatypes


    ⚈  Latitude/Longitude: DOUBLE is grossly excessive. See this for other options:
Representations_choices
    ⚈  IPv4 is no bigger than 123.123.123.123 -- VARCHAR(15) CHARACTER SET ascii, which can be converted to BINARY(4) via inet_aton(). The former can be compared only for [in]equality; the latter can be used for ranges.
    ⚈  IPv6 is no bigger than ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff -- VARCHAR(39) CHARACTER SET ascii, which can be converted to BINARY(16) via inet6_aton(). The former can be compared only for [in]equality; the latter can also be used for ranges.
    ⚈  Money: See DECIMAL, but be aware that different countries need different number of decimal places; 4 might be the current max. As for leading digits -- that is rather large, and growing daily.
    ⚈  Zipcode -- US zip-5 could be stored in CHAR(5) CHARACTER SET ascii or MEDIUMINT(5) UNSIGNED ZEROFILL. (Note the very rare usage of ZEROFILL.) Zip-9, Postal Codes, etc, probably deserve VARCHAR(n) CHARACTER SET ascii with a suitable n.
    ⚈  GUID/UUID -- These need CHAR(36) CHARACTER SET ascii. Better yet, convert to BINARY(16) with functions available in MySQL 8.0 or found in UUID functions
    ⚈  MD5/SHA1/etc -- These are like GUIDs/UUIDs, but need different sizes and can be converted via HEX() and UNHEX().
    ⚈  JSON -- MySQL and MariaDB vary on this.
    ⚈  JSON -- This is probably desirable: json_encode($str, JSON_UNESCAPED_UNICODE); it avoids \U+1234 codes instead of UTF-8 characters.
    ⚈  Address -- This is often broken into street_address, city, country_code, postal_code. But that is a bother. Simply have a string that containes all of those components if the only use is to print an address label. If you need to filter by postal_code or country_code, then do have them as separate columns.

NULL vs. NOT NULL


    ⚈  Unfortunately, NULL is the default.
    ⚈  If NULLable takes extra space, it is probably minuscule compared to the rest of the space taken. So, I don't make the argument on space.
    ⚈  A PRIMARY KEY cannot contain any nullable columns.
    ⚈  A UNIQUE key can contain a nullable column.
    ⚈  Use NOT NULL unless there you need an 'out of band' NULL value.

Some uses for NULL:
    ⚈  Don't have the value yet.
    ⚈  "Not applicable"
    ⚈  Value (or reference to other table) has been removed
    ⚈  "Unknown"

Empty vs. NULL


one discussion

DEFAULT


(except AUTO_INCREMENT, TIMESTAMP...)

Changes to TIMESTAMP in 5.x

sql_mode and "zero" dates

GENERATED columns


If PERSIST, then they take space.

Indexing issues

MySQL vs. MariaDB

Functional Indexes


Instead of creating a generated column so that you can index it, 8.0.18 allows you to index an expression. Example: INDEX(REVERSE(str)) ((rather lame))

PRIMARY KEY


    ⚈  Always explicitly provide one. While not an absolute requirement, it signals that you have thought about what the PK should be.
    ⚈  Each secondary key (in InnoDB) contains a copy of the PK; hence,...
    ⚈  A bulky PK leads to bulky secondary indexes, thereby leading to more disk space.
    ⚈  The implied rule above is "keep the PK small". In reality, the that rule only applies if you have two or more secondary keys. At one secondary key, the space is pretty much a wash.
    ⚈  A PK can either be "natural" or "surrogate"...

Natural versus Surrogate PRIMARY KEY

    ⚈  A "natural" PK is a column (or set of columns) that are 'naturally' unique.
    ⚈  A "surrogate" PK is (usually) INT UNSIGNED AUTO_INCREMENT. (The INT can be any of its sizes.) (UNSIGNED gives you double the range for free.)
    ⚈  I find that more than half of tables do have column(s) that could be the "natural" PK.
    ⚈  Some platforms automatically provide a surrogate PK for all tables. This can hurt performance in some cases. And it can waste disk space.

More PK tips, wives' tales, etc.

Many-to-many Mapping


    ⚈  An important pattern where "natural" is better than "surrogate" is in a many-to-many mapping table, which has 2 main columns, an those columns together are unique. Hence, that makes a good PK. The opposite ordering makes a good composite index for going the other way. See
many:many

Secondary Keys


    ⚈  Each secondary key is a separate B+Tree containing the column(s) of the secondary key, plus the column(s) of the PK.
    ⚈  I believe dup columns are removed.
    ⚈  Fetching a record using a secondary key involves first drilling down its BTree, then drilling down the BTree that contains the data (and is ordered by the PK).
    ⚈  If all the columns in a SELECT are found in the secondary key (as augmented by the PK), then the key is said to be "covering". EXLAIN show it by saying Using index (not 'Using index condition'). This gives a slight (sometimes 2x) performance boost.
    ⚈  When taking advantage of 'covering', it is best to spell out the columns you need. This lets the reader know that you planned for such.
    ⚈  When you have INDEX(a), INDEX(a,b), toss the former as redundant. (And, in at least one case, having both trips up the Optimizer.)

FOREIGN KEYs


    ⚈  FKs are good for maintaining the integrity of links between tables.
    ⚈  FKs cost a small extra effort to do the integrity check.
    ⚈  FKs need an index. Creating the FK will create the index if a suitable one does note exist already.

If you have trouble establishing FKs (error 150, 23000, 23001, etc), do one of these:

    ⚈  Disable FKs, CREATE TABLEs, re-enable FKs
    ⚈  Apply the FKs via ALTER after doing the relevant CREATE TABLEs
    ⚈  Change the order of the CREATE TABLEs

One-to-one Mapping / Hierarchy


    ⚈  A "one:one" mapping is usually a poor schema design. Join the tables together. (There are rare cases where performance, optional columns, etc, may lead to having two table in 1:1 relation.)
    ⚈  A "one:many" mapping is easily implemented by having the id for the "one" in the row of the "many".
    ⚈  A table representing a hierarchical tree has columns id and parent_id, plus PRIMARY KEY(id), INDEX(parent_id). The "root" of the tree usually has parent_id=0.
    ⚈  In MySQL 8.0 / MariaDB 10.2, see Recursive CTEs for traversing such trees.

PARTITIONing


    ⚈  Partitioning is rarely useful.
    ⚈  When switching between partitioned and not partitioned, all the indexes of the table should be reviewed are likely to be changed
    ⚈  More discussion:
Partitioning

SHOW CREATE TABLE is more descriptive than DESCRIBE


In DESC or DESCRIBE:

    ⚈  "MUL" does not tell you whether indexes are single-column or composite.
    ⚈  FOREIGN KEYs are ignored.
    ⚈  ENGINE is unavailable
    ⚈  etc, etc.
    ⚈  So, never use DESCRIBE TABLE, only use SHOW CREATE TABLE !

What to INDEX


    ⚈  "But I indexed every column" == Novice who does not understand indexing.
    ⚈  With rare exceptions, you need to develop the SELECTs (and UPDATEs and DELETEs) to see what indexes are needed for a table.
    ⚈  In many cases, "composite" indexes are the key to performance.
    ⚈  For a lot more discussion see:
Index Cookbook

Loading Data


https://stackoverflow.com/a/75559448/1766831 That talks about LOAD DATA, replacing the table, batch INSERTs, etc.

Normalize


    ⚈  Because the textbook says to -- Novice.
    ⚈  To centralize a value that could change -- The goal of the textbook.
    ⚈  To save space, hence improve performance -- The goal of a DBA.
    ⚈  But, do not normalize 'continuous' values (numbers, floats, dates, etc)

Computing row size


This is tricky. There are many hidden fields in a record in an InnoDB table and block and record. This will give you an approximation.

    1.  For each numeric column, add the number of bytes given above.
    2.  For each string column, add an estimate of the average length of the string. (This will go awry for big text/blob columns that end up "off-record")
    3.  Multiple by 3. This very crudely accounts for the overhead, padding, waste, indexes, etc.

Another way is to

    1.  CREATE TABLE ...
    2.  Populate it will many rows. How many rows? This gets tricky because the table will grow is spurts.
    3.  SHOW TABLE STATUS and look at Avg_row_length.

Single/double quotes, backtics, etc


    ⚈  Backtics must be used when the table/column name is also a reserved word. They may be left off otherwise. However, you run the real risk that in the next version of MySQL one of your variable names will have become a reserved word.
    ⚈  You may use backtics around any/all "names". Note that that is what mysqldump does. Exception: There is an SQL mode that allows double quotes to be used instead of backtics. (Very few users set that mode.)
    ⚈  Strings must have quotes (single or double) around them
    ⚈  Numbers may be quoted.
    ⚈  string_col = number should have quotes around the number to allow for use of an index.
    ⚈  When using "prepared" statements, quotes are supplied for you, so you must not quote the placeholders.
    ⚈  In many situations, the use of "prepared statements" or "binding" should definitely be used for security reasons. (cf SQL Injection and other hacks). Also, it provides both the escaping and quoting that is required. Using prepared statements allows you to insert your strings exactly as they are.

PHP functions


    ⚈  htmlspecialcharacters() for <input ...> things
    ⚈  urlencode() for pieces of urls
    ⚈  real_escape_string() for MySQL
    ⚈  addslashes() is mostly as good as real_escape_string
    ⚈  htmlentities() for things like "<" that might 'disappear' in HTML
    ⚈  Usually this option is advisable: json_encode($str, JSON_UNESCAPED_UNICODE)
    ⚈  "interpolation" (using $var inside quoted string) is more concise, but equivalent to concatenation. (OK, I am giving an opinion here.)
    ⚈  "interpolation" works only with double-quotes. So this won't work right: '... WHERE col = "$var"'

Naming


These are guidelines, not hard-and-fast rules.

    ⚈  By convention use the name id for a surrogate PK: id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY
    ⚈  Alternatively (to previous bullet), prefix the id with the table name: userid or user_id, company_id, etc
    ⚈  But don't prefix all the columns of a table with the table name. It clutters the queries and provides no useful information.
    ⚈  When JOINing, use an alias to qualify which table a column comes from. (Prefixing the column name cannot be trusted.)
    ⚈  When JOINing, it is usually best to give each table a short alias. This is a compromise between the clutter and the specification, above.
    ⚈  An alias might be the initial letter or the first letter of each 'word' in the name.
    ⚈  Don't blindly use LEFT JOIN everywhere. If the WHERE clause requires a specific value from the table, then it is really a JOIN (aka INNER JOIN).

Images


For images (etc) that will be used on a web page, I prefer to store the image in a file and store the URL to that file in the database. Then, in HTML, use that path in That makes loading the page arguably simpler and faster. (HTML handles opening a separate connection to download the image later in the processing.)

I also like to add height and width to the image tag so that the screen does not bounce around before it downloads the image. If you are linking off to, say, a pdf reader, the dimensions are not needed:


discussion

It is possible to store an image in a BLOB, but it is messy and probably slower than the alternative.

Tiny images could be included inline (PHP example):
$b64 = chunk_split(base64_encode($jpg));
$img = "";

Final note


The real test of a schema comes when you try to write the SELECTs, UPDATEs, etc for that schema. This is especially true for indexing -- you can't know the optimal indexes without knowing the queries.

Postlog

man page on data size

Examples

Datatype recommendations

Currency datatype

An SO Answer

Original posting: Aug, 2019 (after many years of assimilating ideas);   Minor refresh: Oct, 2021

Buffer_pool_size; UNIQUE indexes

Dealing with "Row size too large"

Getting started on table design


MySQL Documents by Rick James

HowTo Techniques for Optimizing Tough Tasks:

Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)

Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
    Chunking lengthy DELETE/UPDATE/etc.

Data Warehouse techniques:
    Data Warehouse Overview   Summary Tables   High speed ingestion   Bulk Normalization  

Schema and code design for large Sensor database

Entity-Attribute-Value (EAV) -- a common, poorly performing, design pattern; plus an alternative

5 methods for 'Find Nearest'

Lat/Lng search to Find the nearest 10 pizza parlors
    Lat/Long representation choices

Z-Order 'find nearest'

Pagination, not with OFFSET, LIMIT

Techniques on efficiently finding a random row (On beyond ORDER BY RAND())

GUID/UUID Performance (type 1 only)

IP Range Table Performance -- or other disjoint ranges

Rollup Unique User Counts

Alter of a Huge table -- Mostly obviated by 5.6

Efficient List of Latest 10 news articles

Build and execute a Pivot SELECT (showing rows as columns)

(Groupwise Max): Efficiently find largest row(s) for each group

Other Tips, Tuning, Debugging, Optimizations, etc...

Rick's RoTs (Rules of Thumb -- lots of tips)

Datatypes and building a good schema

Memory Allocation (caching, etc)

Character Set and Collation problem solver
    Trouble with UTF-8   If you want case folding, but accent sensitivity, please file a request at https://bugs.mysql.com .
    Python tips,   PHP tips,   other language tips
    utf8 Collations   utf8mb4 Collations on 8.0

Converting from MyISAM to InnoDB -- includes differences between them

Compound INDEXes plus other insights into the mysteries of INDEXing

Cookbook for Creating Indexes
    Many-to-many mapping table   Handler counts   wp_postmeta   UNION+OFFSET

MySQL Limits -- built-in hard limits
    767-byte INDEX limit

Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)

5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users

Analyze MySQL Performance
    Analyze VARIABLEs and GLOBAL STATUS     Analyze SlowLog

My slides from conferences
MiniFest 2021 - Rick James & Daniel Black - Answering on Stack Overflow(+comments) - MariaDB Frontlines
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB

Contact me via LinkedIn; be sure to include a brief teaser in the Invite request:   View Rick James's profile on LinkedIn

Did my articles help you out? Like what you see? Consider donating:

☕️ Buy me a Banana Latte and bagel ($10) There is no obligation but it would put a utf8mb4 smiley 🙂 on my face, instead of the Mojibake "🙂"