Galera Caveats

Table of Contents

Best/Required Practices when using Galera for HA

Galera is available in many places

Overview of cross-colo Writing

Write Scaling

AUTO_INCREMENT

InnoDB only

Check after COMMIT

Always have PRIMARY KEY

Transaction "size"

Critical Reads

MyISAM and MEMORY

Replicating GRANTs

ALTERs

Single "Master" Configuration

DBA Tricks

VARIABLES that may need to be different

Miscellany

A note on GTIDs

How many nodes to have in a Cluster

Group Replication

Postlog

Brought to you by Rick James

Best/Required Practices when using Galera for HA


These topics will be discussed in more detail below.

Dear Schema Designer:
    ⚈  InnoDB only, always have PK.

Dear Developer:
    ⚈  Check for errors, even after COMMIT.
    ⚈  Moderate sized transactions.
    ⚈  Don't make assumptions about AUTO_INCREMENT values other than uniqueness. Not monotonicity, not gapless. (This applies for non-Galera, but probably happens more often here.)
    ⚈  Handling of "critical reads" is quite different (arguably better).
    ⚈  Read/Write split is not necessary, but is still advised in case the underlying structure changes in the future.

Dear DBA:
    ⚈  Building the machines is quite different. (Not covered here)
    ⚈  ALTERs are handled differently.
    ⚈  TRIGGERs and EVENTs may need checking.
    ⚈  Tricks in replication (eg, BLACKHOLE) may not work. binlog/replicate-do/ignore have limitations.
    ⚈  Several VARIABLEs need to be set differently.

Galera is available in many places


Galera's High Availability replication is available via
    ⚈  Percona XtraDB Cluster
    ⚈  MariaDB (Standard in 10.1)
    ⚈  Galera installed by yourself on ordinary MySQL, Percona 5.6, or MariaDB 10.0.

Overview of cross-colo Writing


(This overview is valid even for same-datacenter nodes, but the issues of latency vanish.)

Cross-colo latency is 'different' with Galera than with traditional replication, but not necessarily better or worse. The latency simply happens at a very different time.

In 'traditional' replication, these steps occur:
    ⚈  Client talks to Master. If Client and Master are in different colos, this has a latency hit.
    ⚈  Each SQL to Master is another latency hit, including(?) the COMMIT (unless using autocommit).
    ⚈  Replication to Replica(s) is asynchronous, so this does not impact the client writing to the Master.
    ⚈  Since replication is asynchronous, a Client (same or subsequent) cannot be guaranteed to see that data on the Replica. This is a "critical read". The async Replication delay forces apps to take some evasive action.

In Galera-based replication:
    ⚈  Client talks to any Master -- possibly with cross-colo latency. Or you could arrange to have Galera nodes co-located with clients to avoid this latency.
    ⚈  At COMMIT time (or end of statement, in case of autocommit=1), Galera makes one roundtrip to other nodes.
    ⚈  The COMMIT usually succeeds, but could fail if some other node is messing with the same rows. (Galera retries on autocommit failures.)
    ⚈  Failure of the COMMIT is reported to the Client, who should simply replay the SQL statements from the BEGIN.
    ⚈  Later, the whole transaction will be applied (with possibility of conflict) on the other nodes.
    ⚈  Critical Read -- details below

For an N-statement transaction: In a typical 'traditional' replication setup:
    ⚈  0 or about N (N+2, counting BEGIN and COMMIT?) latency hits, depending on whether the Client is co-located with the Master.
    ⚈  Replication latencies and delays lead to issues with "Critical Reads".

In Galera:
    ⚈  0 latency hits (assuming Client is 'near' some node)
    ⚈  1 latency hit for the COMMIT.
    ⚈  0 (usually) for Critical Read (details below)

Bottom line: Depending on where your Clients are, and whether you clump statements into BEGIN...COMMIT transactions, Galera may be faster or slower than traditional replication in a WAN topology.

Write Scaling


With 'traditional' replication, especially if using "Statement Based Replication", all writes to the Master are applied to all Replicas, and they are replayed serially. Even with a multi-master setup, all writes are applied to all Masters. Hence, there is no way to get "write scaling"; that is, no way to increase the number of writes beyond what a single Master can handle.

With Galera, there is a moderate degree of write scaling.
    ⚈  All nodes can perform writes in parallel.
    ⚈  Writes are replicated via "Row Based Replication", which has less impact for single-row operations.
    ⚈  Writes are applied in parallel on the 'replicas', up to the setting wsrep_slave_threads. This is safe because of the way COMMITs work.
    ⚈  Reads in support of the Writes are spread across the Masters because the Writes are.

There is no perfect number for wsrep_slave_threads; it varies with number of CPU cores, client connections, etc. A few dozen is likely to be optimal. Hence, this allows a significant degree of write scaling. On the other hand, if you are I/O-bound, that would be a scaling limitation.

AUTO_INCREMENT


By using wsrep_auto_increment_control = ON, the values of auto_increment_increment and auto_increment_offset will be automatically adjusted as nodes come/go.

If you are building a Galera cluster by starting with one node as a Slave to an existing non-Galera system, and if you have multi-row INSERTs that depend on AUTO_INCREMENTs, then read this
Percona blog

Bottom line: There may be gaps in AUTO_INCREMENT values. Consecutive rows, even on one connection, will not have consecutive ids.

Beware of Proxies that try to implement a "read/write split". In some situations, a reference to LAST_INSERT_ID() will be sent to a "Slave".
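As a sketch (plain Python for illustration, not anything Galera ships): with wsrep_auto_increment_control = ON and 3 nodes, each node gets auto_increment_increment = 3 and a distinct auto_increment_offset, so the ids generated are unique cluster-wide but interleaved:

```python
def node_ids(offset, increment, count):
    """Ids one node generates for successive single-row INSERTs."""
    return [offset + i * increment for i in range(count)]

increment = 3  # cluster currently has 3 nodes
for offset in (1, 2, 3):
    print("node %d: %s" % (offset, node_ids(offset, increment, 4)))
# node 1 gets 1, 4, 7, 10 -- unique, but neither consecutive nor gapless
```

This is why code must not assume monotonic or gapless ids: the gaps are by design.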

Sequence (as replacement for auto_increment)

InnoDB only


For effective replication of data, you must use only InnoDB. This eliminates
    ⚈  FULLTEXT index (until 5.6)
    ⚈  SPATIAL index (until 5.7)
    ⚈  MyISAM's PK as second column

You can use MyISAM and MEMORY for data that does not need to be replicated.

Also, you should use "START TRANSACTION READ ONLY" wherever appropriate.

Check after COMMIT


Check for errors after issuing COMMIT; a "deadlock" can occur due to writes on other node(s): Error 1213, SQLSTATE 40001 (ER_LOCK_DEADLOCK). If you receive this error, restart the failing transaction.

Possible exception (could be useful for legacy code without such checks): Treat the system as single-Master, plus Replicas. By writing only to one node, COMMIT should always succeed(?)

What about autocommit = 1? wsrep_retry_autocommit tells Galera to retry (N times) if a single autocommitted statement fails. So, there is still a (very slim) chance of getting a deadlock on such a statement. The default setting of "1" retry is probably good.
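A minimal retry sketch (Python, with a stand-in exception class -- substitute whatever your driver raises for MySQL error 1213): replay the whole transaction from BEGIN when COMMIT reports a deadlock:

```python
class DeadlockError(Exception):
    """Stand-in for MySQL error 1213 (ER_LOCK_DEADLOCK)."""

def run_transaction(txn_body, max_retries=3):
    """Call txn_body (which issues BEGIN ... COMMIT); replay it on deadlock."""
    for attempt in range(max_retries):
        try:
            return txn_body()
        except DeadlockError:
            if attempt == max_retries - 1:
                raise  # give up after max_retries attempts
```

Note that txn_body must be safe to replay from the top, which is the point of the "design your code so you can redo the SQL statements" advice below.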

Always have PRIMARY KEY


"Row Based Replication" will be used; this requires a PK on every table.

A non-replicated table (eg, MyISAM) does not have to have a PK.

Transaction "size"


(This section assumes you have Galera nodes in geographically separate datacenters.) Because of some of the issues discussed, it is wise to group your write statements into moderate sized BEGIN...COMMIT transactions. There is one latency hit per COMMIT or autocommit. So, combining statements will decrease those hits. On the other hand, it is unwise (for other reasons) to make huge transactions, such as inserting/modifying millions of rows in a single transaction.

To deal with failure on COMMIT, design your code so you can redo the SQL statements in the transaction without messing up other data. For example, move "normalization" statements out of the main transaction; there is arguably no compelling reason to roll them back if the main code rolls back.

In any case, doing what is "right" for the business logic overrides other considerations.
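A chunking sketch (Python; `execute` is a placeholder for your driver's statement call, and the INSERT text is illustrative): group rows into moderate-sized transactions so there is one COMMIT latency hit per chunk rather than per row, without creating one huge transaction:

```python
def bulk_insert(execute, rows, chunk_size=1000):
    """One moderate-sized transaction per chunk of rows."""
    for i in range(0, len(rows), chunk_size):
        execute("BEGIN")
        for row in rows[i:i + chunk_size]:
            execute("INSERT INTO t VALUES (...)", row)  # placeholder SQL
        execute("COMMIT")
```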

Galera's effective transaction isolation is between SERIALIZABLE and REPEATABLE READ; the tx_isolation variable is ignored.

Set wsrep_log_conflicts to get errors put in the regular MySQL mysqld.err.

XA transactions cannot be supported. (Galera is already doing a form of XA in order to do its thing.)

This discussion applies to a very large LOAD DATA.

Critical Reads


Here is a 'simple' (but not 'free') way to assure that a read-after-write, even from a different connection, will see the updated data.
   SET SESSION wsrep_sync_wait = 1;
   SELECT ...
   SET SESSION wsrep_sync_wait = 0;
For non-SELECT statements, set a different bit; wsrep_sync_wait is a bitmask covering the various statement types. (TBD: Would 0xffff always work?) (Before Galera 3.6 / MySQL 5.6.20, it was wsrep_causal_reads = ON.)
Percona Doc for wsrep_sync_wait
Galera doc for wsrep_sync_wait

This setting stalls the SELECT until all current updates have been applied to the node. That is sufficient to guarantee that a previous write will be visible. The time cost is usually zero. However, a large UPDATE could lead to a delay. Because of RBR and parallel application, delays are likely to be less than on traditional replication. Zaitsev's blog

It may be more practical (for a web app) to simply set wsrep_sync_wait right after connecting. But that might slow down non-critical reads.
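For the 'transient' approach, a small helper keeps the slow-down confined to the reads that need it. A sketch (Python; `execute` stands for your connection's statement call):

```python
from contextlib import contextmanager

@contextmanager
def critical_read(execute):
    """Make reads inside the block wait for all pending cluster writes."""
    execute("SET SESSION wsrep_sync_wait = 1")
    try:
        yield
    finally:
        execute("SET SESSION wsrep_sync_wait = 0")
```

Usage would look like: `with critical_read(cursor.execute): cursor.execute("SELECT ...")`, so non-critical reads elsewhere keep full speed.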

MyISAM and MEMORY


As said above, use InnoDB only. However, here is more info on the MyISAM issues.

MyISAM and MEMORY tables are not replicated.

Having MyISAM not replicated can be a big benefit -- You can CREATE TEMPORARY TABLE ... ENGINE=MyISAM and have it exist on only one node. RBR assures that any data transferred from that temp table into a 'real' table can still be replicated.

Replicating GRANTs


GRANTs and related operations act on the MyISAM tables in the database mysql. GRANT and CREATE USER statements will be replicated (as statements), but direct writes such as INSERT INTO mysql.user ... will not be replicated, because the underlying table is MyISAM.

ALTERs


Many DDL changes on Galera can be achieved without downtime, even if they take a long time.

RSU vs TOI:
    ⚈  Rolling Schema Upgrade = manually do one node (offline) at a time
    ⚈  Total Order Isolation = Galera synchronizes so that it is done at same time (in the replication sequence) on all nodes.
RSU and TOI

Caution: Since there is no way to synchronize the clients with the DDL, you must make sure that the clients are happy with either the old or the new schema. Otherwise, you will probably need to take down the entire cluster while simultaneously switching over both the schema and the client code.

A "fast" DDL may as well be done via TOI. This is a tentative list of such:
    ⚈  CREATE/DROP/RENAME DATABASE/TABLE
    ⚈  ALTER to change DEFAULT
    ⚈  ALTER to change definition of ENUM or SET (see caveats in manual)
    ⚈  Certain PARTITION ALTERs that are fast.
    ⚈  DROP INDEX (other than PRIMARY KEY)
    ⚈  ADD INDEX?
    ⚈  Other ALTERs on 'small' tables.
    ⚈  With 5.6 and especially 5.7 having a lot of ALTER ALGORITHM=INPLACE cases, check which ALTERs should be done which way.

Otherwise, use RSU. Do the following separately for each node:
   SET GLOBAL wsrep_OSU_method='RSU';  -- This also takes the node out of the cluster.
   ALTER TABLE ...
   SET GLOBAL wsrep_OSU_method='TOI';  -- Puts back in, leading to resync (hopefully a quick IST, not a slow SST)

More discussion of RSU procedures
More discussion on TOI

Single "Master" Configuration


You can 'simulate' Master + Replicas by having clients write only to one node.
    ⚈  No need to check for errors after COMMIT.
    ⚈  Lose the latency benefits.
    ⚈  Lose some of the write scalability.

DBA Tricks


    ⚈  Remove node from cluster; back it up; put it back in. Syncup is automatic.
    ⚈  Remove node from cluster; use it for testing, etc; put it back in. Syncup is automatic.
    ⚈  Rolling hardware/software upgrade: Remove; upgrade; put back in. Repeat.

VARIABLES that may need to be different


    ⚈  auto_increment_increment - If you are writing to multiple nodes, and you use AUTO_INCREMENT, then auto_increment_increment will automatically be equal to the current number of nodes.
    ⚈  binlog-do/ignore-db - Do not use.
    ⚈  binlog_format - ROW is required for Galera.
    ⚈  innodb_autoinc_lock_mode - 2
    ⚈  innodb_doublewrite - ON: when an IST occurs, you want no torn pages. (With FusionIO or other drives that guarantee atomic writes, OFF is better.)
    ⚈  innodb_flush_log_at_trx_commit - 2 or 0. IST or SST will recover from loss.
    ⚈  query_cache_size - 0
    ⚈  query_cache_type - 0: The query cache needs to be disabled for MariaDB Galera cluster versions prior to "5.5.40-galera", "10.0.14-galera" and "10.1.2".
    ⚈  wsrep_auto_increment_control - Normally want ON
    ⚈  wsrep_on - ON
    ⚈  wsrep_provider_options - Various settings may need tuning if you are using a WAN.
    ⚈  wsrep_slave_threads - use for parallel replication
    ⚈  wsrep_sync_wait (previously wsrep_causal_reads) - used transiently to deal with "critical reads".
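A my.cnf sketch pulling the above together (values are illustrative, not a recommendation; tune for your version, topology, and hardware):

```ini
[mysqld]
binlog_format                  = ROW
innodb_autoinc_lock_mode       = 2
innodb_flush_log_at_trx_commit = 2
query_cache_size               = 0
query_cache_type               = 0
wsrep_on                       = ON
wsrep_auto_increment_control   = ON
wsrep_slave_threads            = 16   # tune; "a few dozen" may be optimal
```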

Miscellany


Galera is not available on Windows or Mac.

FOREIGN KEYs were originally buggy.

LOAD DATA is auto-chunked. That is, it is passed to other nodes piecemeal, not all at once.

Known issues with Galera

DROP USER may not replicate?

A slight difference in ROLLBACK on conflict: InnoDB rolls back the smaller transaction; Galera rolls back the last one.

Slide Deck for Galera

SET GLOBAL wsrep_debug = 1; leads to a lot of debug info in the error log.

Large UPDATEs / DELETEs should be broken up. This admonition is valid for all databases, but there are additional issues in Galera.

WAN: May need to increase (from the defaults) wsrep_provider_options = evs...

Slide show

log_output = TABLE is not supported -- use FILE

Do not use LOCK TABLE, etc.

Do not use GET_LOCK(), etc.

A note on GTIDs


5.6 and Galera each use GTIDs (Global Transaction IDs). But they are different.

A 5.6 GTID is the combination of a server UUID and a sequence number.

A Galera GTID is the combination of a cluster UUID and a sequence number.

Do not set character_set_server to UTF-16, UTF-32 or UCS-2. (Those are virtually useless for any purpose.)

How many nodes to have in a Cluster


If all the servers are in the same 'vulnerability zone' -- eg, rack or data center -- have an odd number (at least 3) of nodes.

When spanning colos, you need 3 (or more) data centers in order to be 'always' up, even during a colo failure. With only 2 data centers, Galera can automatically recover from one colo outage, but not the other. (You pick which.)

If you use 3 or 4 colos, these number of nodes per colo are safe:
    ⚈  3 nodes: 1+1+1 (1 node in each of 3 colos)
    ⚈  4 nodes: 1+1+1+1 (4 unweighted nodes won't work in 3 colos)
    ⚈  5 nodes: 2+2+1, 2+1+1+1 (5 nodes spread 'evenly' across the colos)
    ⚈  6 nodes: 2+2+2, 2+2+1+1
    ⚈  7 nodes: 3+2+2, 3+3+1, 2+2+2+1, 3+2+1+1
There may be a way to "weight" the nodes differently; that would allow a few more configurations. With "weighting", give each colo the same weight; then subdivide the weight within each colo evenly. Four nodes in 3 colos: (1/6 + 1/6) + 1/3 + 1/3. That way, any single colo failure cannot lead to "split brain".

Contrary to popular belief, 4 nodes (when suitably weighted) is not at risk of a 'single point of failure'.
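The quorum arithmetic above can be sketched in a few lines (plain Python, mirroring Galera's rule that the surviving nodes must hold a strict majority of the total weight):

```python
def survives_any_colo_failure(colo_weights):
    """colo_weights: one list of node weights per colo.
    True if losing any single colo still leaves a strict majority
    (> 50%) of the total weight, i.e. the cluster keeps quorum."""
    total = sum(sum(colo) for colo in colo_weights)
    return all(total - sum(colo) > total / 2 for colo in colo_weights)

# 4 nodes in 3 colos, weighted (1/6+1/6) + 1/3 + 1/3, scaled to integers:
print(survives_any_colo_failure([[1, 1], [2], [2]]))  # True
# The same 4 nodes unweighted (1+1, 1, 1): losing the 2-node colo
# leaves exactly half the weight -- no quorum:
print(survives_any_colo_failure([[1, 1], [1], [1]]))  # False
```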

Group Replication


Oracle's "Group Replication" (GA in Dec 2016) is a serious competitor to Galera. I don't yet have a comparable review of it, but here are two links that look promising:

Comparison of Galera and Group Replication
Another Comparison of Galera and Group Replication

(Oracle's "Fabric" has admitted defeat.)

Postlog


Posted 2013;   VARIABLES: 2015;   Refreshed: Feb. 2016;   Minor Refresh: Sep. 2017;   Minor Refresh: June 2018;   Update links: Jan 2019;  

MariaDB's page on Limitations
FromDual's page on Limitations
SeveralNines's page on Limitations

Some clarifications

Sequence (as replacement for auto_increment)

-- Rick James
