Best/Required Practices when using Galera for HA
The following topics will be discussed in more detail below.
Dear Schema Designer:
⚈ InnoDB only, always have PK.
⚈ Check for errors, even after COMMIT.
⚈ Moderate sized transactions.
⚈ Don't make assumptions about AUTO_INCREMENT values.
⚈ Handling of "critical reads" is quite different (arguably better).
⚈ Read/Write split is not necessary, but is still advised in case the underlying structure changes in the future.
⚈ Building the machines is quite different. (Not covered here)
⚈ ALTERs are handled differently.
⚈ TRIGGERs and EVENTs may need checking.
⚈ Tricks in replication (eg, BLACKHOLE) may not work.
Galera is available in many places
Galera's High Availability replication is available via
⚈ Percona XtraDB Cluster
⚈ Galera installed by yourself on ordinary MySQL
Overview of cross-colo Writing
(This overview applies even when all nodes are in the same colo, but then the latency issues mostly vanish.)
Cross-colo latency is 'different' with Galera -- not necessarily better or worse; the latency simply happens at a different point in the process.
In 'traditional' replication, these steps occur:
⚈ Client talks to Master. If Client and Master are in different colos, this has a latency hit.
⚈ Each SQL statement sent to the Master is another latency hit, including(?) the COMMIT (unless using autocommit).
⚈ Replication to Slave(s) is asynchronous, so this does not impact the client writing to the Master.
⚈ Since replication is asynchronous, a Client (same or subsequent) cannot be guaranteed to see that data on the Slave. This is a "critical read". The async Replication delay forces apps to take some evasive action.
In Galera-based replication:
⚈ Client talks to any Master -- probably cross-colo latency.
⚈ At COMMIT time (or end of statement, in the case of autocommit=1), Galera makes one round trip to the other nodes.
⚈ The COMMIT usually succeeds, but could fail if some other node is messing with the same rows.
⚈ Failure of the COMMIT is reported to the Client, who should simply replay the SQL statements from the BEGIN. (A sketch follows this list.)
⚈ Later, the whole transaction will be applied (with possibility of conflict) on the other nodes.
⚈ Critical Read -- details below
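To make the flow concrete, here is a minimal sketch (the table `accounts` and its columns are hypothetical):

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    COMMIT;   -- the one cross-node round trip happens here; it may fail with
              -- error 1213 (ER_LOCK_DEADLOCK) if another node got a conflicting
              -- write in first, in which case replay everything from the BEGIN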
For an N-statement transaction, in a typical 'traditional' replication setup:
⚈ 0 or N (N+2?) latency hits, depending on whether the Client is 'near' the Master.
⚈ Replication latencies and delays, leading to issues with "Critical Reads".
In Galera-based replication:
⚈ 0 latency hits (assuming the Client is 'near' some node).
⚈ 1 latency hit for the COMMIT.
⚈ 0 (usually) for a Critical Read (details below).
Bottom line: Depending on where your Clients are, and whether you clump statements into BEGIN...COMMIT transactions, Galera may be faster or slower than traditional replication.
With 'traditional' replication, especially if using "Statement Based Replication", all writes to the Master are applied to all Slaves, and they are applied serially. Even with a multi-master setup, all writes are applied to all Masters. Hence, there is no way to get "write scaling"; that is, no way to increase the number of writes beyond what a single Master can handle.
With Galera, there is a moderate degree of write scaling.
⚈ All nodes can perform writes in parallel.
⚈ Writes are replicated via "Row Based Replication", which is decidedly faster for single-row operations.
⚈ Writes are applied in parallel on the 'slaves', up to the setting wsrep_slave_threads. This is safe because of the way COMMITs work.
There is no perfect number for wsrep_slave_threads; it varies with the number of CPU cores, client connections, etc. A few dozen is likely to be optimal. Hence, this allows a significant degree of write scaling. If you are I/O-bound, that would be a scaling limitation.
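For example, one might start with a few dozen applier threads and adjust from there; the value below is illustrative, not a recommendation:

    -- Increase the number of parallel applier threads (illustrative value).
    SET GLOBAL wsrep_slave_threads = 32;
    -- This status variable gives a rough idea of how much parallelism the
    -- workload can actually use.
    SHOW GLOBAL STATUS LIKE 'wsrep_cert_deps_distance';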
By using wsrep_auto_increment_control = ON, the values of auto_increment_increment and auto_increment_offset will be automatically adjusted as nodes come/go.
If you are building a Galera cluster by starting with one node as a Slave to an existing non-Galera system, and if you have multi-row INSERTs that depend on AUTO_INCREMENTs, then read this blog.
Bottom line: There may be gaps in AUTO_INCREMENT values. Consecutive rows, even on one connection, will not have consecutive ids.
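For illustration, on a 3-node cluster with wsrep_auto_increment_control = ON, a node might see settings like these (the exact values depend on cluster size and node order):

    SHOW VARIABLES LIKE 'auto_increment%';
    -- auto_increment_increment = 3   (one slot per node)
    -- auto_increment_offset    = 1   (2 and 3 on the other nodes)
    -- Consecutive INSERTs on this node get ids 1, 4, 7, ...; gaps are normal.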
For effective replication of data, you must use only InnoDB. This eliminates
⚈ FULLTEXT index (until 5.6)
⚈ SPATIAL index
⚈ MyISAM's trick of putting the AUTO_INCREMENT as the second column of the PRIMARY KEY
You can use MyISAM and MEMORY for data that does not need to be replicated.
Also, you should use "START TRANSACTION READ ONLY" wherever appropriate.
Check after COMMIT
Check for errors after issuing COMMIT. A "deadlock" can occur due to writes on other node(s).
Possible exception (could be useful for legacy code without such checks): Treat the system as single-Master, plus Slaves. By writing only to one node, COMMIT should always succeed(?)
What about autocommit = 1? wsrep_retry_autocommit tells Galera to retry a failed "autocommit" statement up to N times. So, there is still a chance (very slim) of getting a deadlock error on such a statement. The default setting of "1" (retry once) is probably good.
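To inspect or change that behavior, something like this (a sketch; check the variable in your build):

    SHOW VARIABLES LIKE 'wsrep_retry_autocommit';
    SET GLOBAL wsrep_retry_autocommit = 1;   -- retry a failed autocommit statement once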
Always have PRIMARY KEY
"Row Based Replication" will be used; this requires a PK on every table.
A non-replicated table (eg, MyISAM) does not have to have a PK.
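A minimal example of a replication-friendly table (the table and column names are hypothetical); the explicit PRIMARY KEY is the important part:

    CREATE TABLE log_entries (
        id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        msg VARCHAR(255) NOT NULL,
        created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;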
(This section assumes you have Galera nodes in multiple colos.) Because of some of the issues discussed, it is wise to group your write statements into moderate sized BEGIN...COMMIT transactions. There is one latency hit per COMMIT or autocommit. So, combining statements will decrease those hits. On the other hand, it is unwise (for other reasons) to make huge transactions, such as inserting/modifying millions of rows in a single transaction.
To deal with failure on COMMIT, design your code so you can redo the SQL statements in the transaction without messing up other data.
In any case, doing what is "right" for the business logic overrides this minor optimization.
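As a sketch of the latency point above: three separate autocommit INSERTs pay three COMMIT-time round trips, while grouping them pays one (table `t` is hypothetical):

    BEGIN;
    INSERT INTO t (a) VALUES (1);
    INSERT INTO t (a) VALUES (2);
    INSERT INTO t (a) VALUES (3);
    COMMIT;   -- one certification round trip instead of three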
XA transactions cannot be supported.
Here is a 'simple' (but not 'free') way to assure that a read-after-write, even from a different connection, will see the updated data.
    SET SESSION wsrep_sync_wait = 1;
    SELECT ...;    -- the critical read goes here
    SET SESSION wsrep_sync_wait = 0;
For non-SELECTs, set a different bit in the first statement; wsrep_sync_wait is a bitmask, and the value 1 covers only reads.
(TBD: Would 0xffff always work?)
(Before Galera 3.6, it was wsrep_causal_reads = ON.)
Doc for wsrep_sync_wait
This setting stalls the SELECT until all current updates have been applied to the node. That is sufficient to guarantee that a previous write will be visible. The time cost is usually zero. However, a large UPDATE could lead to a delay. Because of RBR and parallel application, delays are likely to be less than on traditional replication. Zaitsev's blog
It may be more practical (for a web app) to simply set wsrep_sync_wait right after connecting.
As said above, use InnoDB only. However, here is more info on the MyISAM (and hence FULLTEXT, SPATIAL, etc) issues.
MyISAM is, by default, not replicated. (How to turn on??)
Having MyISAM not replicated can be a big benefit -- You can "CREATE TEMPORARY TABLE ... ENGINE=MyISAM" and have it exist on only one node. RBR assures that any data transferred from that temp table into a 'real' table can still be replicated.
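A sketch of that pattern (the table and column names are hypothetical):

    -- Exists only on this node; the MyISAM temp table is not replicated.
    CREATE TEMPORARY TABLE scratch (
        a INT NOT NULL,
        b VARCHAR(100) NOT NULL
    ) ENGINE=MyISAM;
    -- ... load and massage data locally ...
    -- The target is InnoDB, so the resulting rows replicate to the other
    -- nodes via RBR.
    INSERT INTO real_table (a, b)
        SELECT a, b FROM scratch;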
GRANTs and related operations act on the MyISAM tables in the database `mysql`. The GRANT statements will(?) be replicated, but the underlying tables will not.
Many DDL changes can be achieved without downtime, even if they take a long time.
RSU vs TOI:
⚈ Rolling Schema Upgrade = manually do one node (offline) at a time
⚈ Total Order Isolation = Galera synchronizes the DDL so that it is done at the same point in the replication sequence on all nodes.
RSU and TOI
Caution: Since there is no way to synchronize the clients with the DDL, you must make sure that the clients are happy with either the old or the new schema. Otherwise, you will probably need to take down the entire cluster while simultaneously switching over both the schema and the client code.
A "fast" DDL may as well be done via TOI. This is a tentative list of such:
⚈ CREATE/DROP/RENAME DATABASE/TABLE
⚈ ALTER to change DEFAULT
⚈ ALTER to change definition of ENUM or SET (see caveats in manual)
⚈ Certain PARTITION ALTERs that are fast.
⚈ DROP INDEX (other than PRIMARY KEY)
⚈ ADD INDEX?
⚈ Other ALTERs on 'small' tables.
Otherwise, use RSU. Do the following for each node (a sketch follows this list):
⚈ Take node out of the cluster
⚈ Apply the DDL.
⚈ Put back into the cluster. (It will automatically sync.)
⚈ ?? RBR vs schema changes
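One way to script the per-node step is via wsrep_OSU_method (a sketch; the table and column in the ALTER are hypothetical):

    SET SESSION wsrep_OSU_method = 'RSU';   -- this node desyncs for the DDL
    ALTER TABLE big_table ADD COLUMN new_col INT NULL;
    SET SESSION wsrep_OSU_method = 'TOI';   -- back to the default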
Single "Master" Configuration
You can 'simulate' Master + Slaves by having clients write only to one node.
⚈ No need to check for errors after COMMIT.
⚈ Lose the latency benefits.
⚈ Lose some of the write scalability.
⚈ Remove node from cluster; back it up; put it back in. Syncup is automatic.
⚈ Remove node from cluster; use it for testing, etc; put it back in. Syncup is automatic.
⚈ Rolling hardware/software upgrade: Remove; upgrade; put back in. Repeat.
Do not use binlog-do-db and binlog-ignore-db.
Since no single node needs "durability", sync_binlog and innodb_flush_log_at_trx_commit are not needed. Set both to 0 (or possibly set innodb_flush_log_at_trx_commit to 2).
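For example (these could equally go in my.cnf; shown here as runtime settings):

    SET GLOBAL sync_binlog = 0;
    SET GLOBAL innodb_flush_log_at_trx_commit = 0;   -- or 2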
The Query cache is disabled by Galera, due to implementation difficulties and lack of strong need.
Until recently, FOREIGN KEYs were buggy.
LOAD DATA is auto-chunked. That is, it is passed to other nodes piecemeal, not all at once.
MariaDB's known issues with Galera
In a WAN cluster, traditional single-Master incurs a hop or two for remote actions. With Galera, the Master may be local, but there is one hop to sync the write. There are other examples of latency issues; one can argue that overall Galera has similar performance characteristics to single-Master.
DROP USER may not replicate?
A slight difference in ROLLBACK for conflicts: InnoDB rolls back the smaller transaction; Galera rolls back the last one.
Slide Deck for Galera
A note on GTIDs
5.6 and Galera each use GTIDs (Global Transaction IDs).
A 5.6 GTID is the combination of a _server_ UUID and a sequence number.
A Galera GTID is the combination of a _cluster_ UUID and a sequence number.
How many nodes to have in a Cluster
Have an odd number (at least 3) of nodes if all the servers are in the same 'vulnerability zone' -- eg, rack or data center.
When spanning colos, you need 3 data centers in order to be 'always' up, even during a colo failure. With only 2 data centers, Galera can survive one colo outage, but not the other. (You pick which.)
If you use 3 or 4 colos, these number of nodes per colo are safe:
⚈ 3 nodes: 1+1+1
⚈ 4 nodes: 1+1+1+1 (4 nodes won't work in 3 colos)
⚈ 5 nodes: 2+2+1, 2+1+1+1
⚈ 6 nodes: 2+2+2, 2+2+1+1
⚈ 7 nodes: 3+2+2, 3+3+1, 2+2+2+1, 3+2+1+1
There may be a way to "weight" the nodes differently; that would allow a few more configurations. With "weighting", give each colo the same weight; then subdivide the weight within each colo evenly. Four nodes in 3 colos: (1/6 + 1/6) + 1/3 + 1/3. That way, any single colo failure cannot lead to "split brain".
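If weighting is available in your version, the provider option pc.weight is the likely mechanism (an assumption to verify against your documentation). Scaling the 1/6 + 1/6 + 1/3 + 1/3 example to whole numbers gives weights 1, 1, 2, 2, so each colo totals 2 and no single colo holds half the votes:

    -- On each of the two nodes that share a colo:
    SET GLOBAL wsrep_provider_options = 'pc.weight=1';
    -- On each node that is alone in its colo:
    SET GLOBAL wsrep_provider_options = 'pc.weight=2';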
Large UPDATEs / DELETEs should be broken up. This admonition is valid for all databases, but there are additional issues in Galera.
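A common chunking pattern (a sketch; the table, column, and cutoff are hypothetical) keeps each write-set small:

    -- Repeat until it reports 0 rows affected; each statement is its own
    -- small transaction and write-set.
    DELETE FROM big_table WHERE created_at < '2015-01-01' LIMIT 1000;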
WAN: May need to increase (from the defaults) wsrep_provider_options = evs...
MySQL/Percona 5.6 or MariaDB 10 is recommended when going to Galera.