Galera is a multi-master cluster for MariaDB that replicates data using synchronous replication. Galera allows any node in the cluster to act as a master, so writes can be sent to any node at any time. This active-active configuration provides load balancing and fault tolerance, since no failover is required.
Although a Galera cluster, once configured, can run without much trouble, multiple node failures can cause havoc. Any node that stops communicating with the cluster is considered a failed node. This can happen for many reasons, including but not limited to hardware, network, or software failure, or plain old user error. The good news is that as long as a single good node is standing, cluster service continues. One of the common issues after a node failure is the node being unable to rejoin the cluster.
In this tutorial, we are going to learn how to recover a Galera cluster from a crash and, where possible, how to prevent future crashes.
Cluster Crash Scenarios
Depending on the number of failed nodes, we can categorize Galera cluster crashes into the following three scenarios:
- Single node failure
- Multi-node failure
- Full cluster failure
Of the three scenarios, a full cluster failure is the one that causes a complete loss of database connectivity. In normal circumstances, nodes need to restart for updates and regular maintenance. When a node is gracefully rebooted, it simply rejoins the cluster and syncs itself with the rest of the cluster. While the node is rebooting, it is normal to see that particular node missing from the cluster status, but this is not considered a cluster crash.
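For instance, a routine maintenance restart of a single node, followed by a quick check that it has rejoined and synced, might look like the following (the status query is explained in the next section):
$ systemctl restart mariadb
MariaDB [(none)]> show status like 'wsrep_local_state_comment';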
Verify Healthy Cluster
It is important to know how to verify that a Galera cluster is healthy so that, when disaster hits, we can quickly assess the state of the cluster. A healthy three-node Galera cluster will appear as follows from the database shell:
MariaDB [(none)]> show status like 'wsrep_incoming_addresses';
+--------------------------+----------------------------------------------+
| Variable_name            | Value                                        |
+--------------------------+----------------------------------------------+
| wsrep_incoming_addresses | 10.0.0.51:3306,10.0.0.52:3306,10.0.0.53:3306 |
+--------------------------+----------------------------------------------+
1 row in set (0.01 sec)
To check how many nodes are currently members of the cluster:
MariaDB [(none)]> show status like 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+
1 row in set (0.01 sec)
To retrieve the UUID of the cluster state:
MariaDB [(none)]> show status like 'wsrep_cluster_state_uuid';
+--------------------------+-----------------------------------------+
| Variable_name            | Value                                   |
+--------------------------+-----------------------------------------+
| wsrep_cluster_state_uuid | 345098abd2-291a-9893-acbd3-30923abcdef9 |
+--------------------------+-----------------------------------------+
1 row in set (0.01 sec)
To check whether the local node is synced with the cluster:
MariaDB [(none)]> show status like 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.01 sec)
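If we prefer not to open an interactive database session, the same health checks can be run in one shot from the shell. This is just a convenience sketch; adjust the user and authentication options to match your setup:
$ mysql -u root -p -e "show status where Variable_name in ('wsrep_incoming_addresses','wsrep_cluster_size','wsrep_cluster_status','wsrep_local_state_comment');"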
Single Node Failure Recovery
In this scenario, only one node in the cluster has failed. No data is lost, nor is database connectivity interrupted. During a node failure in a three-node cluster, the status will appear as follows:
MariaDB [(none)]> show status like 'wsrep_incoming_addresses';
+--------------------------+-------------------------------+
| Variable_name            | Value                         |
+--------------------------+-------------------------------+
| wsrep_incoming_addresses | 10.0.0.51:3306,10.0.0.52:3306 |
+--------------------------+-------------------------------+
1 row in set (0.01 sec)
MariaDB [(none)]> show status like 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 2     |
+--------------------+-------+
1 row in set (0.01 sec)
After the failed node is rebooted, check the status again to ensure it has rejoined the cluster. If for some reason it did not rejoin, simply restart the MariaDB service on that node:
$ systemctl restart mariadb
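If the node still refuses to rejoin, the service log usually explains why. On a systemd-based system (as implied by the restart command above), the most recent log entries on the failed node can be inspected like this:
$ journalctl -u mariadb -n 100
Once the node rejoins, wsrep_cluster_size should return to 3 and wsrep_local_state_comment should report Synced, as shown in the healthy-cluster checks earlier.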
Multi-node Crash Recovery
In this scenario, all nodes except one fail, which causes a loss of quorum. At this stage, the Galera cluster can no longer process SQL requests. Since one node is still up and running, no data is lost. Even if the failed nodes come back online, they cannot rejoin the cluster, because the cluster no longer exists. We can verify this using the cluster size and cluster status queries on the surviving node:
MariaDB [(none)]> show status like 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 1     |
+--------------------+-------+
1 row in set (0.01 sec)
MariaDB [(none)]> show status like 'wsrep_cluster_status';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+
1 row in set (0.01 sec)
In rare cases, the cluster status may show non-Primary. If it does, the problem may be not just the loss of quorum but also network connectivity. Ensure the cluster status on the node returns the value Primary before proceeding to reset the quorum.
Before we can proceed with resetting the quorum, we have to ensure that the surviving node indeed has the latest commits; the wsrep_last_committed status variable, covered later in the Recovery Based On Last Committed section, can be used for this check. There are two ways we can reset the quorum so the other nodes can rejoin the cluster:
Automatic Bootstrap
The simplest way to reset the quorum is by automatic bootstrap. We can run the following command in the database shell to auto bootstrap the node:
MariaDB [(none)]> set global wsrep_provider_options='pc.bootstrap=YES';
This will bootstrap the surviving node as the primary starting node so the other failed nodes can rejoin the cluster.
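Before bringing the other nodes back, it is worth confirming that the bootstrapped node now reports a Primary cluster of size 1, using the same status checks as before:
MariaDB [(none)]> show status like 'wsrep_cluster_status';
MariaDB [(none)]> show status like 'wsrep_cluster_size';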
Manual Bootstrap
Alternatively, we can bootstrap the node manually. Stop the MariaDB service, then start it again with the galera_new_cluster wrapper, which brings the service up in bootstrap mode (no separate restart is needed afterwards, since galera_new_cluster starts the service itself):
$ systemctl stop mariadb
$ galera_new_cluster
After the primary node is up and running, restart the MariaDB service on all the remaining nodes, one at a time.
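For example, on each remaining node in turn, restart the service and then confirm from any node that the cluster size has grown back to 3:
$ systemctl restart mariadb
MariaDB [(none)]> show status like 'wsrep_cluster_size';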
Full Cluster Recovery
In this scenario, all the nodes failed or did not shut down gracefully. A total loss of quorum has occurred, and the cluster is not accepting any SQL requests. After a hard crash like this, the MariaDB service will be unable to start even when all the nodes come back online. This is because of the unclean shutdown: none of the nodes was able to record its last commit. A Galera cluster can crash in various ways, and different crashes call for different recovery methods.
Recovery Based On Highest seqno Value
This method is helpful when there is a possibility that at least one node was able to shut down gracefully during the crash. The node with the latest data will have the highest seqno value among all the crashed nodes. We can find the clue in the content of /var/lib/mysql/grastate.dat, which shows the value of seqno. Depending on the nature of the crash, either all of the nodes will have an identical negative seqno value or one of the nodes will have the highest positive seqno value.
The following shows the content of grastate.dat on node 3. This node has a negative seqno and no group ID, which is the case when a node crashes during Data Definition Language (DDL) processing:
$ cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 00000000-0000-0000-0000-000000000000
seqno: -1
safe_to_bootstrap: 0
The following shows the content of grastate.dat on node 2. This node crashed during transaction processing, resulting in a negative seqno but a valid group ID:
$ cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 886dd8da-3d07-11e8-a109-8a3c80cebab4
seqno: -1
safe_to_bootstrap: 0
The following is the content of grastate.dat on node 1, which has the highest seqno value:
$ cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 886dd8da-3d07-11e8-a109-8a3c80cebab4
seqno: 31929
safe_to_bootstrap: 1
Note that a node will only have a positive, highest seqno value if it was able to shut down gracefully. This is the node that needs to be recovered first.
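To compare these values across all nodes quickly, something like the following loop can be run from an administration host. This is only a sketch; it assumes root SSH access to the nodes and uses the example IP addresses from earlier in this tutorial:
$ for node in 10.0.0.51 10.0.0.52 10.0.0.53; do echo "== $node =="; ssh root@$node 'cat /var/lib/mysql/grastate.dat'; done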
If all the nodes contain a value of -1 for seqno and 0 for safe_to_bootstrap, that is an indication that a full cluster crash has occurred. At this point, we could start the cluster using the galera_new_cluster command, but that is not recommended at all, since there is no way to know whether each node has an identical copy of the database data.
Before restarting node 1, we need to make a change in the cluster configuration file /etc/my.cnf.d/server.cnf to remove the IP addresses of the cluster nodes. The following is the content of the [galera] section of the configuration before any changes:
[galera]
# Mandatory settings
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://10.0.0.51,10.0.0.52,10.0.0.53"
wsrep_cluster_name='galeraCluster01'
wsrep_node_address='10.0.0.51'
wsrep_node_name='galera-01'
wsrep_sst_method=rsync
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
Note that wsrep_cluster_address lists the IPs of all the member nodes. We need to remove the addresses as follows:
wsrep_cluster_address="gcomm://"
We can now restart the MariaDB service on this node:
$ systemctl restart mariadb
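One way to verify that the service is up and that the node has formed a working single-node cluster (it should report Primary with a cluster size of 1):
$ systemctl status mariadb
MariaDB [(none)]> show status like 'wsrep_cluster_status';
MariaDB [(none)]> show status like 'wsrep_cluster_size';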
Only after verifying that the service started successfully should we proceed to restart the service on the other nodes, one at a time. Once all the nodes are running successfully, we need to edit the cluster configuration on node 1 to add back the IP addresses of all the member nodes and restart the service:
wsrep_cluster_address="gcomm://10.8.8.53,10.8.8.54,10.8.8.55"
The Galera cluster should be up and running at this point, and all nodes should be syncing with node 1.
Recovery Based On Last Committed
This is the worst-case scenario of a Galera cluster crash, where all nodes have completely crashed, resulting in a seqno value of -1 on every node. As mentioned earlier, resist the temptation to run the galera_new_cluster command on a node and then rejoin the rest of the nodes to the cluster before checking which node has the latest commit. When the galera_new_cluster command is used this way, it creates a new cluster with a new set of IDs, and all the other nodes then join it and perform a clean sync.
To check which node has the last commit, we can check the value of wsrep_last_committed on each node separately. The node with the highest value is the one with the latest commit. We can bootstrap that node to start the cluster and then join the other member nodes. This process is similar to bootstrapping the node with the highest seqno, as we saw in the previous section.
Stop the MariaDB service:
$ systemctl stop mariadb
Edit wsrep_cluster_address in the [galera] section of /etc/my.cnf.d/server.cnf to remove the member node addresses:
wsrep_cluster_address="gcomm://"
Start the MariaDB service:
$ systemctl start mariadb
From the database shell, check the last committed value:
MariaDB [(none)]> show status like 'wsrep_last_committed';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_last_committed | 319589  |
+----------------------+---------+
1 row in set (0.01 sec)
Repeat the process on all nodes to retrieve the last committed value. The node with the latest data will have the highest value. Stop the MariaDB service on all the nodes, then create a new cluster on the node with the highest committed value:
$ galera_new_cluster
Change the value of wsrep_cluster_address on the rest of the nodes to include the member IP addresses again, then restart the MariaDB service one node at a time. The cluster should be up and running, and the data sync should replicate all the changes. After a while, check the last committed value on all nodes to ensure they are now in sync.
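For example, after the cluster has had time to sync, running the following on each node should show the full cluster size and matching last-committed values across all nodes:
MariaDB [(none)]> show status like 'wsrep_cluster_size';
MariaDB [(none)]> show status like 'wsrep_last_committed';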