Learn how to bootstrap your MySQL cluster in the event of a cluster failure.
You can bootstrap your cluster by using one of two methods:
You must bootstrap a cluster that loses quorum. A cluster loses quorum when fewer than half the nodes can communicate with each other for longer than the configured grace period. If a cluster does not lose quorum, individual unhealthy nodes automatically rejoin the cluster after resolving the error, restarting the node, or restoring connectivity.
To check whether your cluster has lost quorum, look for the following symptoms:
All nodes appear “Unhealthy” on the proxy dashboard. To log in to the proxy dashboard, see Log in to proxy dashboard.
All responsive nodes report the value of wsrep_cluster_status
as non-Primary
:
mysql> SHOW STATUS LIKE 'wsrep_cluster_status'; +----------------------+-------------+ | Variable_name | Value | +----------------------+-------------+ | wsrep_cluster_status | non-Primary | +----------------------+-------------+
All unresponsive nodes respond with ERROR 1047
when using most statement types in the MySQL client:
mysql> select * from mysql.user; ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
To log in to the proxy dashboard:
Get the proxy dashboard credentials by entering the following URL into your web browser and recording the output:
https://OPS-MAN-FQDN/api/v0/deployed/products/TAS-FOR-VMS-GUID/variables?name=mysql-proxy-dashboard-credentials
Where:
OPS-MAN-FQDN
is the fully-qualified domain name (FQDN) for your Ops Manager deployment.TAS-FOR-VMS-GUID
is the GUID of your TAS for VMs tile. For more information about finding your GUID, see the Tanzu Operations Manager API ReferenceLog in to the proxy dashboard by entering the following URL into your web browser:
https://BOSH-JOB-INDEX-proxy-p-mysql-ert.YOUR-SYSTEM-DOMAIN/
Where:
BOSH-JOB-INDEX
is 0
, 1
, or 2
, representing the three proxies deployed on each node.SYSTEM-DOMAIN
is the System domain you configured in the Domains pane of the BOSH Director tile.Ops Manager includes a BOSH errand that automates the manual bootstrapping procedure in the Bootstrap Manually section below.
It finds the node with the highest transaction sequence number and asks it to start up by itself (i.e. in bootstrap mode) and then asks the remaining nodes to join the cluster.
In most cases, running the errand recovers your cluster. But, certain scenarios require additional steps.
To determine which set of instructions to follow:
List your MySQL instances by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
Where:
YOUR-ENV
is the environment where you deployed the cluster.YOUR-DEPLOYMENT
is the deployment cluster name.For example:
$ bosh -e prod -d mysql instances
Find and record the Process State
for your MySQL instances. In the following example output, the MySQL instances are in the failing
process state.
Instance Process State AZ IPs backup-restore/c635410e-917d-46aa-b054-86d222b6d1c0 running us-central1-b 10.0.4.47 bootstrap/a31af4ff-e1df-4ff1-a781-abc3c6320ed4 - us-central1-b - broker-registrar/1a93e53d-af7c-4308-85d4-3b2b80d504e4 - us-central1-b 10.0.4.58 cf-mysql-broker/137d52b8-a1b0-41f3-847f-c44f51f87728 running us-central1-c 10.0.4.57 cf-mysql-broker/28b463b1-cc12-42bf-b34b-82ca7c417c41 running us-central1-b 10.0.4.56 deregister-and-purge-instances/4cb93432-4d90-4f1d-8152-d0c238fa5aab - us-central1-b - monitoring/f7117dcb-1c22-495e-a99e-cf2add90dea9 running us-central1-b 10.0.4.48 mysql/220fe72a-9026-4e2e-9fe3-1f5c0b6bf09b failing us-central1-b 10.0.4.44 mysql/28a210ac-cb98-4ab4-9672-9f4c661c57b8 failing us-central1-f 10.0.4.46 mysql/c1639373-26a2-44ce-85db-c9fe5a42964b failing us-central1-c 10.0.4.45 proxy/87c5683d-12f5-426c-b925-62521529f64a running us-central1-b 10.0.4.60 proxy/b0115ccd-7973-42d3-b6de-edb5ae53c63e running us-central1-c 10.0.4.61 rejoin-unsafe/8ce9370a-e86b-4638-bf76-e103f858413f - us-central1-b - smoke-tests/e026aaef-efd9-4644-8d14-0811cb1ba733 - us-central1-b 10.0.4.59
Choose your scenario:
failing
state, continue to Scenario 1.-
state, continue to Scenario 2.stopped
state, continue to Scenario 3.In this scenario, the VMs are running, but the cluster has been disrupted.
To bootstrap in this scenario:
Run the bootstrap errand on the VM where the bootstrap errand is co-located, by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT run-errand --instance=VM bootstrap
Where:
YOUR-ENV
is the name of your environment.YOUR-DEPLOYMENT
is the name of your deployment.VM
is clock_global/0
if you use TAS for VMs and control/0
if you use Small Footprint TAS for VMs.Note: The errand runs for a long time, during which no output is returned.
The command returns many lines of output, eventually followed by:
Bootstrap errand completed [stderr] + echo 'Started bootstrap errand ...' + JOB\_DIR=/var/vcap/jobs/bootstrap + CONFIG\_PATH=/var/vcap/jobs/bootstrap/config/config.yml + /var/vcap/packages/bootstrap/bin/cf-mysql-bootstrap -configPath=/var/vcap/jobs/bootstrap/config/config.yml + echo 'Bootstrap errand completed' + exit 0 Errand 'bootstrap' completed successfully (exit code 0)
If the errand fails, run the bootstrap errand command again after a few minutes. The bootstrap errand might not work the first time.
If the errand fails after several tries, bootstrap your cluster manually. See Bootstrap Manually.
In severe circumstances, such as a power failure, it is possible to lose all your VMs. You must re-create them before you can begin recovering the cluster.
When MySQL instances are in the -
state, the VMs are lost. The procedures in this scenario bring the instances from a -
state to a failing
state. Then you run the bootstrap errand similar to Scenario 1 above and restore configuration.
To recover terminated or lost VMs, follow the procedures in the sections below:
Recreate the missing VMs: Bring MySQL instances from a -
state to a failing
state.
Run the Bootstrap Errand: Because your instances are now in the failing
state, continue similarly to Scenario 1 above.
Restore the BOSH Configuration: Go back to unignoring all instances and redeploy. This is a critical and mandatory step.
If you do not set each of your ignored instances to unignore
, your instances are not updated in future deploys. You must follow the procedure in the final section of Scenario 2, Restore the BOSH Configuration.
The procedure in this section uses BOSH to re-create the VMs, install software on them, and try to start the jobs.
The following procedure allows you to:
Redeploy your cluster while expecting the jobs to fail.
Instruct BOSH to ignore the state of each instance in your cluster. This allows BOSH to deploy the software to each instance even if the instance is failing.
To recreate your missing VMs:
If BOSH Resurrector is activated, deactivate it by running:
bosh -e YOUR-ENV update-resurrection off
Where YOUR-ENV
is the name of your environment.
Download the current manifest by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT manifest > /tmp/manifest.yml
Where:
YOUR-ENV
is the name of your environment.YOUR-DEPLOYMENT
is the name of your deployment.Redeploy deployment by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT deploy /tmp/manifest.yml
Where:
YOUR-ENV
is the name of your environment.YOUR-DEPLOYMENT
is the name of your deployment.Expect one of the MySQL VMs to fail. Deploying causes BOSH to create new VMs and install the software. Forming a cluster is in a subsequent step.
View the instance GUID of the MySQL VM that attempted to start by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
Where:
YOUR-ENV
is the name of your environment.YOUR-DEPLOYMENT
is the name of your deployment.Record the instance GUID, which is the string after mysql/
in your BOSH instances output.
Instruct BOSH to ignore the MySQL VM that just attempted to start, by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT ignore mysql/INSTANCE-GUID
Where:
YOUR-ENV
is the name of your environment.YOUR-DEPLOYMENT
is the name of your deployment.INSTANCE-GUID
is the GUID of your instance you recorded in the previous step.Repeat steps 3 through 5 until all instances have attempted to start.
If you deactivated the BOSH Resurrector in step 1, re-enable it by running:
bosh -e YOUR-ENV update-resurrection on
Where YOUR-ENV
is the name of your environment.
Confirm that your MySQL instances have gone from the -
state to the failing
state by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
Where:
YOUR-ENV
is the name of your environment.YOUR-DEPLOYMENT
is the name of your deployment.After you recreate the VMs, all instances now have a failing
process state and have the MySQL code. You must run the bootstrap errand to recover the cluster.
To bootstrap:
Run the bootstrap errand by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT run-errand --instance=VM bootstrap
Where:
YOUR-ENV
is the name of your environment.YOUR-DEPLOYMENT
is the name of your deployment.VM
is clock_global/0
if you use TAS for VMs and control/0
if you use Small Footprint TAS for VMs.The errand runs for a long time, during which no output is returned.
The command returns many lines of output, eventually with the following successful output:
Bootstrap errand completed [stderr] echo 'Started bootstrap errand ...' JOB\_DIR=/var/vcap/jobs/bootstrap CONFIG\_PATH=/var/vcap/jobs/bootstrap/config/config.yml /var/vcap/packages/bootstrap/bin/cf-mysql-bootstrap -configPath=/var/vcap/jobs/bootstrap/config/config.yml echo 'Bootstrap errand completed' exit 0 Errand 'bootstrap' completed successfully (exit code 0)
If the errand fails, run the bootstrap errand command again after a few minutes. The bootstrap errand might not work immediately.
See that the errand completes successfully in the shell output and continue to Restore the BOSH Configuration below.
After you complete the bootstrap errand, you might still see instances in the failing
state. Continue to the next section anyway.
If you do not set each of your ignored instances to unignore
, your instances are never updated in future deploys.
To restore your BOSH configuration to its previous state, this procedure unignores each instance that was previously ignored:
For each ignored instance, run:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT unignore mysql/INSTANCE-GUID
Where:
YOUR-ENV
is the name of your environment.YOUR-DEPLOYMENT
is the name of your deployment.INSTANCE-GUID
is the GUID of your instance.Redeploy your deployment by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT deploy
Where:
YOUR-ENV
is the name of your environment.YOUR-DEPLOYMENT
is the name of your deployment.Verify that all mysql
instances are in a running
state by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
Where:
YOUR-ENV
is the name of your environment.YOUR-DEPLOYMENT
is the name of your deployment.If the bootstrap errand is not able to automatically recover the cluster, you need to do the steps manually.
Follow the procedures in the sections below to manually bootstrap your cluster.
Important The following procedures are prone to user-error and can result in lost data if followed incorrectly. Follow the procedure in Bootstrap with the BOSH Errand above first, and only resort to the manual process if the errand fails to repair the cluster.
Follow these steps to stop the galera-init
process for each node in the cluster. For each node, record if the monit stop
command was successful:
SSH into the node using the procedure in Advanced troubleshooting with the BOSH CLI.
To shut down the mysqld
process on the node, run:
monit stop galera-init
Record if the monit command succeeds or exits with an error:
galera-init
, then you can use monit to restart this node. Follow all the steps below including the steps marked Monit Restart but omitting the steps marked Manual Redeploy.If monit exits with the following error, then you must manually deploy this node:
Warning: include files not found '/var/vcap/monit/job/*.monitrc' monit: action failed -- There is no service by that nameFollow all the steps below including the steps marked **Manual Redeploy** but omitting the steps marked **Monit Restart**.
Repeat the preceding steps for each node in the cluster.
You cannot bootstrap the cluster unless you have shut down the mysqld
process on all nodes in the cluster.
To avoid losing data, you must bootstrap from a node in the cluster that has the highest transaction sequence number (seqno
).
For each node in the cluster, to find its seqno
:
SSH into the node using the procedure in Advanced troubleshooting with the BOSH CLI.
Find seqno
values in the node’s Galera state file, grastate.dat
by running:
cat /var/vcap/store/pxc-mysql/grastate.dat | grep 'seqno:'
Do one of the following, based on the last and highest seqno
value in the grep
output above:
seqno
is a positive integer, then the node shut down gracefully. Record this number for comparison with the latest seqno
of other nodes in the cluster.If seqno
is -1
, then the node crashed or was stopped. Proceed as follows to attempt to recover the seqno
from the database:
Temporarily start the database and append the last sequence number to its error log by running:
$ /var/vcap/jobs/pxc-mysql/bin/get-sequence-number
The output of the get-sequence-number
utility will look similar to this:
{ "cluster_uuid": "1f594c30-a709-11ed-a00e-5330bbda96d3", "seqno": 4237, "instance_id": "4213a73e-069f-4ac7-b01b-43068ab312b6" }
The seqno
in the above output is 4237
.
After you retrieve the seqno
for all nodes in your cluster, identify the one with the highest seqno
. If multiple nodes share the same highest seqno
, and it is not -1
, you can bootstrap from any of them.
After determining the node with the highest seqno
, do the following to bootstrap the node:
Only run these bootstrap commands on the node with the highest seqno
. Otherwise the node with the highest seqno
is unable to join the new cluster unless its data is abandoned. Its mysqld
process exits with an error.
SSH into the node using the procedure in Advanced troubleshooting with BOSH CLI.
Update the node state to trigger its initialization of the cluster by running:
echo -n "NEEDS_BOOTSTRAP" > /var/vcap/store/pxc-mysql/state.txt
Monit Restart: If in Shut Down MySQL above you successfully used monit
to shut down your galera-init
process, then relaunch the mysqld
process on the new bootstrap node.
Start the mysqld
process by running:
monit start galera-init
It can take up to ten minutes for monit
to start the mysqld
process. To confirm if the mysqld
process has started successfully, run:
watch monit summary
If monit succeeds in starting the galera-init process, then the output includes the line Process 'galera-init' running
.
Manual Redeploy: If in Shut Down MySQL above you encountered monit
errors, then redeploy the mysqld
software to your bootstrap node as follows:
Leave the MySQL SSH login shell and return to your local environment.
Target BOSH on your bootstrap node by instructing it to ignore the other nodes in your cluster. For nodes all nodes except the bootstrap node you identified above, run:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT ignore mysql/M
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT ignore mysql/N
Where N
and M
are the numbers of the non-bootstrapped nodes. For example, if you bootstrap node 0, then M
=1 and N
=2.
Turn off the BOSH Resurrector by running:
bosh update-resurrection off
Use the BOSH manifest to bootstrap your bootstrap machine by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT manifest > /tmp/manifest.yml
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT deploy /tmp/manifest.yml
After the bootstrapped node is running, restart the remaining nodes.
The procedure that you follow for restarting a node, depends on the output you got for that node during Shut Down MySQL. Do one of the following procedures:
If in Shut Down MySQL above you successfully used monit
to shut down your galera-init
process, then restart the nodes as follows:
SSH into the node using the procedure in Advanced troubleshooting with BOSH CLI.
Start the mysqld
process with monit
by running:
monit start galera-init
If the interruptor prevents the node from starting, follow the manual procedure to force the node to rejoin the cluster. See How to manually force a MySQL node to rejoin the HA cluster in the Support knowledge base.
Forcing a node to rejoin the cluster is a destructive procedure. Only follow the procedure with the assistance of Support.
If the monit start
command fails, it might be because the node with the highest seqno
is mysql/0
.
In this case:
Ensure BOSH ignores updating mysql/0
by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT ignore mysql/0
Where:
YOUR-ENV
is the name of your environment.YOUR-DEPLOYMENT
is the name of your deployment.Go to Ops Manager in a browser, log in, and click Apply Changes.
When the deploy finishes, run the following command from the Ops Manager VM:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT unignore mysql/0
Where:
YOUR-ENV
is the name of your environment.YOUR-DEPLOYMENT
is the name of your deployment.If in Shut Down MySQL you encountered monit
errors, then restart the nodes as follows:
Instruct BOSH to no longer ignore the non-bootstrap nodes in your cluster by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT unignore mysql/M
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT unignore mysql/N
Where N
and M
are the numbers of the non-bootstrapped nodes. For example, if you bootstrap node 0, then M
=1 and N
=2.
Redeploy software to the other two nodes and have them rejoin the cluster, bootstrapped from the node above by running:
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT deploy /tmp/manifest.yml
You only need to run this command once to deploy both the nodes that you unignored in the step above.
With your redeploys completed, turn the BOSH Resurrector back on by running:
bosh -e YOUR-ENV update-resurrection on
The final task is to verify that all the nodes have joined the cluster.
SSH into the bootstrap node then run the following command to output the total number of nodes in the cluster:
mysql> SHOW STATUS LIKE 'wsrep_cluster_size';