This topic provides troubleshooting information for common issues when using VMware SQL with MySQL for Kubernetes
When updating an existing single-node instance to HA, the instance status stays “Pending” and fails to become “Ready”.
$ kubectl get mysql mysql-sample
NAME READY STATUS VERSION AGE
mysql.with.sql.tanzu.vmware.com/mysql-sample false Pending 1.3.0 13m
You may see Events associated with the instance, indicating that the InnoDB Cluster failed to initialize.
For example:
$ kubectl describe mysql mysql-sample
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ClusterTopologyUnknown 2m3s (x25 over 2m26s) innodb-cluster-controller Unable to determine if InnoDB Cluster needs to be created
Warning ClusterInitFailed 17s (x5 over 108s) innodb-cluster-controller Failed to initialize InnoDB Cluster
Examine the instance’s mysql-sidecar container’s logs for an error similar to missing a “Primary Key or equivalent column”.
For example:
$ kubectl logs mysql-sample-0 -c mysql-sidecar
...
ERROR: The following tables do not have a Primary Key or equivalent column:
mysqlappuser_data.t1
...
Your database table was created without a primary key, or a primary-key equivalent (non-null unique key). Such a table is compatible with single-node databases, but is incompatible with the MySQL Group Replication technology used by HA instances. As a requirement of MySQL Group Replication used by HA instances, all tables must have a primary key or equivalent.
To remedy this, add primary keys, or equivalent non-null unique keys, to the tables listed in the error message.
For example:
mysql> ALTER TABLE t1 ADD PRIMARY KEY (a);
Query OK, 0 rows affected (0.08 sec)
Records: 0 Duplicates: 0 Warnings: 0
Without further intervention, the operator should finish updating the instance with HA.
When you list the pods, a pod’s containers might not all be READY. For example, you may see 2/3
containers instead of 3/3
for a data pod:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
mysql-sample-0 2/3 Running 0 2m13s
mysql-sample-1 3/3 Running 1 (100s ago) 2m49s
mysql-sample-2 3/3 Running 1 (79s ago) 2m48s
mysql-sample-proxy-0 1/1 Running 0 2m48s
mysql-sample-proxy-1 1/1 Running 0 68s
When you describe the pod, under Events, the Message might include the cause, for example a failing Readiness probe:
$ kubectl describe pod mysql-sample-0
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 21s kubelet Readiness probe failed: Get "http://100.96.1.23:8081/readyz": dial tcp 100.96.1.23:8081: connect: connection refused
Warning Unhealthy 13s (x3 over 20s) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
Check for any error causes in the logs of the pod’s mysql-sidecar
container:
$ kubectl logs pod mysql-sample-0 -c mysql-sidecar
...
2022-12-22T18:56:16.371Z DEBUG controller-runtime.healthz healthz check failed {"checker": "cluster-member-online", "error": "not online"}
In the affected pod, check the HA topology view:
$ kubectl exec pod/mysql-sample-0 -c mysql-sidecar -- \
mysqlsh --credential-store-helper=tanzumysql \
innodb-cluster-admin@localhost --sql --table \
--execute="select MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE from performance_schema.replication_group_members;"
If the output table indicates that a MEMBER_STATE
is recovering or unreachable, refer to the appropriate section below.
For more details on server states refer to Group Replication Server States.
+---------------------------------------------------------------+------------------+-------------+
| MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE |
+---------------------------------------------------------------+------------------+-------------+
| mysql-sample-0.mysql-sample-members.default.svc.cluster.local | RECOVERING | SECONDARY |
| mysql-sample-1.mysql-sample-members.default.svc.cluster.local | ONLINE | SECONDARY |
| mysql-sample-2.mysql-sample-members.default.svc.cluster.local | ONLINE | PRIMARY |
+---------------------------------------------------------------+------------------+-------------+
This table shows that the MySQL server on pod 0 is still an active member and is currently recovering. Check that it is actually catching up on missed transactions. In the example below, set INSTANCE_NAME
to your MySQL instance name.
$ INSTANCE_NAME=mysql-sample
$ for i in 0 1 2; do kubectl exec pod/$INSTANCE_NAME-$i -c mysql-sidecar -- \
mysqlsh --credential-store-helper=tanzumysql innodb-cluster-admin@localhost --sql --table \
--execute="select @@gtid_executed;"; done
+--------------------------------------------------------------------------------------+
| @@gtid_executed |
+--------------------------------------------------------------------------------------+
| 4becca9a-822a-11ed-9ce4-aeb71ec961d2:1-509125,
4becd020-822a-11ed-9ce4-aeb71ec961d2:1-6 |
+--------------------------------------------------------------------------------------+
+--------------------------------------------------------------------------------------+
| @@gtid_executed |
+--------------------------------------------------------------------------------------+
| 4becca9a-822a-11ed-9ce4-aeb71ec961d2:1-510231,
4becd020-822a-11ed-9ce4-aeb71ec961d2:1-6 |
+--------------------------------------------------------------------------------------+
+--------------------------------------------------------------------------------------+
| @@gtid_executed |
+--------------------------------------------------------------------------------------+
| 4becca9a-822a-11ed-9ce4-aeb71ec961d2:1-510231,
4becd020-822a-11ed-9ce4-aeb71ec961d2:1-6 |
+--------------------------------------------------------------------------------------+
Note that the MySQL server on pod 0 does not yet contain all the transactions executed on the primary (509125
is about 1000 transactions behind 510231
). Assuming there is no replication lag or other issue affecting replication recovery, the MySQL server will eventually fully recover and the Readiness probe will stop reporting errors.
+---------------------------------------------------------------+------------------+-------------+
| MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE |
+---------------------------------------------------------------+------------------+-------------+
| mysql-sample-0.mysql-sample-members.default.svc.cluster.local | UNREACHABLE | SECONDARY |
| mysql-sample-1.mysql-sample-members.default.svc.cluster.local | ONLINE | SECONDARY |
| mysql-sample-2.mysql-sample-members.default.svc.cluster.local | ONLINE | PRIMARY |
+---------------------------------------------------------------+------------------+-------------+
This table shows that the MySQL server on pod 0 is unreachable.
The MySQL servers are configured to use the following settings for recovery:
group_replication_unreachable_majority_timeout = 30
group_replication_autorejoin_tries = 1
group_replication_exit_state_action = ABORT_SERVER
group_replication_member_expel_timeout = 5
For more details refer to Failure Detection and Network Partitioning
The VMware MySQL for Kubernetes sets the unreachable majority timeout to 30 seconds so that a MySQL server with no quorum will exit the group after 30 seconds. The server will set itself to be read-only, and report MEMBER_STATE
as ERROR
in the performance_schema.replication_group_members
table. It will stay in this state independent of the configured exit state action until the number of auto rejoin attempts has been exhausted, at which point the server will evaluate the exit state action. Since the exit state action is set to ABORT_SERVER
, the mysql
container will be terminated and recreated.
Check the logs for another MySQL pod to see the unreachable member is eventually expelled from the group.
$ kubectl logs pod/mysql-sample-2 -c mysql
...
Members removed from the group: mysql-sample-0.mysql-sample-members.default.svc.cluster.local:3306
Group membership changed to mysql-sample-1.mysql-sample-members.default.svc.cluster.local:3306, mysql-sample-2.mysql-sample-members.default.svc.cluster.local:3306 on view 16717353800071640:10.
Check the members table to see the unreachable member removed.
$ kubectl exec pod/mysql-sample-2 -c mysql-sidecar -- \
mysqlsh --credential-store-helper=tanzumysql innodb-cluster-admin@localhost --sql --table \
--execute="select MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE from performance_schema.replication_group_members;"
+---------------------------------------------------------------+------------------+-------------+
| MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE |
+---------------------------------------------------------------+------------------+-------------+
| mysql-sample-1.mysql-sample-members.default.svc.cluster.local | ONLINE | SECONDARY |
| mysql-sample-2.mysql-sample-members.default.svc.cluster.local | ONLINE | PRIMARY |
+---------------------------------------------------------------+------------------+-------------+
The container on pod 0 will restart and in most cases, the MySQL server will be to rejoin the group.
If there is a network issue or other members of the group are also unhealthy, then the MySQL server may not automatically rejoin.
Check the view of the group.
$ kubectl exec pod/mysql-sample-2 -c mysql-sidecar -- \
mysqlsh --credential-store-helper=tanzumysql innodb-cluster-admin@localhost --sql --table \
--execute="select MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE from performance_schema.replication_group_members;"
+---------------------------------------------------------------+--------------+-------------+
| MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE |
+---------------------------------------------------------------+--------------+-------------+
| mysql-sample-0.mysql-sample-members.default.svc.cluster.local | RECOVERING | SECONDARY |
| mysql-sample-1.mysql-sample-members.default.svc.cluster.local | ONLINE | SECONDARY |
| mysql-sample-2.mysql-sample-members.default.svc.cluster.local | ONLINE | PRIMARY |
+---------------------------------------------------------------+--------------+-------------+
Note that the pod 0 has rejoined and is currently recovering. Observe the table to wait for recovery to finish. The Readiness probe will stop reporting errors once the member is online.
If the member continues to report unreachable and is not expelled from the group, then group replication may not be running correctly on the pods.
Check the status of the Group Replication Applier on each MySQL server.
$ INSTANCE_NAME=mysql-sample
$ for i in 0 1 2; do kubectl exec $INSTANCE_NAME-$i -c mysql-sidecar -- \
mysqlsh --credential-store-helper=tanzumysql innodb-cluster-admin@localhost --sql --table \
--execute="SELECT * FROM information_schema.processlist WHERE INFO = 'Group replication applier module'"; done
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
| ID | USER | HOST | DB | COMMAND | TIME | STATE | INFO | TIME_MS | ROWS_SENT | ROWS_EXAMINED |
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
| 20 | system user | | NULL | killed | 6828 | waiting for handler commit | Group replication applier module | 6828439 | 0 | 0 |
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
| ID | USER | HOST | DB | COMMAND | TIME | STATE | INFO | TIME_MS | ROWS_SENT | ROWS_EXAMINED |
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
| 12 | system user | | NULL | Connect | 6799 | waiting for handler commit | Group replication applier module | 6799254 | 0 | 0 |
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
| ID | USER | HOST | DB | COMMAND | TIME | STATE | INFO | TIME_MS | ROWS_SENT | ROWS_EXAMINED |
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
| 22 | system user | | NULL | Connect | 1172 | waiting for handler commit | Group replication applier module | 1172068 | 0 | 0 |
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
If COMMAND
shows Killed
instead of Connect
for the group replication applier module, then group replication is not running.
Fetch the MySQL logs on the pod that is affected. Errors in the logs can help troubleshoot the cause.
$ kubectl logs pod/mysql-sample-0
If recovery is urgent, delete the pod so that the container can restart, and start the MySQL server along with group replication.