This topic provides troubleshooting information for common issues when using VMware SQL with MySQL for Kubernetes.

Errors while updating a single-node instance to HA

Symptoms

When updating an existing single-node instance to HA, the instance status stays “Pending” and fails to become “Ready”.

$ kubectl get mysql mysql-sample
NAME                                           READY   STATUS    VERSION  AGE
mysql.with.sql.tanzu.vmware.com/mysql-sample   false   Pending   1.3.0    13m
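
For reference, an HA update is typically triggered by enabling high availability in the MySQL instance spec. The field path shown below is an assumption based on the MySQL custom resource; verify it against your release before applying:

$ kubectl patch mysql mysql-sample --type merge \
    -p '{"spec": {"highAvailability": {"enabled": true}}}'   # assumed field path; check your CRD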

You may see Events associated with the instance, indicating that the InnoDB Cluster failed to initialize.

For example:

$ kubectl describe mysql mysql-sample
...
Events:
  Type     Reason                  Age                    From                       Message
  ----     ------                  ----                   ----                       -------
  Warning  ClusterTopologyUnknown  2m3s (x25 over 2m26s)  innodb-cluster-controller  Unable to determine if InnoDB Cluster needs to be created
  Warning  ClusterInitFailed       17s (x5 over 108s)     innodb-cluster-controller  Failed to initialize InnoDB Cluster

Diagnosis

Examine the logs of the instance’s mysql-sidecar container for an error about a missing “Primary Key or equivalent column”.

For example:

$ kubectl logs mysql-sample-0 -c mysql-sidecar
...
ERROR: The following tables do not have a Primary Key or equivalent column:
mysqlappuser_data.t1
...

The listed tables were created without a primary key or a primary-key equivalent (a non-null unique key). Such tables are compatible with a single-node instance, but the MySQL Group Replication technology used by HA instances requires every table to have a primary key or equivalent.
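
To list candidate tables ahead of the update, you can query information_schema for user tables that lack a PRIMARY KEY constraint. This is a sketch only; it does not detect non-null unique keys that would also satisfy the requirement:

mysql> SELECT table_schema, table_name
       FROM information_schema.tables
       WHERE table_type = 'BASE TABLE'
         AND table_schema NOT IN ('mysql', 'sys', 'information_schema', 'performance_schema')
         AND (table_schema, table_name) NOT IN (
             SELECT table_schema, table_name
             FROM information_schema.table_constraints
             WHERE constraint_type = 'PRIMARY KEY');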

Remedy

To remedy this, add primary keys, or equivalent non-null unique keys, to the tables listed in the error message.

For example:

mysql> ALTER TABLE t1 ADD PRIMARY KEY (a);
Query OK, 0 rows affected (0.08 sec)
Records: 0  Duplicates: 0  Warnings: 0

Without further intervention, the operator should then complete the update of the instance to HA.
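
To follow progress, you can watch the instance until it reports Ready, for example:

$ kubectl get mysql mysql-sample -w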

Pod containers not READY in an HA topology

Symptoms

When you list the pods, a pod’s containers might not all be READY. For example, you may see 2/3 containers instead of 3/3 for a data pod:

$ kubectl get pods
NAME                   READY   STATUS    RESTARTS       AGE
mysql-sample-0         2/3     Running   0              2m13s
mysql-sample-1         3/3     Running   1 (100s ago)   2m49s
mysql-sample-2         3/3     Running   1 (79s ago)    2m48s
mysql-sample-proxy-0   1/1     Running   0              2m48s
mysql-sample-proxy-1   1/1     Running   0              68s

When you describe the pod, under Events, the Message might include the cause, for example a failing Readiness probe:

$ kubectl describe pod mysql-sample-0
...
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Warning  Unhealthy  21s                kubelet            Readiness probe failed: Get "http://100.96.1.23:8081/readyz": dial tcp 100.96.1.23:8081: connect: connection refused
  Warning  Unhealthy  13s (x3 over 20s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500

Check for any error causes in the logs of the pod’s mysql-sidecar container:

$ kubectl logs pod/mysql-sample-0 -c mysql-sidecar
...
2022-12-22T18:56:16.371Z	DEBUG	controller-runtime.healthz	healthz check failed	{"checker": "cluster-member-online", "error": "not online"}

In the affected pod, check the HA topology view:

$ kubectl exec pod/mysql-sample-0 -c mysql-sidecar -- \
     mysqlsh --credential-store-helper=tanzumysql \
     innodb-cluster-admin@localhost --sql --table \
     --execute="select MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE from performance_schema.replication_group_members;"

If the output shows a member whose MEMBER_STATE is RECOVERING or UNREACHABLE, refer to the corresponding section below.

For more details on server states, refer to Group Replication Server States.

Diagnosis: MEMBER_STATE recovering

+---------------------------------------------------------------+------------------+-------------+
| MEMBER_HOST                                                   | MEMBER_STATE     | MEMBER_ROLE |
+---------------------------------------------------------------+------------------+-------------+
| mysql-sample-0.mysql-sample-members.default.svc.cluster.local | RECOVERING       | SECONDARY   |
| mysql-sample-1.mysql-sample-members.default.svc.cluster.local | ONLINE           | SECONDARY   |
| mysql-sample-2.mysql-sample-members.default.svc.cluster.local | ONLINE           | PRIMARY     |
+---------------------------------------------------------------+------------------+-------------+

This table shows that the MySQL server on pod 0 is still an active member and is currently recovering. Check that it is actually catching up on missed transactions. In the example below, set INSTANCE_NAME to your MySQL instance name.

$ INSTANCE_NAME=mysql-sample
$ for i in 0 1 2; do kubectl exec pod/$INSTANCE_NAME-$i -c mysql-sidecar -- \
    mysqlsh --credential-store-helper=tanzumysql innodb-cluster-admin@localhost --sql --table \
      --execute="select @@gtid_executed;"; done
+--------------------------------------------------------------------------------------+
| @@gtid_executed                                                                      |
+--------------------------------------------------------------------------------------+
| 4becca9a-822a-11ed-9ce4-aeb71ec961d2:1-509125,
4becd020-822a-11ed-9ce4-aeb71ec961d2:1-6 |
+--------------------------------------------------------------------------------------+
+--------------------------------------------------------------------------------------+
| @@gtid_executed                                                                      |
+--------------------------------------------------------------------------------------+
| 4becca9a-822a-11ed-9ce4-aeb71ec961d2:1-510231,
4becd020-822a-11ed-9ce4-aeb71ec961d2:1-6 |
+--------------------------------------------------------------------------------------+
+--------------------------------------------------------------------------------------+
| @@gtid_executed                                                                      |
+--------------------------------------------------------------------------------------+
| 4becca9a-822a-11ed-9ce4-aeb71ec961d2:1-510231,
4becd020-822a-11ed-9ce4-aeb71ec961d2:1-6 |
+--------------------------------------------------------------------------------------+

Note that the MySQL server on pod 0 does not yet contain all the transactions executed on the primary (509125 is roughly 1,100 transactions behind 510231). Assuming there is no replication lag or other issue affecting replication recovery, the MySQL server will eventually fully recover and the Readiness probe will stop reporting errors.
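
To pinpoint exactly which transactions the recovering member is still missing, you can compare GTID sets with MySQL's GTID_SUBTRACT() function. In this sketch, replace the placeholder with the primary's @@gtid_executed value from the output above; an empty result means the member has fully caught up:

$ kubectl exec pod/mysql-sample-0 -c mysql-sidecar -- \
    mysqlsh --credential-store-helper=tanzumysql innodb-cluster-admin@localhost --sql --table \
      --execute="SELECT GTID_SUBTRACT('<primary gtid_executed>', @@gtid_executed) AS missing_transactions;"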

Diagnosis: MEMBER_STATE unreachable

+---------------------------------------------------------------+------------------+-------------+
| MEMBER_HOST                                                   | MEMBER_STATE     | MEMBER_ROLE |
+---------------------------------------------------------------+------------------+-------------+
| mysql-sample-0.mysql-sample-members.default.svc.cluster.local | UNREACHABLE      | SECONDARY   |
| mysql-sample-1.mysql-sample-members.default.svc.cluster.local | ONLINE           | SECONDARY   |
| mysql-sample-2.mysql-sample-members.default.svc.cluster.local | ONLINE           | PRIMARY     |
+---------------------------------------------------------------+------------------+-------------+

This table shows that the MySQL server on pod 0 is unreachable.

The MySQL servers are configured to use the following settings for recovery:

group_replication_unreachable_majority_timeout = 30
group_replication_autorejoin_tries             = 1
group_replication_exit_state_action            = ABORT_SERVER
group_replication_member_expel_timeout         = 5

For more details, refer to Failure Detection and Network Partitioning.

VMware SQL with MySQL for Kubernetes sets the unreachable majority timeout to 30 seconds, so a MySQL server that loses quorum exits the group after 30 seconds. The server sets itself to read-only and reports a MEMBER_STATE of ERROR in the performance_schema.replication_group_members table. Regardless of the configured exit state action, it stays in this state until its auto-rejoin attempts are exhausted, at which point it evaluates the exit state action. Because the exit state action is set to ABORT_SERVER, the mysql container is then terminated and recreated.
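
To confirm these values on a running member, you can query the group replication variables directly (a read-only check of the running server's configuration):

$ kubectl exec pod/mysql-sample-0 -c mysql-sidecar -- \
    mysqlsh --credential-store-helper=tanzumysql innodb-cluster-admin@localhost --sql --table \
      --execute="SHOW VARIABLES WHERE Variable_name IN ('group_replication_unreachable_majority_timeout', 'group_replication_autorejoin_tries', 'group_replication_exit_state_action', 'group_replication_member_expel_timeout');"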

Check the logs of another MySQL pod to confirm that the unreachable member is eventually expelled from the group.

$ kubectl logs pod/mysql-sample-2 -c mysql
...
Members removed from the group: mysql-sample-0.mysql-sample-members.default.svc.cluster.local:3306
Group membership changed to mysql-sample-1.mysql-sample-members.default.svc.cluster.local:3306, mysql-sample-2.mysql-sample-members.default.svc.cluster.local:3306 on view 16717353800071640:10.

Check the members table to confirm that the unreachable member has been removed.

$ kubectl exec pod/mysql-sample-2 -c mysql-sidecar -- \
    mysqlsh --credential-store-helper=tanzumysql innodb-cluster-admin@localhost --sql --table \
      --execute="select MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE from performance_schema.replication_group_members;"
+---------------------------------------------------------------+------------------+-------------+
| MEMBER_HOST                                                   | MEMBER_STATE     | MEMBER_ROLE |
+---------------------------------------------------------------+------------------+-------------+
| mysql-sample-1.mysql-sample-members.default.svc.cluster.local | ONLINE           | SECONDARY   |
| mysql-sample-2.mysql-sample-members.default.svc.cluster.local | ONLINE           | PRIMARY     |
+---------------------------------------------------------------+------------------+-------------+

The container on pod 0 will restart and, in most cases, the MySQL server will be able to rejoin the group.

If there is a network issue or other members of the group are also unhealthy, then the MySQL server may not automatically rejoin.

Check the view of the group.

$ kubectl exec pod/mysql-sample-2 -c mysql-sidecar -- \
    mysqlsh --credential-store-helper=tanzumysql innodb-cluster-admin@localhost --sql --table \
      --execute="select MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE from performance_schema.replication_group_members;"
+---------------------------------------------------------------+--------------+-------------+
| MEMBER_HOST                                                   | MEMBER_STATE | MEMBER_ROLE |
+---------------------------------------------------------------+--------------+-------------+
| mysql-sample-0.mysql-sample-members.default.svc.cluster.local | RECOVERING   | SECONDARY   |
| mysql-sample-1.mysql-sample-members.default.svc.cluster.local | ONLINE       | SECONDARY   |
| mysql-sample-2.mysql-sample-members.default.svc.cluster.local | ONLINE       | PRIMARY     |
+---------------------------------------------------------------+--------------+-------------+

Note that pod 0 has rejoined and is currently recovering. Re-run the query until recovery finishes. The Readiness probe will stop reporting errors once the member is ONLINE.
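
One way to follow recovery without re-running the query is to watch the data pod until all of its containers report READY, since the Readiness probe tracks member state:

$ kubectl get pod mysql-sample-0 -w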

Remedy

If the member continues to report UNREACHABLE and is not expelled from the group, group replication might not be running correctly on the pods.

Check the status of the Group Replication Applier on each MySQL server.

$ INSTANCE_NAME=mysql-sample
$ for i in 0 1 2; do kubectl exec $INSTANCE_NAME-$i -c mysql-sidecar -- \
    mysqlsh --credential-store-helper=tanzumysql innodb-cluster-admin@localhost --sql --table \
      --execute="SELECT * FROM information_schema.processlist WHERE INFO = 'Group replication applier module'"; done

+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
| ID | USER        | HOST | DB   | COMMAND | TIME | STATE                      | INFO                             | TIME_MS | ROWS_SENT | ROWS_EXAMINED |
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
| 20 | system user |      | NULL | Killed  | 6828 | waiting for handler commit | Group replication applier module | 6828439 |         0 |             0 |
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
| ID | USER        | HOST | DB   | COMMAND | TIME | STATE                      | INFO                             | TIME_MS | ROWS_SENT | ROWS_EXAMINED |
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
| 12 | system user |      | NULL | Connect | 6799 | waiting for handler commit | Group replication applier module | 6799254 |         0 |             0 |
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
| ID | USER        | HOST | DB   | COMMAND | TIME | STATE                      | INFO                             | TIME_MS | ROWS_SENT | ROWS_EXAMINED |
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+
| 22 | system user |      | NULL | Connect | 1172 | waiting for handler commit | Group replication applier module | 1172068 |         0 |             0 |
+----+-------------+------+------+---------+------+----------------------------+----------------------------------+---------+-----------+---------------+

If COMMAND shows Killed instead of Connect for the group replication applier module, then group replication is not running.
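
As an additional check, you can query the applier channel state directly; a SERVICE_STATE other than ON for the group_replication_applier channel also indicates that group replication is not running. This uses standard performance_schema tables:

$ kubectl exec pod/mysql-sample-0 -c mysql-sidecar -- \
    mysqlsh --credential-store-helper=tanzumysql innodb-cluster-admin@localhost --sql --table \
      --execute="SELECT CHANNEL_NAME, SERVICE_STATE FROM performance_schema.replication_applier_status;"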

Fetch the MySQL logs from the affected pod. Errors in the logs can help identify the cause.

$ kubectl logs pod/mysql-sample-0

If recovery is urgent, delete the pod so that it is recreated and the MySQL server restarts along with group replication.
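
For example, assuming the data pods are managed by a StatefulSet that recreates them automatically:

$ kubectl delete pod mysql-sample-0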
