Troubleshooting VMware SQL with MySQL for Tanzu Application Service

This topic provides you with basic instructions for troubleshooting on-demand VMware SQL with MySQL for Tanzu Application Service.

For information about temporary VMware SQL with MySQL for TAS service interruptions, see Service interruptions.

Troubleshoot errors

This section provides information on how to troubleshoot specific errors or error messages.

Common services errors

The following errors occur in multiple services:

Failed installation
Cannot create or delete service instances
Broker request timeouts
Instance does not exist
Cannot bind to or unbind from service instances
Cannot connect to a service instance
Upgrade all service instances errand fails
Missing logs and metrics
MySQL Load is high with large number of CredHub encryption keys

Failed Installation
Symptom	VMware SQL with MySQL for TAS fails to install.
Cause	Reasons for a failed installation include: Certificate issues: The on-demand broker (ODB) requires valid certificates. Deploy failure: The deployment can fail for a variety of reasons. Networking problems: Cloud Foundry cannot reach the VMware SQL with MySQL for TAS broker; Cloud Foundry cannot reach the service instances; or the service network cannot access the BOSH Director. The register broker errand fails. The smoke test errand fails. Resource sizing issues: The resource sizes selected for a given plan are smaller than VMware SQL with MySQL for TAS requires to function. Other service-specific issues.
Solution	To troubleshoot: Certificate issues: Ensure that your certificates are valid and generate new ones if necessary. To generate new certificates, contact Support. Deploy fails: View the logs using Tanzu Operations Manager to determine why the deploy is failing. Networking problems: For how to troubleshoot, see Networking problems. Register broker errand fails: For how to troubleshoot, see Register broker errand. Resource sizing issues: Check your resource configuration in Tanzu Operations Manager and ensure that the configuration matches that recommended by the service.

Cannot Create or Delete Service Instances
Symptom	Developers report errors such as: Instance provisioning failed: There was a problem completing your request. Contact your operations team providing the following information: service: redis-acceptance, service-instance-guid: ae9e232c-0bd5-4684-af27-1b08b0c70089, broker-request-id: 63da3a35-24aa-4183-aec6-db8294506bac, task-id: 442, operation: create
Cause	Reasons include: Problems with the deployment manifest Authentication errors Network errors Quota errors
Solution	To troubleshoot: If the BOSH error shows a problem with the deployment manifest, open the manifest in a text editor to inspect it. To continue troubleshooting, SSH into the BOSH Director VM and target the VMware SQL with MySQL for TAS instance using the instructions on parsing a Cloud Foundry error message. Retrieve the BOSH task ID from the error message and run `bosh task TASK-ID`. If you need more detail, access the broker logs and search them for the `broker-request-id` from the error message. Look for authentication errors, network errors, and quota errors.
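
The lookup of the task ID and request ID can be scripted. The following is a minimal sketch, assuming the error message keeps the comma-separated `key: value` layout shown in the symptom. The function name `broker_error_field` is hypothetical and is not part of the cf or BOSH CLIs.

```shell
# Extract one "key: value" field from a service broker error message.
# Usage: broker_error_field KEY "MESSAGE"
broker_error_field() {
  key=$1
  msg=$2
  # Match "KEY: VALUE", where VALUE runs until the next comma or space.
  printf '%s\n' "$msg" | sed -n "s/.*[ ,]${key}: *\([^, ]*\).*/\1/p"
}

# Error text copied from the symptom above.
msg='Contact your operations team providing the following information: service: redis-acceptance, service-instance-guid: ae9e232c-0bd5-4684-af27-1b08b0c70089, broker-request-id: 63da3a35-24aa-4183-aec6-db8294506bac, task-id: 442, operation: create'

task_id=$(broker_error_field task-id "$msg")
request_id=$(broker_error_field broker-request-id "$msg")
echo "BOSH task: $task_id"
echo "Broker request: $request_id"
# Next step on the BOSH Director VM: bosh task "$task_id"
```

The extracted task ID can then be passed to `bosh task`, and the request ID can be used when searching the broker logs.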

Broker Request Timeouts
Symptom	Developers report errors such as: Server error, status code: 504, error code: 10001, message: The request to the service broker timed out: https://BROKER-URL/v2/service_instances/e34046d3-2379-40d0-a318-d54fc7a5b13f/service_bindings/aa635a3b-ef6d-41c3-a23f-55752f3f651b
Cause	Cloud Foundry might not be connected to the service broker, or there might be a large number of queued tasks.
Solution	To troubleshoot: Confirm that Cloud Foundry (CF) is connected to the service broker. Check the BOSH queue size: Log in to BOSH as an admin, then run `bosh tasks`. If there are a large number of queued tasks, the system might be overloaded. BOSH is configured with two workers and one status worker, which might not be sufficient for the level of load. If the task queue is long, advise app developers to try again when the system is under less load.
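
To put a number on the queue size, you can count the queued rows in the task list. This is a sketch only: it assumes that the task state is the second whitespace-separated column of the `bosh tasks` table, which might differ in your CLI version. The function name `count_queued_tasks` and the sample rows are made up for illustration.

```shell
# Count rows whose second column is "queued" in task-list text read from stdin.
count_queued_tasks() {
  awk '$2 == "queued" { n++ } END { print n + 0 }'
}

# Made-up sample rows standing in for real `bosh tasks` output.
count_queued_tasks <<'EOF'
442  processing  admin  service-instance_ae9e232c  create deployment
443  queued      admin  service-instance_8d69de6c  create deployment
444  queued      admin  service-instance_15f4f87e  create deployment
EOF
# Against a live Director: bosh tasks | count_queued_tasks
```

For the sample rows, the function prints `2`.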

Instance Does Not Exist
Symptom	Developers report errors such as: Server error, status code: 502, error code: 10001, message: Service broker error: instance does not exist
Cause	The instance might have been deleted.
Solution	To troubleshoot: Confirm that the VMware SQL with MySQL for TAS instance exists in BOSH. First, obtain the GUID from CF by running: `cf service MY-INSTANCE --guid` Then, using that GUID, run: `bosh -d service-instance_GUID vms` If the BOSH deployment is not found, it has been deleted from BOSH. Contact Support for further assistance.
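
The two commands can be chained so that the GUID does not need to be copied by hand. The following sketch assumes the `service-instance_GUID` deployment naming shown above. The helper name `deployment_for_guid` is hypothetical.

```shell
# Build the BOSH deployment name for a service instance GUID.
deployment_for_guid() {
  guid=$1
  if [ -z "$guid" ]; then
    echo "error: empty service instance GUID" >&2
    return 1
  fi
  printf 'service-instance_%s\n' "$guid"
}

deployment_for_guid ae9e232c-0bd5-4684-af27-1b08b0c70089
# Possible use (MY-INSTANCE is a placeholder):
#   guid=$(cf service MY-INSTANCE --guid)
#   bosh -d "$(deployment_for_guid "$guid")" vms
```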

Cannot Bind to or Unbind from Service Instances
Symptom	Developers report errors such as: Server error, status code: 502, error code: 10001, message: Service broker error: There was a problem completing your request. Please contact your operations team providing the following information: service: example-service, service-instance-guid: 8d69de6c-88c6-4283-b8bc-1c46103714e2, broker-request-id: 15f4f87e-200a-4b1a-b76c-1c4b6597c2e1, operation: bind
Cause	This might be due to authentication or network errors.
Solution	To find out the exact issue with the binding process: Access the service broker logs. Search the logs for the `broker-request-id` string listed in the error message. Check for: Authentication errors Network errors Contact Support for further assistance if you are unable to resolve the problem.
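
After you have downloaded the broker logs, the search for the request ID can be done with `grep`. This is a minimal sketch. The function name `find_broker_request` and the log path in the usage comment are hypothetical.

```shell
# Print every line of a broker log file that mentions a given
# broker-request-id, prefixed with its line number.
# -F treats the ID as a literal string rather than a pattern.
find_broker_request() {
  log_file=$1
  request_id=$2
  grep -n -F -- "$request_id" "$log_file"
}
# Possible use (the log path is a placeholder):
#   find_broker_request /path/to/broker.log 15f4f87e-200a-4b1a-b76c-1c4b6597c2e1
```

Read the matching lines and the lines around them for authentication or network errors.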

Cannot Connect to a Service Instance
Symptom	Developers report that their app cannot use service instances that they have successfully created and bound.
Cause	The error might originate from the service or be network related.
Solution	To solve this issue, ask the user to send application logs that show the connection error. If the error originates from the service, follow VMware SQL with MySQL for TAS-specific instructions. If the issue appears to be network-related: Check that application security groups are configured correctly. Access can be configured for the service network that the tile is deployed to. Ensure that the network the TAS for VMs tile is deployed to has network access to the service network. You can find the network definition for this service network in the BOSH Director tile. In Ops Manager, open the service tile and note the service network that is configured in the Networks tab. In Ops Manager, open the TAS for VMs tile and note the network it is assigned to. Make sure that these networks can access each other.

Service instances can also become temporarily inaccessible during upgrades and VM or network failures. See Service interruptions for more information.

Upgrade All Service Instances Errand Fails
Symptom	The `upgrade-all-service-instances` errand fails.
Cause	There might be a problem with a particular instance.
Solution	To troubleshoot: Look at the errand output in the Ops Manager log. If an instance has failed to upgrade, debug and fix it before running the errand again to prevent any failure issues from spreading to other on-demand instances. After the Ops Manager log no longer lists the deployment as `failing`, re-run the errand to upgrade the rest of the instances.

Missing Logs and Metrics
Symptom	No logs are being emitted by the on-demand broker.
Cause	Syslog might not be configured correctly, or you might have network access issues.
Solution	To troubleshoot: Ensure you have configured syslog for the tile. Check that your syslog forwarding address is correct in Ops Manager. Ensure that you have network connectivity between the networks that the tile is using and the syslog destination. If the destination is external, you need to use the public IP VM extension feature available in your Ops Manager tile configuration settings. Verify that Loggregator is emitting metrics: Install the `cf log-cache` plug-in. For instructions, see the Log Cache CLI Plug-in GitHub repository. Find logs from your service instance by running `cf tail -f SERVICE_INSTANCE`. If no metrics appear within five minutes, verify that the broker network has access to the Loggregator system on all required ports. If you are unable to resolve the issue, contact Support.

MySQL Load is High with Large Number of CredHub Encryption Keys
Symptom	MySQL load is high and CredHub queries are slow.
Cause	There is a large number of CredHub encryption keys.
Solution	For information about resolving this issue, see Cleaning up Tanzu Application Service Credhub.

Leader-Follower Service Instance Errors

This section provides solutions for the following errors:

Unable to determine leader and follower
Both leader and follower instances are writable
Both leader and follower instances are read-only

Unable to Determine Leader and Follower
Symptom	This problem happens when the `configure-leader-follower` errand fails because it cannot determine the VM roles. The `configure-leader-follower` errand exits with `1` and the errand logs contain the following: $ Unable to determine leader and follower based on transaction history.
Cause	Something has happened to the instances, such as a failure or manual intervention. As a result, there is not enough information available to determine the correct state and topology without operator intervention to resolve the issue.
Solution	Use the `inspect` errand to determine which instance can be the leader. Then, using the orchestration errands and backup and restore, put the service instance into a safe topology and rerun the `configure-leader-follower` errand. The following example shows one outcome that the `inspect` errand can return.

1. Use the `inspect` errand to retrieve relevant information about the two VMs:

    $ bosh -e my-env -d my-dep run-errand inspect
    [...]
    Instance   mysql/4ecad54b-0704-47eb-8eef-eb228cab9724
    Exit Code  0
    Stdout     -
    Stderr     2017/12/11 18:25:54 Started executing command: inspect
               2017/12/11 18:25:54 Started GET https://127.0.0.1:8443/status
               2017/12/11 18:25:54 Has Data: false
               Read Only: true
               GTID Executed: 1d774323-de9e-11e7-be01-42010a001014:1-25
               Replication Configured: false

    Instance   mysql/e0b94ade-0114-4d49-a929-ce1616d8beda
    Exit Code  0
    Stdout     -
    Stderr     2017/12/11 18:25:54 Started executing command: inspect
               2017/12/11 18:25:54 Started GET https://127.0.0.1:8443/status
               2017/12/11 18:25:54 Has Data: true
               Read Only: true
               GTID Executed: 1d774323-de9e-11e7-be01-42010a001014:1-25
               Replication Configured: true

    2 errand(s)
    Succeeded

   In this scenario, the first instance is missing data and does not have replication configured. The second instance has data and has replication configured. The following steps resolve this by copying data to the first instance and resuming replication.

2. Take a backup of the second instance using the Create a VMware SQL with MySQL for TAS Logical Backup steps.

3. Restore the backup artifact to the first instance using the Restore from a VMware SQL with MySQL for TAS Logical Backup steps. At this point, the instances have equivalent data.

4. Run the `configure-leader-follower` errand to reconfigure replication:

    bosh -e ENVIRONMENT -d DEPLOYMENT \
      run-errand configure-leader-follower \
      --instance=mysql/GUID-OF-LEADER

   For example:

    $ bosh -e my-env -d my-dep \
        run-errand configure-leader-follower \
        --instance=mysql/4ecad54b-0704-47eb-8eef-eb228cab9724
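
When the `inspect` output is long, it can help to condense it to one line per instance. The following sketch assumes that each field (`Has Data`, `Read Only`, `Replication Configured`) appears on its own line of the real errand output, and that each instance block starts with an `Instance mysql/...` line. The function name `summarize_inspect` is hypothetical.

```shell
# Condense `inspect` errand output (read from stdin) to one line per instance.
summarize_inspect() {
  awk '
    function emit() {
      if (inst != "") print inst, "has_data=" has, "read_only=" ro, "replication=" repl
    }
    /Instance[ \t]+mysql\//   { emit(); inst = $NF; has = ro = repl = "?" }
    /Has Data:/               { has = $NF }
    /Read Only:/              { ro = $NF }
    /Replication Configured:/ { repl = $NF }
    END { emit() }
  '
}

# Sample based on the errand output in the example above.
summarize_inspect <<'EOF'
Instance   mysql/4ecad54b-0704-47eb-8eef-eb228cab9724
Stderr     2017/12/11 18:25:54 Has Data: false
           Read Only: true
           Replication Configured: false
Instance   mysql/e0b94ade-0114-4d49-a929-ce1616d8beda
Stderr     2017/12/11 18:25:54 Has Data: true
           Read Only: true
           Replication Configured: true
EOF
# Against a live deployment:
#   bosh -e ENVIRONMENT -d DEPLOYMENT run-errand inspect 2>&1 | summarize_inspect
```

Any field that the sketch cannot find is reported as `?`, which is a cue to read the full errand output instead.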

Both Leader and Follower Instances are Writable
Symptom	This problem happens when the `configure-leader-follower` errand fails because both VMs are writable and the VMs might hold differing data. The `configure-leader-follower` errand exits with `1` and the errand logs contain the following: $ Both mysql instances are writable. Please ensure no divergent data and set one instance to read-only mode.
Cause	VMware SQL with MySQL for TAS tries to ensure that there is only one writable instance of the leader-follower pair at any given time. However, in certain situations, such as network partitions or manual intervention outside of the provided BOSH errands, it is possible for both instances to be writable. The service instances remain in this state until an operator resolves the issue to ensure that the correct instance is promoted and to reduce the potential for data divergence.
Solution	To resolve this issue:

1. Use the `inspect` errand to retrieve the GTID Executed set for each VM:

    $ bosh -e my-env -d my-dep run-errand inspect
    [...]
    Instance   mysql/4ecad54b-0704-47eb-8eef-eb228cab9724
    Exit Code  0
    Stdout     -
    Stderr     2017/12/11 18:25:54 Started executing command: inspect
               2017/12/11 18:25:54 Started GET https://127.0.0.1:8443/status
               2017/12/11 18:25:54 Has Data: true
               Read Only: false
               GTID Executed: 1d774323-de9e-11e7-be01-42010a001014:1-23
               Replication Configured: false

    Instance   mysql/e0b94ade-0114-4d49-a929-ce1616d8beda
    Exit Code  0
    Stdout     -
    Stderr     2017/12/11 18:25:54 Started executing command: inspect
               2017/12/11 18:25:54 Started GET https://127.0.0.1:8443/status
               2017/12/11 18:25:54 Has Data: true
               Read Only: false
               GTID Executed: 1d774323-de9e-11e7-be01-42010a001014:1-25
               Replication Configured: false

    2 errand(s)
    Succeeded

   If the GTID Executed sets for both instances are the same, or one set is a subset of the other, continue to Step 2. If the sets have diverged, continue to Step 4.

2. Look at the value of GTID Executed for both instances. If the range after the GUID is equivalent, either instance can be made read-only, as described in Step 3. If one instance has a range that is a subset of the other, the instance with the subset must be made read-only, as described in Step 3.

3. Based on the information you gathered in the previous step, run the `make-read-only` errand to make the appropriate instance read-only:

    bosh -e ENVIRONMENT -d DEPLOYMENT \
      run-errand make-read-only \
      --instance=mysql/MYSQL-SUBSET-INSTANCE

   For example:

    $ bosh -e my-env -d my-dep \
        run-errand make-read-only \
        --instance=mysql/e0b94ade-0114-4d49-a929-ce1616d8beda
    [...]
    succeeded

4. If the GTID Executed sets are neither equivalent nor subsets, data has diverged. Determine what data has diverged as part of the following procedure:

   1. Use the `make-read-only` errand to set both instances to read-only to prevent further data divergence:

        bosh -e ENVIRONMENT -d DEPLOYMENT \
          run-errand make-read-only \
          --instance=mysql/MYSQL-INSTANCE

      For example:

        $ bosh -e my-env -d my-dep \
            run-errand make-read-only \
            --instance=mysql/e0b94ade-0114-4d49-a929-ce1616d8beda
        [...]
        succeeded

   2. Take a backup of both instances using the Create a VMware SQL with MySQL for TAS Logical Backup steps.

   3. Manually inspect the data on each instance to determine the discrepancies, and put the data on the instance that is further ahead. This instance has the higher GTID Executed set and is the new leader. Migrate all appropriate data to the new leader instance.

   4. After putting all data on the leader, SSH onto the follower:

        bosh -e ENVIRONMENT -d DEPLOYMENT ssh mysql/GUID-OF-FOLLOWER

      For example:

        $ bosh -e my-env -d my-dep ssh mysql/e0b94ade-0114-4d49-a929-ce1616d8beda

   5. Become root with the command `sudo su`.

   6. Stop the mysql process with the command `monit stop mysql`.

   7. Delete the data directory of the follower with the command `rm -rf /var/vcap/store/mysql`.

   8. Start the mysql process with the command `monit start mysql`.

   9. Use the `configure-leader-follower` errand to copy the leader data to the follower and resume replication:

        bosh -e ENVIRONMENT -d DEPLOYMENT \
          run-errand configure-leader-follower \
          --instance=mysql/GUID-OF-LEADER

      For example:

        $ bosh -e my-env -d my-dep \
            run-errand configure-leader-follower \
            --instance=mysql/4ecad54b-0704-47eb-8eef-eb228cab9724
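
The comparison of the two GTID Executed values can be sketched as a small helper. It handles only the simple case shown in the example output: both sets share one UUID and consist of a single range that starts at 1, such as `UUID:1-23`. Anything else (different UUIDs, several ranges, gaps) is reported as `needs-manual-review` rather than guessed at. The function name `compare_gtid_sets` is hypothetical.

```shell
# Compare two GTID Executed values of the form UUID:1-N.
# Prints one of: equal | first-is-subset | second-is-subset | needs-manual-review
compare_gtid_sets() {
  uuid_a=${1%%:*}; range_a=${1#*:}
  uuid_b=${2%%:*}; range_b=${2#*:}
  start_a=${range_a%%-*}; end_a=${range_a##*-}
  start_b=${range_b%%-*}; end_b=${range_b##*-}
  # Several UUIDs or several ranges introduce ":" or ","; treat those as out of scope.
  case "$range_a$range_b" in
    *[!0-9-]*) echo needs-manual-review; return 0 ;;
  esac
  if [ "$uuid_a" != "$uuid_b" ] || [ "$start_a" != 1 ] || [ "$start_b" != 1 ] ||
     [ -z "$end_a" ] || [ -z "$end_b" ]; then
    echo needs-manual-review; return 0
  fi
  if [ "$end_a" -eq "$end_b" ]; then
    echo equal
  elif [ "$end_a" -lt "$end_b" ]; then
    echo first-is-subset
  else
    echo second-is-subset
  fi
}

# Values taken from the example inspect output above.
compare_gtid_sets \
  1d774323-de9e-11e7-be01-42010a001014:1-23 \
  1d774323-de9e-11e7-be01-42010a001014:1-25
```

For the example values, the helper prints `first-is-subset`. A result of `needs-manual-review` does not by itself mean that data has diverged. It only means that the sets are outside what this sketch can compare.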

Both Leader and Follower Instances are Read-Only
Symptom	Developers report that apps cannot write to the database. In a leader-follower topology, the leader VM is writable and the follower VM is read-only. However, if both VMs are read-only, apps cannot write to the database.
Cause	This problem happens if the leader VM fails and the BOSH Resurrector is activated. When the leader is resurrected, it is set as read-only.
Solution	To resolve this issue:

1. Use the `inspect` errand to confirm that both VMs are in a read-only state:

    bosh -e ENVIRONMENT -d DEPLOYMENT run-errand inspect

2. Examine the output and locate the information about the leader-follower VMware SQL with MySQL for TAS VMs:

    Instance   mysql/4eexample54b-0704-47eb-8eef-eb2example724
    Exit Code  0
    Stdout     -
    Stderr     2017/12/11 18:25:54 Started executing command: inspect
               2017/12/11 18:25:54 Started GET https://999.0.0.1:8443/status
               2017/12/11 18:25:54 Has Data: true
               Read Only: true
               GTID Executed: 1d779999-de9e-11e7-be01-42010a009999:1-23
               Replication Configured: true

    Instance   mysql/e0exampleade-0114-4d49-a929-cexample8beda
    Exit Code  0
    Stdout     -
    Stderr     2017/12/11 18:25:54 Started executing command: inspect
               2017/12/11 18:25:54 Started GET https://999.0.0.1:8443/status
               2017/12/11 18:25:54 Has Data: true
               Read Only: true
               GTID Executed: 1d779999-de9e-11e7-be01-42010a009999:1-25
               Replication Configured: false

    2 errand(s)
    Succeeded

3. If Read Only is set to `true` for both VMs, make the leader writable using the following command:

    bosh -e ENVIRONMENT -d DEPLOYMENT \
      run-errand configure-leader-follower \
      --instance=mysql/GUID-OF-LEADER

   For example, if the second instance is the leader:

    $ bosh -e my-env -d my-dep \
        run-errand configure-leader-follower \
        --instance=mysql/e0exampleade-0114-4d49-a929-cexample8beda

Inoperable app and database errors

This section provides a solution for the following errors:

Persistent Disk is Full
Cannot Access Database Table

Persistent Disk is Full
Symptom	Developers report that read, write, and Cloud Foundry Command-Line Interface (cf CLI) operations do not work, and apps become inoperable. Developers cannot upgrade to a larger VMware SQL with MySQL for TAS service plan to free up disk space.
Cause	This problem happens if your persistent disk is full. When you use the BOSH CLI to target your deployment, you see that instances are at 100% persistent disk usage. You can increase available disk space by deleting log files. After deleting logs, you can upgrade to a larger VMware SQL with MySQL for TAS service plan. You can also turn off binary logging before developers do large data uploads or if their databases have a high transaction volume.
Solution	To resolve this issue, do one of the following: If your persistent disk is already full, delete binary logs. See MySQL for Tanzu Application Service hangs when server VM persistent disk is full. Caution: Deleting binary logs is a destructive procedure and can result in MySQL data loss. Only do this procedure with the assistance of Support. If binary logs take up most of your persistent disk but the disk is not yet full, turn off binary logging. See Turn off Binary Logs Filling up the Persistent Disk.

Cannot Access Database Table
Symptom	When you query an existing table, you see an error similar to the following: ERROR 1146 (42S02): Table 'mysql.foobar' doesn't exist
Cause	This error occurs if you created an uppercase table name and then activated lowercase table names. You activate lowercase table names either by: Setting the optional `enable_lower_case_table_names` parameter to `true` with the cf CLI. For more information about the parameter, see Lowercase table names. Selecting Enable Lower Case Table Names in the Mysql Configuration pane of the tile. For more information about this configuration, see Configure MySQL.
Solution	To resolve this issue: Deactivate lowercase table names by doing one of the following: Set the optional `enable_lower_case_table_names` parameter to `false` with the cf CLI. For instructions, see Set Optional Parameters. Deactivate lowercase table names in the tile: Deselect Enable Lower Case Table Names in the Mysql Configuration pane of the tile. Then go to the Ops Manager Installation Dashboard, click Review Pending Changes, and then click Apply Changes. (Optional) If you want to activate lowercase table names again, rename your table to lowercase and then activate lowercase table names.
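
For the cf CLI route, the parameter is passed as JSON. The following sketch builds that JSON and rejects anything other than `true` or `false`. It assumes that the parameter is supplied through the `-c` flag of `cf update-service`, as described in Set Optional Parameters. The helper name `lower_case_params` is hypothetical.

```shell
# Build the JSON parameters for the enable_lower_case_table_names option.
# Accepts only "true" or "false".
lower_case_params() {
  case "$1" in
    true|false) printf '{"enable_lower_case_table_names": %s}\n' "$1" ;;
    *) echo "usage: lower_case_params true|false" >&2; return 1 ;;
  esac
}

lower_case_params false
# Possible use (MY-INSTANCE is a placeholder):
#   cf update-service MY-INSTANCE -c "$(lower_case_params false)"
```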

Highly available cluster errors

This section provides solutions for the following errors:

Unresponsive Node in a Highly Available Cluster
Many Replication Errors in Logs for Highly Available Clusters

Unresponsive Node in a Highly Available Cluster
Symptom	A client connected to a VMware SQL with MySQL for TAS cluster node reports the following error: WSREP has not yet prepared this node for application use Some clients might instead return the following: unknown error
Cause	If the client is connected to a VMware SQL with MySQL for TAS cluster node and that node loses connection to the rest of the cluster, the node stops accepting writes. If the connection to this node is made through the proxy, the proxy automatically re-routes further connections to a different node.
Solution	A node can become unresponsive for a number of reasons. For solutions, see the following:
- Network Latency: If network latency causes a node to become unresponsive, the node drops but eventually rejoins. The node automatically rejoins only if one node has left the cluster. Consult your IaaS network settings to reduce your network latency.
- MySQL Process Failure: If the MySQL process fails, `monit`, and then BOSH, attempt to restore the process. If the process is not restored, use `bosh logs` to retrieve logs from the failing database or mysql jobs, and inspect the error logs returned. For more information, see the Downloading logs section.
- Firewall Rule Change: If your firewall rules change, the change might prevent a node from communicating with the rest of the cluster. This causes the node to become unresponsive. In this case, the logs show the node leaving the cluster but do not show network latency errors. To confirm that the node is unresponsive because of a firewall rule change, SSH from a responsive node to the unresponsive node. If you cannot connect, the node is unresponsive due to a firewall rule change. Change your firewall rules to allow the unresponsive node to rejoin the cluster.
- VM Failure: If you cannot SSH into a node and you are not detecting either network latency or firewall issues, your node might be down due to VM failure. To confirm that the node is unresponsive and recreate the VM, see Recreate a Corrupted VM in a Highly Available.
- Node Unable to Rejoin: If a detached existing node fails to join the cluster, its `sequence_number` might be higher than those of the nodes with quorum. A higher `sequence_number` on the detached node indicates that it has recent changes to the data that the primary component lacks. You can verify this by looking at the node’s error log at `/var/vcap/sys/log/pxc-mysql/mysql.err.log`. To restore the cluster, complete one of the following:
  - If the detached node has a higher sequence number than the primary component, do the procedures in Bootstrapping.
  - If bootstrapping does not restore the cluster, you can manually force the node to rejoin the cluster. This removes all of the unsynchronized data from the detached server node and creates a new copy of the cluster data on the node. For more information, see Force a node to rejoin a highly available cluster manually. Caution: Forcing a node to rejoin the cluster is a destructive procedure. Only do this procedure with the assistance of Support.

Many Replication Errors in Logs for Highly Available Clusters

Symptom

You see many replication errors in the MySQL logs, like the following:

    160318 9:25:16 [Warning] WSREP: RBR event 1 Query apply warning: 1, 16992456
    160318 9:25:16 [Warning] WSREP: Ignoring error for TO isolated action: source: abcd1234-abcd-1234-abcd-1234abcd1234 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 246804 trx_id: -1 seqnos (l: 865022, g: 16992456, s: 16992455, d: 16992455, ts: 2530660989030983)
    160318 9:25:16 [ERROR] Slave SQL: Error 'Duplicate column name 'number'' on query. Default database: 'cf_0123456_1234_abcd_1234_abcd1234abcd'. Query: 'ALTER TABLE ...'

Cause

This problem happens when there are errors in SQL statements.

Solution

For solutions for the replication errors in MySQL log files, see the following table:

Additional Error	Solution
`ALTER TABLE` errors	Fix the `ALTER TABLE` error. This error can occur when an app issues an invalid data definition statement. Other nodes log this problem as a replication error because they fail to replicate the `ALTER TABLE`.

If you see replication errors but no `ALTER TABLE` errors and no persistent disk or memory issues, you can ignore the replication errors.

Failed backups

If an automated backup or a backup initiated from the ApplicationDataBackupRestore (adbr) plug-in fails, verify that port 2345 from the TAS for VMs component to the ODB component is open.

Automated Backups or adbr Plug-in Backups Fail
Symptom	The following are true: The backup fails. The adbr-api logs for the broker show: backup failed with response: 502 Bad Gateway: Registered endpoint failed to handle the request. The gorouter logs on the TAS for VMs deployment show: adbr-api.SYSTEM-DOMAIN - [2021-01-20T19:30:00.911080271Z] "POST /service_instances/acb85c98-151e-4f13-9f0f-de057ef18d67/backup HTTP/1.1" 502 ... x_cf_routererror:"endpoint_failure" ... Where `SYSTEM-DOMAIN` is your system domain.
Cause	Port 2345, which allows communication between the TAS for VMs and ODB components, is closed.
Solution	Open port 2345 from the TAS for VMs component to the ODB component. See Required networking rules for VMware SQL with MySQL for TAS in On-Demand Networking.

Troubleshoot components

This section provides guidance on checking for and fixing issues in on-demand service components.

BOSH problems

Large BOSH queue

On-demand service brokers add tasks to the BOSH request queue, which can back up and cause delays under heavy load. An app developer who requests a new VMware SQL with MySQL for TAS instance sees `create in progress` in the Cloud Foundry Command Line Interface (cf CLI) until BOSH processes the queued request.

Ops Manager currently deploys two BOSH workers to process its queue. Users of future versions of Ops Manager can configure the number of BOSH workers.

Configuration

Service instances in failing state

The VM or Disk type that you configured in the plan page of the tile in Ops Manager might not be large enough for the VMware SQL with MySQL for TAS service instance to start. See tile-specific guidance on resource requirements.

Authentication

UAA changes

If you have rotated any UAA user credentials then you might see authentication issues in the service broker logs.

To resolve this, redeploy the VMware SQL with MySQL for TAS tile in Ops Manager. This provides the broker with the latest configuration.

You must ensure that any changes to UAA credentials are reflected in the Tanzu Operations Manager credentials tab of the VMware Tanzu Application Service for VMs tile.

Networking

Common issues with networking include:

Issue	Solution
Latency when connecting to the VMware SQL with MySQL for TAS service instance to create or delete a binding.	Try again or improve network performance.
Firewall rules are blocking connections from the VMware SQL with MySQL for TAS service broker to the service instance.	Open the VMware SQL with MySQL for TAS tile in Ops Manager and check the two networks configured in the Networks pane. Ensure that these networks allow access to each other.
Firewall rules are blocking connections from the service network to the BOSH director network.	Ensure that service instances can access the Director so that the BOSH agents can report in.
Apps cannot access the service network.	Configure Cloud Foundry application security groups to allow runtime access to the service network.
Problems accessing BOSH’s UAA or the BOSH director.	Follow network troubleshooting and check that the BOSH director is online.

Validate service broker connectivity to service instances

To validate connectivity, do the following:

View the BOSH deployment name for your service broker by running:
bosh deployments
SSH into the Tanzu SQL for VMs service broker by running:
bosh -d DEPLOYMENT-NAME ssh
If no BOSH task-id appears in the error message, look in the broker log using the `broker-request-id` from the task.
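Once you are on the broker VM, a basic TCP check can show whether a firewall rule is blocking the path to a service instance. The following is a minimal sketch, not part of the product tooling. It assumes that Python 3 is available on the VM and that the service instance listens on the default MySQL port 3306; the function name `can_connect` is hypothetical.

```python
import socket


def can_connect(host, port=3306, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout.

    Port 3306 is an assumption (the default MySQL port); adjust it if your
    service instances listen elsewhere.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A result of False from the broker VM, combined with a healthy service instance, points toward the firewall and network issues listed in the table above.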

Validate app access to service instance

Use cf ssh to access the app container, then try connecting to the VMware SQL with MySQL for TAS service instance using the binding included in the VCAP_SERVICES environment variable.
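To find the connection details, you can read the binding out of VCAP_SERVICES, which is a JSON object keyed by service label. The following is a minimal sketch under stated assumptions: the service label `p.mysql` and the credential field names used in the example (`hostname`, `port`, `username`) are assumptions, so check them against the actual value in your app container. The function name is hypothetical.

```python
import json


def mysql_credentials(vcap_services_json, label="p.mysql"):
    """Return the credentials block of every binding under the given label.

    The default label "p.mysql" is an assumption; pass the label that
    appears in your own VCAP_SERVICES value.
    """
    services = json.loads(vcap_services_json)
    return [binding.get("credentials", {}) for binding in services.get(label, [])]
```

Inside the app container, you might call it as `mysql_credentials(os.environ["VCAP_SERVICES"])` after importing `os`, then pass the returned host, port, and user to your MySQL client.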

Quotas

Plan quota issues

If developers report errors such as:

Message: Service broker error: The quota for this service plan has been exceeded.
Please contact your Operator for help.

To resolve the error, do the following:

Check your current plan quota.
Increase the plan quota:
1. Log in to Ops Manager.
2. Reconfigure the quota on the plan page.
3. Deploy the tile.
Find out who is using the plan quota and take the appropriate action.

Global quota issues

If developers report errors such as:

Message: Service broker error: The quota for this service has been exceeded.
Please contact your Operator for help.

To resolve the error, do the following:

Check your current global quota.
Increase the global quota:
1. Log in to Ops Manager.
2. Reconfigure the quota on the on-demand settings page.
3. Deploy the tile.
Find out who is using the quota and take the appropriate action.

Failing jobs and unhealthy instances

To determine whether there is an issue with the VMware SQL with MySQL for TAS deployment:

Inspect the VMs by running:
bosh -d service-instance_GUID vms --vitals
For additional information, run:
bosh -d service-instance_GUID instances --ps --vitals
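If you check many service instances, you might prefer to filter the output programmatically. The following is a rough sketch that assumes you add the BOSH CLI global `--json` flag to the commands above and that the output has a top-level `Tables` list whose `Rows` contain `instance` and `process_state` keys. That shape is an assumption; verify it against the output of your CLI version. The function name is hypothetical.

```python
import json


def unhealthy_instances(bosh_vms_json):
    """Return (instance, process_state) pairs whose state is not "running".

    Assumes the JSON shape {"Tables": [{"Rows": [{"instance": ...,
    "process_state": ...}]}]}; confirm this against your BOSH CLI output.
    """
    data = json.loads(bosh_vms_json)
    unhealthy = []
    for table in data.get("Tables") or []:
        for row in table.get("Rows") or []:
            state = row.get("process_state", "")
            if state != "running":
                unhealthy.append((row.get("instance", ""), state))
    return unhealthy
```

An empty list suggests that every VM reports a running state. Anything else is a starting point for the service-specific guidance, not a prompt for corrective BOSH actions.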

If the VM is failing, follow the service-specific information. Any unadvised corrective actions (such as running BOSH restart on a VM) can cause issues in the service instance.

A failing process or failing VM might come back automatically after a temporary service outage. See VM process failure and VM failure.

AZ or region failure

Failures at the IaaS level, such as Availability Zone (AZ) or region failures, can interrupt service and require manual restoration. See AZ failure and Region failure.

Techniques for troubleshooting

This section provides instructions for interacting with the on-demand service broker and on-demand service instance BOSH deployments, and for performing general maintenance and housekeeping tasks.

Parse a Cloud Foundry error message

Failed operations (create, update, bind, unbind, delete) result in an error message. You can retrieve the error message later by running the cf CLI command cf service INSTANCE-NAME.

$ cf service myservice

Service instance: myservice
Service: super-db
Bound apps:
Tags:
Plan: dedicated-vm
Description: Dedicated Instance
Documentation url:
Dashboard:

Last Operation
Status: create failed
Message: Instance provisioning failed: There was a problem completing your request.
     Please contact your operations team providing the following information:
     service: redis-acceptance,
     service-instance-guid: ae9e232c-0bd5-4684-af27-1b08b0c70089,
     broker-request-id: 63da3a35-24aa-4183-aec6-db8294506bac,
     task-id: 442,
     operation: create
Started: 2017-03-13T10:16:55Z
Updated: 2017-03-13T10:17:58Z

Use the information in the Message field to debug further. Provide this information to Support when filing a ticket.

The task-id field maps to the BOSH task ID. For more information on a failed BOSH task, run bosh task TASK-ID.

The broker-request-id field maps to the portion of the on-demand broker log that contains the failed step. Access the broker log through your syslog aggregator, or access BOSH logs for the broker by running bosh -d DEPLOYMENT-NAME logs broker/0. If you have more than one broker instance, repeat this process for each instance.
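When you handle many failed operations, it can help to pull the identifiers out of the Message field automatically. The following is a minimal sketch that assumes the message uses the `key: value,` layout shown in the example above. The function and pattern names are hypothetical.

```python
import re

# Matches the identifier fields shown in the example Message above.
FIELD_PATTERN = re.compile(
    r"(service-instance-guid|broker-request-id|task-id|operation):\s*([^,\s]+)"
)


def parse_broker_error(message):
    """Extract the identifier fields from a failed-operation message."""
    return {key: value for key, value in FIELD_PATTERN.findall(message)}
```

The `task-id` value can then be passed to bosh task, and the `broker-request-id` value can be used to search the broker log.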

Access broker and instance logs and VMs

Before following these procedures, log in to the cf CLI and the BOSH CLI.

Access Broker Logs and VMs

You can access logs using Ops Manager by clicking on the Logs tab in the tile and downloading the broker logs.

To access logs using the BOSH CLI, do the following:

Identify the on-demand broker (ODB) deployment by running the following command:
bosh deployments
View VMs in the deployment by running the following command:
bosh -d DEPLOYMENT-NAME instances
SSH onto the VM by running the following command:
bosh -d DEPLOYMENT-NAME ssh
Download the broker logs by running the following command:
bosh -d DEPLOYMENT-NAME logs

The archive generated by BOSH includes the following logs:

Log Name	Description
broker.stdout.log	Requests to the on-demand broker and the actions the broker performs while orchestrating the request (e.g. generating a manifest and calling BOSH). Start here when troubleshooting.
bpm.log	Control script logs for starting and stopping the on-demand broker.
post-start.stderr.log	Errors that occur during post-start verification.
post-start.stdout.log	Post-start verification.
drain.stderr.log	Errors that occur while running the drain script.

Access service instance logs and VMs

To target an individual service instance deployment, retrieve the GUID of your service instance with the following cf CLI command:
cf service MY-SERVICE --guid
To view VMs in the deployment, run the following command:
bosh -d service-instance_GUID instances
To SSH into a VM, run the following command:
bosh -d service-instance_GUID ssh
To download the instance logs, run the following command:
bosh -d service-instance_GUID logs

Run service broker errands to manage brokers and instances

From the BOSH CLI, you can run service broker errands that manage the service brokers and perform mass operations on the service instances that the brokers created. These service broker errands include:

register-broker registers a broker with the Cloud Controller and lists it in the Marketplace.
deregister-broker deregisters a broker with the Cloud Controller and removes it from the Marketplace.
upgrade-all-service-instances upgrades existing instances of a service to its latest installed version.
delete-all-service-instances deletes all instances of a service.
orphan-deployments detects “orphan” instances that are running on BOSH but not registered with the Cloud Controller.

To run an errand, run the following command:

bosh -d DEPLOYMENT-NAME run-errand ERRAND-NAME

For example:

bosh -d my-deployment run-errand deregister-broker

Register broker

The register-broker errand does the following:

Registers the service broker with Cloud Controller.
Activates service access for any plans that are activated on the tile.
Deactivates service access for any plans that are deactivated on the tile.
Does nothing for any plans that are set to manual on the tile.

You can run this errand whenever the broker is redeployed with new catalog metadata to update the Marketplace.

Plans with deactivated service access are only visible to admin Cloud Foundry users. Non-admin Cloud Foundry users, including Org Managers and Space Managers, cannot see these plans.

Deregister broker

This errand deregisters a broker from Cloud Foundry.

The errand does the following:

Deletes the service broker from Cloud Controller
Fails if there are any service instances, with or without bindings

Use the Delete All Service Instances errand to delete any existing service instances.

To run the errand, run the following command:

bosh -d DEPLOYMENT-NAME run-errand deregister-broker

Upgrade all service instances

The upgrade-all-service-instances errand does the following:

Collects all of the service instances that the on-demand broker has registered.
Issues an upgrade command and deploys a new manifest to the on-demand broker for each service instance.
Adds to a retry list any instances that have ongoing BOSH tasks at the time of upgrade.
Retries any instances in the retry list until all instances are upgraded.

When you make changes to the plan configuration, the errand upgrades all the VMware SQL with MySQL for TAS service instances to the latest version of the plan.

If any instance fails to upgrade, the errand fails immediately. This prevents systemic problems from spreading to the rest of your service instances.
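The retry behavior described above can be illustrated with a short simulation. This is a sketch of the described flow only, not the errand's actual implementation. The `upgrade` callable, its return values (`"ok"`, `"busy"`, `"failed"`), and the `max_rounds` limit are all hypothetical and exist only to keep the example self-contained and terminating.

```python
def upgrade_all(instances, upgrade, max_rounds=10):
    """Upgrade every instance, retrying busy ones and failing fast on errors.

    upgrade(guid) is a hypothetical callable that returns "ok", "busy"
    (an ongoing BOSH task blocks the upgrade), or "failed".
    """
    pending = list(instances)
    upgraded = []
    for _ in range(max_rounds):
        retry = []
        for guid in pending:
            result = upgrade(guid)
            if result == "ok":
                upgraded.append(guid)
            elif result == "busy":
                # Another task is running on this instance; try again later.
                retry.append(guid)
            else:
                # Fail immediately so a systemic problem does not spread.
                raise RuntimeError("upgrade failed for " + guid)
        if not retry:
            return upgraded
        pending = retry
    raise RuntimeError("instances still busy after %d rounds" % max_rounds)
```

The fail-fast branch mirrors the behavior described above: a single failed upgrade stops the run before the remaining instances are touched.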

Delete all service instances

This errand uses the Cloud Controller API to delete all instances of your broker’s service offering in every Cloud Foundry org and space. It only deletes instances the Cloud Controller knows about. It does not delete orphan BOSH deployments.

Orphan BOSH deployments do not correspond to a known service instance. While rare, orphan deployments can occur. Use the orphan-deployments errand to identify them.

The delete-all-service-instances errand does the following:

Unbinds all apps from the service instances.
Deletes all service instances sequentially. Each service instance deletion includes:
1. Running any pre-delete errands
2. Deleting the BOSH deployment of the service instance
3. Removing any ODB-managed secrets from BOSH CredHub
4. Checking for instance deletion failure, which results in the errand failing immediately
Determines whether any instances have been created while the errand was running. If new instances are detected, the errand returns an error. In this case, VMware recommends running the errand again.

Use extreme caution when running this errand. Use it only when you want to totally destroy all of the on-demand service instances in an environment.

To run the errand, run the following command:

bosh -d DEPLOYMENT-NAME run-errand delete-all-service-instances

Detect orphaned service instances

A service instance is defined as “orphaned” when the BOSH deployment for the instance is still running, but the service is no longer registered in Cloud Foundry.

The orphan-deployments errand collates a list of service deployments that have no matching service instances in Cloud Foundry and returns the list to the operator. It is then up to the operator to remove the orphaned BOSH deployments.

To run the errand, run the following command:

bosh -d DEPLOYMENT-NAME run-errand orphan-deployments

If orphan deployments exist, the errand script does the following:

Exits with exit code 10
Outputs a list of deployment names under a [stdout] header
Provides a detailed error message under a [stderr] header

For example:

[stdout]
[{"deployment_name":"service-instance_80e3c5a7-80be-49f0-8512-44840f3c4d1b"}]

[stderr]
Orphan BOSH deployments detected with no corresponding service instance in Cloud Foundry. Before deleting any deployment it is recommended to verify the service instance no longer exists in Cloud Foundry and any data is safe to delete.

Errand 'orphan-deployments' completed with error (exit code 10)

These details are also available through the BOSH /tasks/ API endpoint for use in scripting:

$ curl 'https://bosh-user:bosh-password@bosh-url:25555/tasks/task-id/output?type=result' | jq .
{
  "exit_code": 10,
  "stdout": "[{\"deployment_name\":\"service-instance_80e3c5a7-80be-49f0-8512-44840f3c4d1b\"}]\n",
  "stderr": "Orphan BOSH deployments detected with no corresponding service instance in Cloud Foundry. Before deleting any deployment it is recommended to verify the service instance no longer exists in Cloud Foundry and any data is safe to delete.\n",
  "logs": {
    "blobstore_id": "d830c4bf-8086-4bc2-8c1d-54d3a3c6d88d"
  }
}

If no orphan deployments exist, the errand script does the following:

Exits with exit code 0
Outputs an empty list of deployments under a [stdout] header
Outputs None under a [stderr] header

[stdout]
[]

[stderr]
None

Errand 'orphan-deployments' completed successfully (exit code 0)

If the errand encounters an error while running, the errand script does the following:

Exits with exit code 1
Outputs nothing under a [stdout] header
Outputs any error messages under a [stderr] header
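For scripting, the three outcomes above can be handled from the task result JSON returned by the BOSH tasks endpoint. The following is a minimal sketch that assumes the result has the `exit_code`, `stdout`, and `stderr` fields shown in the earlier example, with `stdout` holding a JSON-encoded list. The function name is hypothetical.

```python
import json


def orphan_deployments(task_result_json):
    """Return orphan deployment names from an orphan-deployments task result.

    Exit code 0 means no orphans, 10 means orphans were found, and any
    other code is treated as an errand failure.
    """
    result = json.loads(task_result_json)
    code = result["exit_code"]
    if code == 0:
        return []
    if code == 10:
        # stdout is itself a JSON document: a list of deployment objects.
        return [item["deployment_name"] for item in json.loads(result["stdout"])]
    raise RuntimeError(
        "orphan-deployments failed (exit code %s): %s" % (code, result.get("stderr", ""))
    )
```

Treat the returned names as candidates only. Verify that each service instance no longer exists in Cloud Foundry before deleting anything.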

To clean up orphaned instances, run the following command on each instance:

bosh -d service-instance_SERVICE-INSTANCE-GUID delete-deployment

Running this command might leave IaaS resources in an unusable state.

View resource saturation and scaling

To view usage statistics for any service, do the following:

Run the following command:
bosh -d DEPLOYMENT-NAME vms --vitals
To view process-level information, run:
bosh -d DEPLOYMENT-NAME instances --ps

Identify apps using a service instance

To identify which apps are using a specific service instance from the name of the BOSH deployment:

Take the deployment name and strip the service-instance_ prefix, which leaves the GUID.
Log in to CF as an admin.
Obtain a list of all service bindings by running the following: cf curl /v2/service_instances/GUID/service_bindings
The output from the curl gives you a list of resources, with each item referencing a service binding, which contains the APP-URL. To find the name, org, and space for the app, run the following:
1. cf curl APP-URL and record the app name under entity.name.
2. cf curl SPACE-URL to obtain the space, using the entity.space_url from the curl. Record the space name under entity.name.
3. cf curl ORGANIZATION-URL to obtain the org, using the entity.organization_url from the curl. Record the organization name under entity.name.

When you run cf curl, ensure that you query all pages, because the responses are limited to a certain number of bindings per page. The default is 50. To find the next page, curl the value under next_url.
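The paging loop can be sketched as follows. This is an illustration under stated assumptions: each page is assumed to contain a `resources` list whose items expose `entity.app_url`, plus a `next_url` that is empty on the last page. The helper names `cf_curl` and `collect_app_urls` are hypothetical, and `cf_curl` assumes that the cf CLI is installed and logged in.

```python
import json
import subprocess


def cf_curl(path):
    """Run cf curl for a path and decode the JSON response."""
    done = subprocess.run(
        ["cf", "curl", path], check=True, capture_output=True, text=True
    )
    return json.loads(done.stdout)


def collect_app_urls(first_path, fetch=cf_curl):
    """Follow next_url until it is empty, collecting every binding's app URL."""
    urls = []
    path = first_path
    while path:
        page = fetch(path)
        for resource in page.get("resources", []):
            urls.append(resource["entity"]["app_url"])
        path = page.get("next_url")
    return urls
```

Passing `fetch` as a parameter keeps the loop testable without a live Cloud Foundry environment.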

Monitor quota saturation and service instance count

Quota saturation and total number of service instances are available through ODB metrics emitted to Loggregator. The metric names are shown in the following table:

Metric Name	Description
`on-demand-broker/SERVICE-NAME-MARKETPLACE/quota_remaining`	global quota remaining for all instances across all plans
`on-demand-broker/SERVICE-NAME-MARKETPLACE/PLAN-NAME/quota_remaining`	quota remaining for a particular plan
`on-demand-broker/SERVICE-NAME-MARKETPLACE/total_instances`	total instances created across all plans
`on-demand-broker/SERVICE-NAME-MARKETPLACE/PLAN-NAME/total_instances`	total instances created for a given plan

Quota metrics are not emitted if no quota has been set.
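To turn these metrics into a single saturation figure, you can combine the instance count with the remaining quota. The following sketch assumes that the configured quota equals total_instances plus quota_remaining; that relationship is an assumption for illustration, not something the metrics themselves state. The function name is hypothetical.

```python
def quota_saturation(total_instances, quota_remaining):
    """Return the fraction of quota in use, between 0.0 and 1.0.

    Assumes quota = total_instances + quota_remaining, which is an
    illustrative assumption about how the two metrics relate.
    """
    if total_instances < 0 or quota_remaining < 0:
        raise ValueError("metric values must not be negative")
    quota = total_instances + quota_remaining
    if quota == 0:
        # A quota of zero leaves no room for new instances.
        return 1.0
    return total_instances / quota
```

For example, 8 instances with 2 remaining gives a saturation of 0.8, which you might use as an alert threshold before developers start seeing quota errors.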

Techniques for troubleshooting highly available clusters

If your cluster is experiencing downtime or in a degraded state, VMware recommends gathering information to diagnose the type of failure the cluster is experiencing with the following workflow:

Consult solutions for common errors. See Highly Available Cluster Troubleshooting Errors.
Use mysql-diag to view a summary of the network, disk, and replication state of each cluster node. Depending on the output from mysql-diag, you might recover your cluster with the following troubleshooting techniques:
- To force a node to rejoin the cluster, see Force a node to rejoin a highly available cluster manually.
- To recreate a corrupted VM, see Recreate a corrupted VM in a highly available cluster.
- To check if replication is working, see Check replication status in a highly available cluster.
For more information about mysql-diag, see Running mysql-diag.

Do not attempt to resolve cluster issues by reconfiguring the cluster, such as changing the number of nodes or networks. Only follow the diagnosis steps in this document. If you are unsure how to proceed, contact Support.

Force a node to rejoin a highly available cluster manually

If a detached node fails to rejoin the cluster after a configured grace period, you can manually force the node to rejoin the cluster. This procedure removes all the data on the node, forces the node to join the cluster, and creates a new copy of the cluster data on the node.

If you manually force a node to rejoin the cluster, data stored on the local node is lost. Do not force nodes to rejoin the cluster if you want to preserve unsynchronized data. Only do this procedure with the assistance of Support.

Before following this procedure, try to bootstrap the cluster. For more information, see Bootstrapping.

To manually force a node to rejoin the cluster, do the following:

SSH into the node by following the procedure in SSH Into the BOSH Director VM.
Become root by running: sudo su
Shut down the mysqld process on the node by running: monit stop galera-init
Remove the unsynchronized data on the node by running: rm -rf /var/vcap/store/pxc-mysql
Prepare the node before restarting by running: /var/vcap/jobs/pxc-mysql/bin/pre-start
Restart the mysqld process by running: monit start galera-init

Recreate a corrupted VM in a highly available cluster

To re-create a corrupted VM:

Log in to the BOSH Director VM by doing the following procedures:
1. Gather the information needed to log in to the BOSH Director VM by doing the procedure in Gather Credential and IP Address Information.
2. Log in to the Ops Manager VM by doing the procedure in Log in to the Ops Manager VM with SSH.
3. Log in to the BOSH Director VM by doing the procedure in SSH Into the BOSH Director VM.
Identify and re-create the unresponsive node with bosh cloudcheck by doing the procedure in BOSH Cloud Check and running Recreate VM using last known apply spec.
Recreating a node clears the logs. Ensure the node is completely down before recreating it.

Only recreate one node. Do not recreate the entire cluster. If more than one node is down, contact Support.

Check replication status in a highly available cluster

If you see stale data in your cluster, you can check whether replication is functioning normally.

To check the replication status, do the following:

To log in to the BOSH Director VM, do the following:
1. Gather the information needed to log in to the BOSH Director VM by using the procedure in [Gather credential and IP Address information](https://docs.vmware.com/en/VMware-Tanzu-Operations-Manager/3.0/vmware-tanzu-ops-manager/install-trouble-advanced.html).
2. Log in to the Ops Manager VM by doing the procedure in [Log in to the Ops Manager VM with SSH](https://docs.vmware.com/en/VMware-Tanzu-Operations-Manager/3.0/vmware-tanzu-ops-manager/install-ssh-login.html).
Create a dummy database in the first node by running:
mysql -h FIRST-NODE-IP-ADDRESS \
  -u YOUR-IDENTITY \
  -p -e "create database verify_healthy;"
Where:
- FIRST-NODE-IP-ADDRESS is the IP address of the first node you recorded in step 1.
- YOUR-IDENTITY is the value of identity that you recorded in step 1.
Create a dummy table in the dummy database by running:
mysql -h FIRST-NODE-IP-ADDRESS \
  -u YOUR-IDENTITY \
  -p -D verify_healthy \
  -e "create table dummy_table (id int not null primary key auto_increment, info text) engine='innodb';"
Insert data into the dummy table by running:
mysql -h FIRST-NODE-IP-ADDRESS \
  -u YOUR-IDENTITY \
  -p -D verify_healthy \
  -e "insert into dummy_table(info) values ('dummy data'),('more dummy data'),('even more dummy data');"
Query the table and verify that the three rows of dummy data exist on the first node by running:
mysql -h FIRST-NODE-IP-ADDRESS \
  -u YOUR-IDENTITY \
  -p -D verify_healthy \
  -e "select * from dummy_table;"
When prompted for a password, provide the `password` value recorded in step 1. The previous command returns output similar to the following:
```
+----+----------------------+
| id | info                 |
+----+----------------------+
|  4 | dummy data           |
|  7 | more dummy data      |
| 10 | even more dummy data |
+----+----------------------+
```
Verify that the other nodes contain the same dummy data by doing the following for each of the remaining MySQL server IP addresses:
1. Query the dummy table by running:
  mysql -h NEXT-NODE-IP-ADDRESS \
    -u YOUR-IDENTITY \
    -p -D verify_healthy \
    -e "select * from dummy_table;"
  When prompted for a password, provide the `password` value recorded in step 1.
2. Verify that the node contains the same three rows of dummy data as the other nodes. The previous command returns output similar to the following:
```
+----+----------------------+
| id | info                 |
+----+----------------------+
|  4 | dummy data           |
|  7 | more dummy data      |
| 10 | even more dummy data |
+----+----------------------+
```
If any MySQL server instance returns a different result, contact [Support](https://tanzu.vmware.com/support) before proceeding further or making any changes to your deployment.
If each MySQL server instance returns the same result, then you can safely proceed to scaling down your cluster to a single node.
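If you capture the query results from each node, the comparison can be automated. The following is a minimal sketch that assumes you have already collected the rows of dummy_table from every node into a dictionary keyed by node IP address. How you collect them (for example, with a MySQL client library) is outside this sketch, and the function name is hypothetical.

```python
def nodes_out_of_sync(rows_by_node, reference_node):
    """Return the nodes whose rows differ from the reference node's rows.

    rows_by_node maps a node address to a list of (id, info) tuples.
    Row order is ignored by sorting before comparison.
    """
    reference = sorted(rows_by_node[reference_node])
    return sorted(
        node for node, rows in rows_by_node.items() if sorted(rows) != reference
    )
```

An empty result means that every node returned the same rows as the first node. Any node in the result is a reason to contact Support before changing the deployment.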

Tools for Troubleshooting

The troubleshooting techniques use these tools.

Downloading logs

The following are steps to gather logs from your MySQL cluster nodes, MySQL proxies, and, with highly available clusters, the jumpbox VM.

From Ops Manager, open your BOSH Director tile > Credentials tab.
Next to Bosh Commandline Credentials, click Link to Credential. A short plaintext file opens.
From the plaintext file, record the values listed:
- BOSH_CLIENT
- BOSH_CLIENT_SECRET
- BOSH_CA_CERT
- BOSH_ENVIRONMENT
From the BOSH CLI, run bosh deployments and record the name of the BOSH deployment that deployed VMware Tanzu for MySQL.
SSH into your Ops Manager VM. For information about how to do this, see Gather Credential and IP Address Information and SSH into Ops Manager.
Set local environment variables to the same BOSH variable values that you recorded earlier, including BOSH_DEPLOYMENT for the deployment name you recorded above. For example:
```
$ export BOSH_CLIENT=ops_manager \
  BOSH_CLIENT_SECRET=a123bc-E_4Ke3fb-gImbl3xw4a7meW0rY \
  BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate \
  BOSH_ENVIRONMENT=10.0.0.5 \
  BOSH_DEPLOYMENT=pivotal-mysql-14c4
```
If you connect to your BOSH director through a gateway, you also need to set variables BOSH_GW_HOST, BOSH_GW_USER, and BOSH_GW_PRIVATE_KEY.

Use the bosh logs command to retrieve logs for any instances in your deployment that are named database or prefixed with mysql (such as mysql-jumpbox).

The following lines show one way to perform this:

tempdir="$(mktemp -d -t MYSQLLOGS-XXXXXX)"
echo Saving logfiles to "${tempdir}"
# Download logs for every instance named database or prefixed with mysql.
for node in $(bosh instances --column="Instance" | grep -E "(database|mysql.*)/"); do
  echo -e "\nDownloading logs for: ${node}"
  bosh logs --dir="${tempdir}" "${node}"
done
# Archive from inside the temporary directory so only the downloaded logs are bundled.
(cd "${tempdir}" && tar czf mysql-logs.tar.gz ./*)
echo Bundled logfiles are in "${tempdir}/mysql-logs.tar.gz"

For more information, see the bosh logs documentation.

Download the retrieved logfiles to your local workstation for inspection. You can use bosh scp from your local workstation to retrieve files from a BOSH VM. For more information, see the bosh scp documentation.

mysql-diag

mysql-diag outputs the current status of a highly available (HA) MySQL cluster in VMware Tanzu for MySQL and suggests recovery actions if the cluster fails. For more information, see Running mysql-diag.

Knowledge Base (Community)

Find the answer to your question and browse product discussions and solutions by searching the VMware Tanzu Knowledge Base.
