Symptom:
Automated backups are failing for no obvious reason.
Potential Cause:
Validate that both IP addresses on the database VM are reachable. Also, compare the database VM uptime with the uptime of the systemd-networkd service running on the VM. If both IP addresses are not reachable and the VM uptime is considerably greater than the service uptime, the systemd-networkd service was likely restarted after VM boot.
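A hedged sketch of these checks from a shell; the two IP addresses below are placeholders for the VM's actual eth0 and eth1 addresses:
```
# Substitute the database VM's actual eth0 and eth1 addresses.
ping -c 3 192.0.2.10   # eth0 address
ping -c 3 192.0.2.20   # eth1 address

# Compare the VM's uptime with the service's start time.
uptime
systemctl show systemd-networkd --property=ActiveEnterTimestamp
```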
Explanation:
The Provider VM and database VM each have two network interfaces. eth0 must be the VM's default gateway, and source-based routing is required for eth1. The routing rules are removed when the systemd-networkd service is restarted. VMware Data Services Manager configures source-based routing during the VM's boot; however, it cannot detect a systemd-networkd service restart that is initiated after the VM has booted. If the service is restarted after boot, you must manually run a script to re-configure source-based routing.
Solution:
Log in to the database VM as the root user.
Manually run the VMware Data Services Manager script that configures source-based routing:
root@vm$ /opt/vmware/tdm-dbagent/bin/configure_src_based_routing.sh
Validate that automated backups return to functioning normally.
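To confirm the routing configuration directly rather than waiting for the next scheduled backup, you can inspect the policy routing rules; this is a hedged check, since the exact rule and table contents depend on your network configuration:
```
# Source-based routing appears as extra policy rules and a secondary
# routing table; exact contents vary by configuration.
ip rule list
ip route show table all
```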
Impact:
There is no impact on database uptime, nor on a client's ability to connect to a database.
Symptom:
The status of an Environment is stuck in Maintenance Mode, and all subsequent tasks related to the Environment are stuck as well.
Potential Cause:
You changed the default local or cloud storage of a Namespace that a Database VM uses, and then updated the local or cloud storage of the Database VM. The update task completes in the UI but remains stuck on the Agent VM.
Explanation:
Changing the default local or cloud storage of a Namespace that a Database VM uses, and then updating the Database VM, shows the update task as completed in the UI while it remains stuck in the backend on the Agent VM. Because this task never completes, the Environment stays in Maintenance Mode and is not updated.
Solution:
SSH into the Agent VM to which the updated Database VM belongs.
Run the following command:
psql -U postgres -d vmware -c "update vmware.tenant_task set status = 'SUCCESS', reconciled = 't' where tenant_task_type = 'CHANGE_BACKUP_STORAGE' and status = 'TENANT_TASK_QUEUED'"
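Before forcing the task status, you can confirm that a stuck row actually exists; this read-only check uses only the table and column names that appear in the command above:
```
psql -U postgres -d vmware -c "select * from vmware.tenant_task where tenant_task_type = 'CHANGE_BACKUP_STORAGE' and status = 'TENANT_TASK_QUEUED'"
```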
Impact:
While the Environment is stuck in Maintenance Mode and is not updated, all subsequent tasks related to the Environment remain stuck.
Symptom:
For a MySQL Primary database, the Replica database is not able to replicate from the Primary database.
Potential Cause:
If the Replica database is down, replication is stopped deliberately, or the Replica database is unable to connect to the Primary database for more than 4 hours, the Replica database can no longer replicate from the MySQL Primary database.
Explanation:
The MySQL purge binary logs command takes connected Replica databases into account: it does not remove the logs that a connected Replica is still reading, or any more recent logs that a connected Replica will need. A disconnected Replica is not protected in this way, so binary logs it still needs can be purged once the retention period elapses.
Solution:
VMware Data Services Manager sets the default binary log retention period to 4 hours. This ensures that the required binary logs on the Primary database are not purged for 4 hours after they are created. If you want to configure the binary log retention period to a value other than 4 hours, perform the following steps:
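The product-specific steps are not reproduced in this excerpt. As a rough, hedged illustration only: on a stock MySQL 8.0 server, binary log retention is governed by the binlog_expire_logs_seconds system variable, so a direct-access equivalent of a 4-hour retention period would look like the following (whether VMware Data Services Manager persists the setting this way is an assumption):
```
# Hedged illustration, not the documented product procedure.
# 4 hours = 14400 seconds.
mysql -u root -p -e "SET PERSIST binlog_expire_logs_seconds = 14400;"
mysql -u root -p -e "SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';"
```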
Impact:
If replication is down for more than 4 hours, the Replica database will not be able to catch up with the Primary database. In such scenarios, you need to delete the Replica database and create a new one.
Symptom:
When the OS of a MySQL Primary or Standalone database is upgraded from Release 1.0.2 to Release 1.1.0 of VMware Data Services Manager, the operation of creating a Replica database fails.
Potential Cause:
The path of the telegraf.d directory changed between Release 1.0.2 and Release 1.1.0 of VMware Data Services Manager. Therefore, the telegraf.d directory is missing at the new path for existing MySQL Primary or Standalone databases that were created in Release 1.0.2.
Explanation:
Because the telegraf.d directory path changed between releases, a database created in Release 1.0.2 does not have the directory at the location that Release 1.1.0 expects, and the Replica creation operation fails when the directory cannot be found.
Solution:
To configure the path of the telegraf.d directory for the Primary or Standalone database:
SSH to the node of the Primary database in the vSphere or VMC cluster.
Log in to the Provider VM.
Run the following command:
sudo mkdir /etc/telegraf/telegraf.d
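To verify the result, and, assuming the metrics agent runs as a systemd unit named telegraf (an assumption, not confirmed by this document), restart it so that it picks up the directory:
```
ls -ld /etc/telegraf/telegraf.d
# Assumption: the metrics agent's systemd unit is named 'telegraf'.
sudo systemctl restart telegraf
```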
Impact:
The operation of creating a Replica database fails for a MySQL Primary or Standalone database that has been upgraded from Release 1.0.2 to Release 1.1.0 of VMware Data Services Manager.
Symptom:
The Replication Status of the databases in the UI is Unknown despite the Status of the databases being Online.
Potential Cause:
If all MySQL databases in a cluster have been powered off and then powered on, Group Replication does not start automatically. In the UI, the Replication Status of the databases is Unknown even though their Status is Online.
Explanation:
This is a fail-safe mechanism of MySQL InnoDB Cluster and requires that you start Group Replication manually.
Solution:
SSH to the Primary database VM node as the db-admin user, and from there run the following command:
```
cluster-start
```
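cluster-start appears to be a VMware Data Services Manager wrapper script. Under the assumption that it wraps MySQL Shell's standard recovery call for a cluster that has suffered a complete outage, a stock equivalent would look like the following (the host and account are placeholders):
```
# Assumption: equivalent stock MySQL Shell call after a full outage.
mysqlsh dbadmin@primary-host -e "dba.rebootClusterFromCompleteOutage()"
```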
Impact:
Powering off all databases in a cluster and then powering them on causes the Replication Status of the databases in the UI to be Unknown even though their Status is Online.
Symptom:
During the update schedule (configured as part of the maintenance policy), the Monitor database of a PostgreSQL database cluster is updated but the Primary database is not updated.
Potential Cause:
The maintenance window of the PostgreSQL database cluster was updated after the creation of the Primary database, and a Monitor database creation was then attempted (whether it succeeded or failed) while there were no Replica databases in the cluster.
Explanation:
Auto promotion in a database cluster fails when there are no Replica databases in the cluster. Backups also fail, and the Primary database is not recognized as the Primary database of the cluster due to the absence of Replica databases.
Solution:
You can manually update the Primary database on demand.
Impact:
The Primary database in a PostgreSQL database cluster is not updated during the maintenance window that you have configured.
Symptom:
When a Provider Administrator, an Organization Administrator, or an Organization User recovers a Standalone or Primary database, the database goes into a Fatal state.
Potential Cause:
The Standalone or Primary database on which the recover operation is performed has an old OS version and has not been updated to Release 1.1.0.
Explanation:
VM admin user details have changed. Therefore, if a recover operation is triggered on a Standalone or Primary database whose OS version is earlier than Release 1.1.0, the database goes into a Fatal state.
Solution:
For seamless functioning of the recover operation, ensure that the database's OS is upgraded to the latest version.
Impact:
The recover operation fails when it is triggered on such a Standalone or Primary database, and the database goes into a Fatal state.
Symptom:
Creation of Replica databases of a database HA cluster fails when the Primary database of that cluster has a domain name that ends with .local.
Potential Cause:
By default, .local domain names are not resolved by DNS.
Explanation:
DNS resolution of database VMs with domain names that end with .local fails because the default configuration of the systemd-resolved service does not resolve domain names that end with .local. Therefore, when you create all the nodes of a database HA cluster with .local domain names at once, the creation of the Replica database fails.
Solution:
To resolve the DNS of database VMs with domain names ending with .local and to enable the creation of Replica databases of such database VMs, perform the following steps (a consolidated sketch follows the steps):
SSH into the Primary or Standalone database VM, and then modify /etc/systemd/resolved.conf in that VM.
Add or enable the following two parameters:
DNS=<Same as the DNS supplied via DHCP>
Domains=local
Restart the systemd-resolved.service on the Primary or Standalone database VM.
Create a Replica database of the Primary or Standalone database VM.
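A consolidated, hedged sketch of the steps above; 10.0.0.2 and db-primary.example.local are placeholders for the DHCP-supplied DNS server and a cluster node's name:
```
# Both keys must live in the [Resolve] section; the default
# resolved.conf contains only that section, so appending works.
sudo tee -a /etc/systemd/resolved.conf >/dev/null <<'EOF'
DNS=10.0.0.2
Domains=local
EOF
sudo systemctl restart systemd-resolved.service
# Verify that .local names now resolve (use systemd-resolve on older systemd).
resolvectl query db-primary.example.local
```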
Impact:
When you create all the nodes of a database HA cluster with domain names that end with .local at once, the creation of the Replica database fails.