Automated Backups Fail

Symptom:

Automated backups are failing for no obvious reason.

Potential Cause:

Validate that both IP addresses on the database VM are reachable. Also, compare the database VM uptime with the uptime of the systemd-networkd service running on the VM. If both IP addresses are not reachable and the VM uptime is considerably greater than the service uptime, the systemd-networkd service was likely restarted after VM boot.
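
For example, a minimal way to run these checks from a shell, with the two database VM IP addresses shown as placeholders, is:

``` shell
# Check reachability of both database VM addresses (placeholders shown).
ping -c 3 <eth0 ip address>
ping -c 3 <eth1 ip address>

# On the database VM itself, compare the VM boot time with the time the
# systemd-networkd service last became active.
uptime -s
systemctl show systemd-networkd --property=ActiveEnterTimestamp
```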

Explanation:

The Provider VM and database VM each have two network interfaces. eth0 must be the VM's default gateway. Source-based routing is required for eth1.

Routing rules are removed when the systemd-networkd service is restarted. Data Management for VMware Tanzu takes care of configuring source-based routing during the VM's boot. However, Data Management for VMware Tanzu cannot detect a systemd-networkd service restart that is initiated after the VM is booted. If the service happens to be restarted after boot, you must manually run a script to re-configure source-based routing.
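
Before running the script described below, you can check whether the source-based routing configuration is still present on the database VM. This is a hedged inspection sketch, since the exact routing table used for eth1 is installation specific:

``` shell
# Policy routing rules; the source-based rules for eth1 appear here when
# source-based routing is configured.
ip rule show

# Routes in all routing tables, including any table dedicated to eth1.
ip route show table all
```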

Solution:

  1. Log in to the database VM as the root user.

  2. Manually run the Data Management for VMware Tanzu script that configures source-based routing:

    root@vm$ /opt/vmware/tdm-dbagent/bin/configure_src_based_routing.sh
    
  3. Validate that automated backups return to functioning normally.

Impact:

There is no impact on database uptime, nor on a client's ability to connect to a database.

Replication Stops

Symptom:

For a MySQL Primary database, the Replica database is not able to replicate from the Primary database.

Potential Cause:

For a MySQL Primary database, if the Replica database is down, if replication is stopped voluntarily, or if the Replica database is unable to connect to the Primary database for more than 4 hours, then the Replica database will not be able to replicate from the Primary database.

Explanation:

The MySQL PURGE BINARY LOGS command takes into account the binary logs that connected Replica databases are currently reading; it does not remove those logs, or any more recent logs, that the connected Replicas still need.
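
As an illustration, you can inspect the binary logs on the Primary and the Replica's read position with standard MySQL statements; the account below is a placeholder for a user with the REPLICATION CLIENT privilege:

``` shell
# On the Primary: list the binary log files that are still available.
mysql -u <admin user> -p -e "SHOW BINARY LOGS;"

# On the Replica: show which Primary binary log file it is currently reading
# (use SHOW SLAVE STATUS on MySQL versions earlier than 8.0.22).
mysql -u <admin user> -p -e "SHOW REPLICA STATUS\G"
```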

Solution:

Data Management for VMware Tanzu sets the default binary log retention period to 4 hours. This ensures that the required binary logs from the Primary database are not purged for 4 hours after they are created. If you want to configure the binary log retention period to a value other than 4 hours, perform the following steps:

  1. SSH to the Primary database node in the vSphere or VMC cluster.
  2. In the file /opt/vmware/dbaas/adapter/settings.json, set the binlog_retention_hrs property to the required value, as shown in the sketch after this list.
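
The structure of settings.json is not reproduced here, so the following is a hypothetical sketch only; it assumes a top-level binlog_retention_hrs key and that the jq utility is available on the node (otherwise edit the file with a text editor):

``` shell
# Back up the settings file, then set the retention period to 8 hours
# (example value). The key path is an assumption; match it to your file.
sudo cp /opt/vmware/dbaas/adapter/settings.json /opt/vmware/dbaas/adapter/settings.json.bak
sudo jq '.binlog_retention_hrs = 8' /opt/vmware/dbaas/adapter/settings.json.bak \
  | sudo tee /opt/vmware/dbaas/adapter/settings.json > /dev/null
```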

Impact:

If replication is down for more than 4 hours, the Replica database will not be able to catch up with the Primary database. In such scenarios, you need to delete the Replica database and create a new one.

Creation of Replica Fails

Symptom:

If the OS of a MySQL Primary or Standalone database is upgraded from Release 1.0.2 to Release 1.1.0 of Data Management for VMware Tanzu, the operation to create a Replica database fails.

Potential Cause:

The path of the telegraf.d directory changed from Release 1.0.2 to Release 1.1.0 of Data Management for VMware Tanzu. As a result, the telegraf.d directory is missing at its new path on existing MySQL Primary or Standalone databases that were created in Release 1.0.2 of Data Management for VMware Tanzu.

Explanation:

Because the telegraf.d directory does not exist at the path that Release 1.1.0 expects (/etc/telegraf/telegraf.d) on databases created in Release 1.0.2, the operation to create a Replica database cannot complete.

Solution:

To configure the path of the telegraf.d directory for the Primary or Standalone database:

  1. SSH to the Primary or Standalone database node in the vSphere or VMC cluster.

  2. Log in to the Provider VM.

  3. Run the following command (a quick verification check follows these steps):

    sudo mkdir /etc/telegraf/telegraf.d
    
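
After running the command, a quick check confirms that the directory now exists at the path that Release 1.1.0 expects:

``` shell
# Confirm the telegraf.d directory is present at its new location.
ls -ld /etc/telegraf/telegraf.d
```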

Impact:

The operation to create a Replica database fails for a MySQL Primary or Standalone database that is upgraded from Release 1.0.2 to Release 1.1.0 of Data Management for VMware Tanzu.

Replication Status of Databases in a Cluster is Unknown

Symptom:

If nodes in a MySQL database cluster are unhealthy and you power off all the databases in the cluster and then power them on one by one, the Replication Status of the databases in the UI is Unknown even though the Status of the databases is Online.

Potential Cause:

When nodes in a MySQL database cluster are unhealthy and you power off all the databases in the cluster and then power them on one by one, the Replication Status of the databases in the UI is Unknown despite the Status of the databases being Online.

Explanation:

Unhealthy nodes in a database cluster can hamper the smooth functioning of the cluster.

Solution:

Perform a manual restart of the database cluster remotely from a system where MySQL client tools are installed by running the following command:

``` shell
dba reboot-cluster-from-complete-outage --rejoinInstances="<node1 db fqdn>:3306","<node2 db fqdn>:3306","<node3 db fqdn>:3306"
```
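
Assuming the cluster is a MySQL InnoDB Cluster using Group Replication (which the command above implies), one way to confirm afterwards that all nodes have rejoined the group is a query such as the following; the account shown is a placeholder:

``` shell
# Placeholder account and FQDN; any node in the cluster can be queried.
mysql -h <node1 db fqdn> -P 3306 -u <admin user> -p \
  -e "SELECT member_host, member_state FROM performance_schema.replication_group_members;"
```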

Impact:

With unhealthy nodes in a cluster, powering off the databases in the cluster and then powering them on one by one causes the Replication Status of the databases in the UI to be Unknown despite the Status of the databases being Online.

Primary Database in a Cluster is not Updated During Update Schedule

Symptom:

During the update schedule (configured as part of the maintenance policy), the Monitor database of a PostgreSQL database cluster is updated but the Primary database is not updated.

Potential Cause:

This can occur when the maintenance window of a PostgreSQL database cluster is updated after the Primary database is created, and the creation of a Monitor database then succeeds or fails while there are no Replica databases in the cluster.

Explanation:

Auto promotion in a database cluster fails when there are no Replica databases in the cluster. Backups also fail, and the Primary database is not recognized as the Primary database of the cluster due to the absence of Replica databases.

Solution:

You can manually update the Primary database on demand.

Impact:

The Primary database in a PostgreSQL database cluster is not updated during the maintenance window that you have configured.

Recover Operation Fails

Symptom:

When a Provider Administrator, an Organization Administrator, or an Organization User recovers a Standalone or Primary database, the database goes into a Fatal state.

Potential Cause:

The Standalone or Primary database on which the recover operation is performed has an old OS version and has not been updated to Release 1.1.0.

Explanation:

VM admin user details have changed. Therefore, if a recover operation is triggered on a Standalone or Primary database that has a version of OS that is earlier than Release 1.1.0, the database goes into a Fatal state.

Solution:

For seamless functioning of the recover operation, ensure that the OS of the database is upgraded to the latest version.

Impact:

The recover operation fails if it is triggered on a Standalone or Primary database whose OS version is earlier than Release 1.1.0, and the database goes into a Fatal state.
