Troubleshooting problems in your Tanzu Operations Manager deployment

You can use the following procedures if you encounter issues while deploying products with VMware Tanzu Operations Manager.

Retrying the deployment

An installation or update can fail for many reasons, but the system is self-healing and can often correct or work around hardware or network faults automatically.

Retrying can allow the system to resolve a problem on its own. To retry, click Install or Review Pending Changes, then click Apply Changes again.

Some failures produce only generic errors, such as Exited with 1. When a failure is not accompanied by useful information, retry the deployment in the same way.

When the system does provide informative evidence, check Common issues to see whether your problem is covered.

Common issues

Compare the evidence that you have gathered to the following descriptions. If your issue is covered, try the recommended remediation procedure.

BOSH does not reinstall

You might want to reinstall BOSH for troubleshooting purposes. However, if Tanzu Operations Manager does not detect any changes, BOSH does not reinstall. To force a reinstall of BOSH, select BOSH Director, then Resource Sizes and change a resource value. For example, you can increase the amount of RAM by 4 MB.

Creating bound missing VMs times out

This task happens immediately following package compilation, but before job assignment to agents. For example:

cloud_controller/0: Timed out pinging to f690db09-876c-475e-865f-2cece06aba79 after 600 seconds (00:10:24)

This is most likely a NATS issue with the VM in question. To identify a NATS issue, inspect the agent log for the VM. Because the BOSH Director cannot reach the BOSH agent, you must access the VM by another method. If you cannot reach the VM over TCP, use your virtualization console.
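When you are scanning a long deployment log for this failure, a small parser can pull out the affected instance and the identifier that could not be pinged. The following is a minimal sketch that assumes the message has the same shape as the example above; the function and field names are illustrative and are not part of any BOSH tooling.

```python
import re

# Pattern for the timeout message shown above; the group names are illustrative.
TIMEOUT_RE = re.compile(
    r"^(?P<instance>\S+): Timed out pinging to (?P<target>[0-9a-f-]+) "
    r"after (?P<seconds>\d+) seconds"
)


def parse_ping_timeout(line):
    """Return the instance, pinged ID, and timeout from a timeout line, or None."""
    match = TIMEOUT_RE.match(line.strip())
    if match is None:
        return None
    return {
        "instance": match.group("instance"),
        "target": match.group("target"),
        "seconds": int(match.group("seconds")),
    }


# The sample line is copied from the example above.
sample = (
    "cloud_controller/0: Timed out pinging to "
    "f690db09-876c-475e-865f-2cece06aba79 after 600 seconds (00:10:24)"
)
print(parse_ping_timeout(sample))
```

Lines that do not match the pattern, such as a generic Exited with 1 error, return None.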

To diagnose:

Go to the Credentials tab of the VMware Tanzu Application Service for VMs (TAS for VMs) tile and locate the VM in question to find the VM credentials.
Access the VM using your virtualization console and log in with those credentials.
Become root.
Run cd /var/vcap/bosh/log.
Open the file current.

First, determine whether the BOSH agent and director have successfully completed a handshake, represented in the logs as a “ping-pong”:

2013-10-03_14:35:48.58456 #[608] INFO: Message: {"method"=>"ping", "arguments"=>[], "reply_to"=>"director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab"}

2013-10-03_14:35:48.60182 #[608] INFO: reply_to:   director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab: payload: {:value=>"pong"}

This handshake must complete for the agent to receive instructions from the director.
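If the log is long, you can search for the handshake programmatically instead of reading the file by eye. The following is a minimal sketch, assuming the ping and pong entries look like the two sample lines above; the function name is hypothetical, and the string matching is deliberately loose.

```python
def handshake_completed(log_lines):
    """Return True if a director ping is later followed by a pong reply."""
    saw_ping = False
    for line in log_lines:
        if '"method"=>"ping"' in line:
            saw_ping = True
        elif saw_ping and ':value=>"pong"' in line:
            return True
    return False


# Sample lines copied from the log excerpt above.
sample_log = [
    '2013-10-03_14:35:48.58456 #[608] INFO: Message: {"method"=>"ping", '
    '"arguments"=>[], "reply_to"=>'
    '"director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab"}',
    "2013-10-03_14:35:48.60182 #[608] INFO: reply_to:   "
    "director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab: "
    'payload: {:value=>"pong"}',
]
print(handshake_completed(sample_log))
```

On the VM, you could pass the lines of /var/vcap/bosh/log/current to the same function, for example with open("/var/vcap/bosh/log/current") as the iterable.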

If you do not see the handshake, look for another line near the beginning of the file, prefixed INFO: loaded new infrastructure settings. For example:

2013-10-03_14:35:21.83222 #[608] INFO: loaded new infrastructure settings:
{"vm"=>{"name"=>"vm-4d80ede4-b0a5-4992-aea6a0386e18e", "id"=>"vm-360"},
"agent_id"=>"56aea4ef-6aa9-4c39-8019-7024ccfdde4",
"networks"=>{"default"=>{"ip"=>"192.0.2.19",
"netmask"=>"255.255.255.0", "cloud_properties"=>{"name"=>"VMNetwork"},
"default"=>["dns", "gateway"],
"dns"=>["192.0.2.2", "192.0.2.17"], "gateway"=>"192.0.2.2",
"dns_record_name"=>"0.nats.default.cf-d729343071061.microbosh",
"mac"=>"00:50:56:9b:71:67"}}, "disks"=>{"system"=>0, "ephemeral"=>1,
"persistent"=>{}}, "ntp"=>[], "blobstore"=>{"provider"=>"dav",
"options"=>{"endpoint"=>"http://192.0.2.17:25250",
"user"=>"agent", "password"=>"agent"}},
"mbus"=>"nats://nats:PASSWORD@NATS-IP:4222",
"env"=>{"bosh"=>{"password"=>"$6$40ftQ9K4rvvC/8ADZHW0"}}}

This is a blob of key-value pairs representing the infrastructure settings that the BOSH agent expects.

For this issue, the following section is the most important:

"mbus"=>"nats://nats:PASSWORD@NATS-IP:4222"

This key-value pair represents where the agent expects the NATS server to be. One diagnostic tactic is to try pinging this NATS IP address from the VM to determine whether you are experiencing routing issues.
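As a complement to pinging, you can extract the NATS host and port from the settings and test the NATS port directly. The following is a minimal sketch that uses a TCP connection test in place of ping. The helper names are hypothetical, and the sample credentials and address are placeholders, not values from a real deployment.

```python
import re
import socket
from urllib.parse import urlsplit

# Matches the mbus entry in the agent's infrastructure settings.
MBUS_RE = re.compile(r'"mbus"=>"([^"]+)"')


def nats_endpoint(settings_text):
    """Extract (host, port) of the NATS server from the settings text, or None."""
    match = MBUS_RE.search(settings_text)
    if match is None:
        return None
    url = urlsplit(match.group(1))
    return url.hostname, url.port


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical settings fragment; the password and address are placeholders.
sample = '"mbus"=>"nats://nats:example-password@192.0.2.17:4222"'
print(nats_endpoint(sample))
```

Run from the affected VM, tcp_reachable(*nats_endpoint(settings_text)) returning False suggests a routing or firewall problem between the VM and NATS rather than a problem with the agent itself.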

Install exits with a creates/updates/deletes app failure or with a 403 error

Scenario 1: Your Tanzu Operations Manager install exits with the following 403 error when you attempt to log in to the Apps Manager:

{"type": "step_finished", "id": "apps-manager.deploy"}

/home/tempest-web/tempest/web/vendor/bundle/ruby/1.9.1/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:306:in
`fetch': 403 => Net::HTTPForbidden for https://login.api.example.net/oauth/authorizeresponse_type=code&client_id=portal&redirect_uri=https%3...
-- unhandled response (Mechanize::ResponseCodeError)

Scenario 2: Your Tanzu Operations Manager install exits with a creates/updates/deletes an app (FAILED - 1) error message with the following stack trace:

1) App CRUD creates/updates/deletes an app
   Failure/Error: Unable to find matching line from backtrace
   CFoundry::TargetRefused:
     Connection refused - connect(2)

In either of the previous scenarios, ensure that you have correctly entered your domains in wildcard format:

Browse to the fully qualified domain name (FQDN) of Tanzu Operations Manager.
Click the TAS for VMs tile.
Click HAProxy, then Generate Self-Signed RSA Certificate.
Enter your system and app domains in wildcard format, add any custom domains if needed, and click Save. Refer to TAS for VMs Cloud Controller for explanations of these domain values.

Generate Self-Signed RSA Certificate dialog box
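Before you enter the domains, you can sanity-check that each one is in wildcard format. The following is a minimal sketch that assumes wildcard format means a leading *. followed by at least two DNS labels; the function name and the sample domains are hypothetical.

```python
import re

# One DNS label: letters, digits, and hyphens, with no leading or trailing hyphen.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
# A wildcard domain: "*." followed by two or more labels separated by dots.
WILDCARD_RE = re.compile(rf"^\*\.{LABEL}(?:\.{LABEL})+$")


def is_wildcard_domain(domain):
    """Return True if the domain looks like *.sys.example.com."""
    return WILDCARD_RE.match(domain) is not None


# Hypothetical system and app domains.
for candidate in ["*.sys.example.com", "*.apps.example.com", "sys.example.com"]:
    print(candidate, is_wildcard_domain(candidate))
```

This check covers only the syntax. It does not confirm that the domains resolve or that they match the domains configured in TAS for VMs.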

Install fails when Gateway instances exceed zero

If you configure the number of Gateway instances to be greater than zero for a given product, you create a dependency on TAS for VMs for that product installation. If you attempt to install a product tile with a TAS for VMs dependency before installing TAS for VMs, the install fails.

To change the number of Gateway instances, click the product tile, select Settings, then select Resource sizes. In the INSTANCES column, change the value next to the product Gateway job.

To remove the TAS for VMs dependency, change the value of this field to 0.

Out of disk space error

Tanzu Operations Manager displays an Out of Disk Space error if log files expand to fill all available disk space. If this happens, rebooting the Tanzu Operations Manager installation VM clears the tmp directory of these log files and resolves the error.

If you receive Out of Disk Space errors when trying to push apps, your Diego cells might be running out of disk capacity.

To perform a detailed analysis of disk usage by containers and host VMs in your Tanzu Operations Manager deployment, see Examining GrootFS disk usage.
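For a quick first check on a single VM, you can measure how full a file system is before you start a detailed analysis. The following is a minimal sketch using the Python standard library; the 90 percent threshold and the function names are assumptions chosen for illustration.

```python
import shutil
import tempfile


def disk_usage_percent(path):
    """Return the percentage of the file system containing path that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path, threshold_percent=90.0):
    """Return True if usage is at or above the (assumed) warning threshold."""
    return disk_usage_percent(path) >= threshold_percent


# Check the file system that holds the temporary directory.
tmp_dir = tempfile.gettempdir()
print(f"{tmp_dir}: {disk_usage_percent(tmp_dir):.1f}% used")
```

This reports usage for the whole file system, so it shows that a disk is filling up but not which files are responsible.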

Installing BOSH Director fails in vSphere

If the DNS information for the Tanzu Operations Manager VM is incorrectly specified when deploying the Tanzu Operations Manager .ova file, installing BOSH Director fails at the “Installing Micro BOSH” step.

To resolve this issue, correct the DNS settings in the Tanzu Operations Manager Virtual Machine properties.

Deleting a deployment fails

Tanzu Operations Manager displays an error message when it cannot delete your installation. This scenario might happen if the BOSH Director cannot access the VMs or is experiencing other issues. To manually delete your installation and all VMs, you must do the following:

Use your IaaS dashboard to manually delete the VMs for all installed products, with the exception of the Tanzu Operations Manager VM.
SSH into your Tanzu Operations Manager VM and remove the installation.yml file from /var/tempest/workspaces/default/.

Deleting the installation.yml file does not prevent you from reinstalling a deployment. For future deploys, Tanzu Operations Manager regenerates this file when you click Save on any page in the BOSH Director tile.

Your installation is now deleted.

Installing TAS for VMs fails

The following sections describe errors that might occur when installing TAS for VMs and how to resolve them.

TAS for VMs cannot verify app push

If the DNS information for the Tanzu Operations Manager VM becomes incorrect after BOSH Director has been installed, installing TAS for VMs fails at the “Verifying app push” step.

To resolve this issue, correct the DNS settings in the Tanzu Operations Manager Virtual Machine properties.

MySQL Monitor not running after update

When MySQL cannot communicate with UAA, Tanzu Operations Manager shows the following error:

Error: 'mysql_monitor/12a3b456-cd7e-8fgh-9012-345678b90ijk (0)' is not running after update. Review logs for failed jobs: replication-canary.

If you see this error, create firewall rules that allow MySQL to reach UAA, using MySQL Network Communications as a reference.

Tanzu Operations Manager hangs during MicroBOSH install or HAProxy states “IP Address Already Taken”

During a TAS for VMs installation, you might receive the following errors:

The Tanzu Operations Manager GUI shows that the installation stops at the “Setting MicroBOSH deployment manifest” task.
When you set the IP address for the HAProxy, the “IP Address Already Taken” message appears.

When you install the BOSH Director, you assign it an IP address. Tanzu Operations Manager then takes the next two consecutive IP addresses, assigns the first to MicroBOSH, and reserves the second. For example:

203.0.113.1 - Tanzu Operations Manager (User assigned)
203.0.113.2 - MicroBOSH (Tanzu Operations Manager assigned)
203.0.113.3 - Reserved (Tanzu Operations Manager reserved)

To resolve this issue, ensure that the two IP addresses immediately following the manually assigned address are unassigned.
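The addresses to keep free can be computed from the manually assigned address. The following is a minimal sketch using the Python ipaddress module; the function names and the list of addresses already in use are hypothetical.

```python
import ipaddress


def reserved_addresses(assigned_ip, count=2):
    """Return the next `count` consecutive addresses after the assigned one."""
    base = ipaddress.ip_address(assigned_ip)
    return [str(base + offset) for offset in range(1, count + 1)]


def conflicts(assigned_ip, addresses_in_use):
    """Return the reserved addresses that are already taken by something else."""
    in_use = {str(ipaddress.ip_address(address)) for address in addresses_in_use}
    return [a for a in reserved_addresses(assigned_ip) if a in in_use]


# The assigned address matches the example above; the in-use list is hypothetical.
print(reserved_addresses("203.0.113.1"))
print(conflicts("203.0.113.1", ["203.0.113.3", "203.0.113.10"]))
```

An empty result from conflicts means that neither of the two following addresses appears in the list you supplied. It does not detect addresses that are in use but missing from that list.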

Poor performance

If you notice poor network performance in your deployment and your deployment uses a Network Address Translation (NAT) gateway, your NAT gateway might be under-resourced.

Troubleshoot

To troubleshoot the issue, set a custom firewall rule in your IaaS console to route traffic originating from your private network directly to an S3-compatible object store, bypassing the NAT gateway. If average latency decreases and network performance improves, the NAT gateway is the likely bottleneck, and you can scale it up as described in the following section.
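To make the before-and-after comparison concrete, you can time TCP connections to the object store endpoint and compare the averages. The following is a minimal sketch: the function names are hypothetical, connection time is only a rough proxy for overall network performance, and the latency figures in the example are invented for illustration.

```python
import socket
import statistics
import time


def connect_latency_ms(host, port, timeout=5.0):
    """Time one TCP connection to host:port and return it in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def average_latency_ms(samples):
    """Return the mean of a list of latency samples in milliseconds."""
    return statistics.mean(samples)


def improvement_percent(before_ms, after_ms):
    """Return how much lower the average latency is after the change, in percent."""
    before = average_latency_ms(before_ms)
    after = average_latency_ms(after_ms)
    return 100.0 * (before - after) / before


# Invented sample measurements, taken through the NAT gateway and then directly.
through_nat = [100.0, 120.0]
direct = [50.0, 60.0]
print(f"{improvement_percent(through_nat, direct):.1f}% lower average latency")
```

In practice, you would collect each list by calling connect_latency_ms several times against your object store endpoint, once with the custom firewall rule in place and once without it.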

Scale up your NAT gateway

Complete the following steps to scale up your NAT gateway:

Go to your IaaS console.
Spin up a new NAT gateway of a larger VM size than your previous NAT gateway.
Change the routes to direct traffic through the new NAT gateway.
Spin down the old NAT gateway.

The specific procedures vary depending on your IaaS. Consult your IaaS documentation for more information.

Common issues caused by firewalls

This section describes various issues you might encounter when installing Tanzu Operations Manager in an environment that uses a strong firewall.

DNS resolution fails

When you install Tanzu Operations Manager in an environment that uses a strong firewall, the firewall might block DNS resolution. To resolve this issue, see Verify Tanzu Operations Manager resolves DNS Entries behind a firewall in Preparing Your Firewall.
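To confirm whether DNS resolution is the problem, you can check from the affected VM which hostnames fail to resolve. The following is a minimal sketch using the Python socket module. The hostnames are hypothetical, and a stub resolver stands in for real DNS so that the example runs offline.

```python
import socket


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if the resolver returns at least one address for hostname."""
    try:
        return len(resolver(hostname, None)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def unresolved(hostnames, resolver=socket.getaddrinfo):
    """Return the hostnames that could not be resolved."""
    return [name for name in hostnames if not resolves(name, resolver)]


def stub_resolver(hostname, port):
    """Offline stand-in for DNS: only one hypothetical name resolves."""
    if hostname == "opsmanager.example.com":
        return [("placeholder-address-record",)]
    raise socket.gaierror("name not known")


names = ["opsmanager.example.com", "login.sys.example.com"]
print(unresolved(names, resolver=stub_resolver))
```

To test against real DNS, omit the resolver argument so that socket.getaddrinfo is used. Names that appear in the result are candidates for firewall or DNS configuration changes.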