This guide is intended to provides assistance with resolving issues encountered when deploying products with VMware Tanzu Operations Manager (Ops Manager).
An installation or update can fail for many reasons, but the system is self-healing, and can often automatically correct or work around hardware or network faults.
Click Install or Review Pending Changes, then Apply Changes again, and the system may resolve a problem on its own.
Some failures only produce generic errors like Exited with 1
. In cases like this, where a failure is not accompanied by useful information, click Install or Review Pending Changes, then Apply Changes to retry.
When the system does provide informative evidence, see Common Issues section below to see if your problem is covered.
Compare evidence that you have gathered to the descriptions below. If your issue is covered, try the recommended remediation procedures.
You might want to reinstall BOSH for troubleshooting purposes. However, if Ops Manager does not detect any changes, BOSH does not reinstall. To force a reinstall of BOSH, select BOSH Director > Resource Sizes and change a resource value. For example, you could increase the amount of RAM by 4 MB.
This task happens immediately following package compilation, but before job assignment to agents. For example:
cloud_controller/0: Timed out pinging to f690db09-876c-475e-865f-2cece06aba79 after 600 seconds (00:10:24)
This is most likely a NATS issue with the VM in question. To identify a NATS issue, inspect the agent log for the VM. Since the BOSH director is unable to reach the BOSH agent, you must access the VM using another method. You will likely also be unable to access the VM using TCP. In this case, access the VM using your virtualization console.
To diagnose:
Access the VM using your virtualization console and log in.
Navigate to the Credentials tab of the VMware Tanzu Application Service for VMs (TAS for VMs) tile and locate the VM in question to find the VM credentials.
Become root.
Run cd /var/vcap/bosh/log
.
Open the file current
.
First, determine whether the BOSH agent and director have successfully completed a handshake, represented in the logs as a “ping-pong”:
2013-10-03_14:35:48.58456 #[608] INFO: Message: {"method"=>"ping", "arguments"=>[], "reply_to"=>"director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab"} 2013-10-03_14:35:48.60182 #[608] INFO: reply_to: director.f4b7df14-cb8f.19719508-e0dd-4f53-b755-58b6336058ab: payload: {:value=>"pong"}
This handshake must complete for the agent to receive instructions from the director.
If you do not see the handshake, look for another line near the beginning of the file, prefixed INFO: loaded new infrastructure settings
. For example:
2013-10-03_14:35:21.83222 #[608] INFO: loaded new infrastructure settings: {"vm"=>{"name"=>"vm-4d80ede4-b0a5-4992-aea6a0386e18e", "id"=>"vm-360"}, "agent_id"=>"56aea4ef-6aa9-4c39-8019-7024ccfdde4", "networks"=>{"default"=>{"ip"=>"192.0.2.19", "netmask"=>"255.255.255.0", "cloud_properties"=>{"name"=>"VMNetwork"}, "default"=>["dns", "gateway"], "dns"=>["192.0.2.2", "192.0.2.17"], "gateway"=>"192.0.2.2", "dns_record_name"=>"0.nats.default.cf-d729343071061.microbosh", "mac"=>"00:50:56:9b:71:67"}}, "disks"=>{"system"=>0, "ephemeral"=>1, "persistent"=>{}}, "ntp"=>[], "blobstore"=>{"provider"=>"dav", "options"=>{"endpoint"=>"http://192.0.2.17:25250", "user"=>"agent", "password"=>"agent"}}, "mbus"=>"nats://nats:nats@192.0.2.17:4222", "env"=>{"bosh"=>{"password"=>"$6$40ftQ9K4rvvC/8ADZHW0"}}}
This is a JSON blob of key-value pairs representing the expected infrastructure for the BOSH agent. For this issue, the following section is the most important:
"mbus"=>"nats://nats:nats@192.0.2.17:4222"
This key-value pair represents where the agent expects the NATS server to be. One diagnostic tactic is to try pinging this NATS IP address from the VM to determine whether you are experiencing routing issues.
Scenario 1: Your Ops Manager install exits with the following 403 error when you attempt to log in to the Apps Manager:
{"type": "step_finished", "id": "apps-manager.deploy"}
/home/tempest-web/tempest/web/vendor/bundle/ruby/1.9.1/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:306:in
`fetch': 403 => Net::HTTPForbidden for https://login.api.example.net/oauth/authorizeresponse_type=code&client_id=portal&redirect_uri=https%3...
-- unhandled response (Mechanize::ResponseCodeError)
Scenario 2: Your Ops Manager install exits with a creates/updates/deletes an app (FAILED - 1)
error message with the following stack trace:
1) App CRUD creates/updates/deletes an app
Failure/Error: Unable to find matching line from backtrace
CFoundry::TargetRefused:
Connection refused - connect(2)
In either of the above scenarios, ensure that you have correctly entered your domains in wildcard format:
Browse to the fully qualified domain name (FQDN) of Ops Manager.
Click the TAS for VMs tile.
Select HAProxy and click Generate Self-Signed RSA Certificate.
Enter your system and app domains in wildcard format, as well as optionally any custom domains, and click Save. Refer to TAS for VMs Cloud Controller for explanations of these domain values.
If you configure the number of Gateway instances to be greater than zero for a given product, you create a dependency on TAS for VMs for that product installation. If you attempt to install a product tile with a TAS for VMs dependency before installing TAS for VMs, the install fails.
To change the number of Gateway instances, click the product tile, then select Settings > Resource sizes > INSTANCES and change the value next to the product Gateway job.
To remove the TAS for VMs dependency, change the value of this field to 0
.
Ops Manager displays an Out of Disk Space
error if log files expand to fill all available disk space. If this happens, rebooting the Ops Manager installation VM clears the tmp directory of these log files and resolves the error.
If users receive Out of Disk Space
errors when trying to push apps, this can mean that Diego cells may be running out of disk space capacity.
To perform a detailed analysis of disk usage by containers and host VMs in your Ops Manager deployment, see Examining GrootFS Disk Usage.
If the DNS information for the Ops Manager VM is incorrectly specified when deploying the Ops Manager .ova file, installing BOSH Director fails at the “Installing Micro BOSH” step.
To resolve this issue, correct the DNS settings in the Ops Manager Virtual Machine properties.
Ops Manager displays an error message when it cannot delete your installation. This scenario might happen if the BOSH Director cannot access the VMs or is experiencing other issues. To manually delete your installation and all VMs, you must do the following:
Use your IaaS dashboard to manually delete the VMs for all installed products, with the exception of the Ops Manager VM.
SSH into your Ops Manager VM and remove the installation.yml
file from /var/tempest/workspaces/default/
.
Note: Deleting the installation.yml
file does not prevent you from reinstalling a deployment. For future deploys, Ops Manager regenerates this file when you click Save on any page in the BOSH Director tile.
Your installation is now deleted.
The following sections describe errors that may occur when installing TAS for VMs and how to resolve them.
If the DNS information for the Ops Manager VM becomes incorrect after BOSH Director has been installed, installing TAS for VMs fails at the “Verifying app push” step.
To resolve this issue, correct the DNS settings in the Ops Manager Virtual Machine properties.
When MySQL cannot communicate with UAA, Ops Manager shows the following error:
Error: 'mysql_monitor/12a3b456-cd7e-8fgh-9012-345678b90ijk (0)' is not running after update. Review logs for failed jobs: replication-canary.
If you see this error, create firewall rules that allow MySQL to reach UAA, using MySQL Network Communications as a reference.
During a TAS for VMs installation, you might receive the following errors:
The Ops Manager GUI shows that the installation stops at the “Setting MicroBOSH deployment manifest” task.
When you set the IP address for the HAProxy, the “IP Address Already Taken” message appears.
When you install the BOSH Director, you assign it an IP address. Ops Manager then takes the next two consecutive IP addresses, assigns the first to MicroBOSH, and reserves the second. For example:
203.0.113.1 - Ops Manager (User assigned)
203.0.113.2 - MicroBOSH (Ops Manager assigned)
203.0.113.3 - Reserved (Ops Manager reserved)
To resolve this issue, ensure that the next two subsequent IP addresses from the manually assigned address are unassigned.
If you notice poor network performance by your deployment and your deployment uses a Network Address Translation (NAT) gateway, your NAT gateway may be under-resourced.
To troubleshoot the issue, set a custom firewall rule in your IaaS console to route traffic originating from your private network directly to an S3-compatible object store. If you see decreased average latency and improved network performance, perform the solution below to scale up your NAT gateway.
Perform the following steps to scale up your NAT gateway:
Navigate to your IaaS console.
Spin up a new NAT gateway of a larger VM size than your previous NAT gateway.
Change the routes to direct traffic through the new NAT gateway.
Spin down the old NAT gateway.
The specific procedures will vary depending on your IaaS. Consult your IaaS documentation for more information.
This section describes various issues you might encounter when installing Ops Manager in an environment that uses a strong firewall.
When you install Ops Manager in an environment that uses a strong firewall, the firewall might block DNS resolution. To resolve this issue, see Verify Ops Manager Resolves DNS Entries Behind a Firewall in Preparing Your Firewall.