Troubleshooting IPsec for VMware Tanzu

This topic gives you instructions to verify that strongSwan-based IPsec for VMware Tanzu works with your deployment and provides general recommendations for troubleshooting.

Verify that IPsec Works with your deployment

To verify that IPsec works between two hosts, you can check that traffic is encrypted in the deployment with tcpdump, perform the ping test, and check the logs with the steps below.

Check traffic encryption and perform the ping test. Select two hosts in your deployment with IPsec enabled and note their IP addresses. These are referenced below as IP-ADDRESS-1 and IP-ADDRESS-2.
1. SSH into IP-ADDRESS-1.
```
$ ssh IP-ADDRESS-1
```
2. On the first host, run the following, and allow it to continue running.
```
$ tcpdump host IP-ADDRESS-2
```
3. From a separate command line, run the following:
```
$ ssh IP-ADDRESS-2
```
4. On the second host, run the following:
```
$ ping IP-ADDRESS-1
```
5. Verify that the packet type is ESP. If the traffic shows ESP as the packet type, traffic is successfully encrypted. The output from tcpdump looks similar to the following:
```
03:01:15.242731 IP IP-ADDRESS-2 > IP-ADDRESS-1: ESP(spi=0xcfdbb261,seq=0x3), length 100
```
Open the /var/log/daemon.log file to obtain a detailed report, including information pertaining to the type of certificates you use, and to verify an established connection exists.
Navigate to your Installation Dashboard, and click Recent Install Logs to view information regarding your most recent deployment. Search for “ipsec” and the status of the IPsec job.
Run ipsec statusall to return a detailed status report regarding your connections. The typical path for this binary: /var/vcap/packages/strongswan-x.x.x/sbin. x.x.x represents the version of strongSwan packaged into IPsec.

If you experience symptoms that IPsec does not establish a secure connection, return to the Installing IPsec for VMware Tanzu topic and review your installation.

If you encounter issues with installing IPsec, refer to the Troubleshooting IPsec section of this topic.

Troubleshoot IPsec

IPsec Installation Issues

Symptom

Unresponsive applications or incomplete responses, particularly for large payloads

Explanation: Packet Loss

IPsec packet encryption increases the size of packet payloads on host VMs. If the size of the larger packets exceeds the maximum transmission unit (MTU) size of the host VM, packet loss might occur when the VM forwards those packets.

If your VMs were created with an Amazon PV stemcell, the default MTU value is 1500 for both host VMs and the application containers. If your VMs were created with Amazon HVM stemcells, the default MTU value is 9001. Garden containers default to 1500 MTU.

Solution

Either decrease the MTU of the application containers or increase the MTU of the application container VMs. The MTU of the application containers must be at least 100 MTU lower than the MTU of the application container VMs.

Do one of the following:

Decrease the MTU of the application containers.
1. In the TAS for VMs tile configuration, click Networking.
2. Decrease Applications Network Maximum Transmission Unit (MTU) (in bytes) from the default value of 1454 to 1354.
Increase the MTU of the application container VMs.

On a command line, run the following command
```
ifconfig NETWORK-INTERFACE mtu MTU-VALUE
```
Where:
- NETWORK-INTERFACE is the network interface used to communicate with other VMs.
- MTU-VALUE is a value greater than 1500. VMware recommends a headroom of 100.
For example:
```
  $ ifconfig example-interface mtu 1600
```

Symptom

Unresponsive applications or incomplete responses, particularly for large payloads

Explanation: Network Degradation

IPsec data encryption increases the size of packet payloads. If the number of requests and the size of your files are large, the network might degrade.

Solution

Scale your deployment by allocating more processing power to your VM CPU or GPUs, which, additionally, decreases the packet encryption time. One way to increase network performance is to compress the data prior to encryption. This approach increases performance by reducing the amount of data transferred.

Symptom

IPsec fails to deploy with one of the following errors:

IPsec detected that the CA certificate and instance certificate have the same serial number.
Please ensure that the serials are not the same and redeploy.`

IPsec detected that the CA certificate and instance certificate have the same subject.
Please ensure that the subjects are not the same and redeploy.`

Invalid CA/instance certificate.
This might be caused by a duplicate subject or serial number.`

Explanation: Same Serial Number/Common Name

Every certificate in the certificate chain must have a unique serial number and common name, also called the subjectName.

For example, a root certificate cannot have the same common name as an intermediate certificate; an instance certificate cannot have the same serial number as an intermediate certificate.

Solution

Follow the steps in Procedure 1: Rotate the Instance Certificate and Instance Private Key to generate a new chain of certificates and replace the old chain.

When you generate the new certificates, ensure no certificates have a duplicate serial numbers or common names.

IPsec Runtime Issues

Symptom

App fails to start with the following message:

FAILED
Server error,
status code: 500,
error code: 10001,
message: An unknown error occurred.

The Cloud Controller log shows it is unable to communicate with Diego due to getaddrinfo failing.

Deployment fails with a similar error message:

FAILED
diego_database-partition-620982d595434269a96a/0 (a643c6c0-bc43-411b-b011-58f49fb61a6f)' is not running after update. Review logs for failed jobs: etcd

Explanation: Split Brain `consul`

This error indicates a “split brain” issue with Consul.

Solution

Confirm this diagnosis by checking the peers.json file from /var/vcap/store/consul_agent/raft. If it is null, then there might be a split brain. To fix this problem, follow these steps:

Run monit stop on all Consul servers:
Run rm -rf /var/vcap/store/consul_agent/ on all Consul servers.
Run monit start consul_agent on all Consul servers one at a time.
Restart the consul_agent process on the Cloud Controller VM. You might need to restart consul_agent on other VMs, as well.

Symptom

You see that communication is not encrypted between two VMs.

Explanation: Error in Network Configuration

The IPsec BOSH job is not running on either VM. This problem could happen if both IPsec jobs crash, both IPsec jobs fail to start, or the subnet configuration is incorrect. There is a momentary gap between the time when an instance is created and when BOSH sets up IPsec. During this time, data can be sent unencrypted. This length of time depends on the instance type, IAAS, and other factors. For example, on a t2.micro on Amazon Web Services (AWS), the time from networking start to IPsec connection was measured at 95.45 seconds.

Solution

Set up a networking restriction on host VMs to only allow IPsec protocol and block the normal TCP/UDP traffic. For example, in AWS, configure a network security group with the minimal networking setting as shown below and block all other TCP and UDP ports.

Additional AWS Configuration
Type	Protocol	Port Range	Source
Custom Protocol	AH (51)	All	10.0.0.0/16
Custom Protocol	ESP (50)	All	10.0.0.0/16
Custom UDP Rule	UDP	500	10.0.0.0/16

Note When configuring a network security group, IPsec adds an additional layer to the original communication protocol. If a certain connection is targeting a port number, for example port 8080 with TCP, it actually uses IP protocol 50/51 instead. Due to this detail, traffic targeted at a blocked port might be able to go through.

Symptom

You see unencrypted app messages in the logs.

Explanation: `etcd` Split Brain

Solution

Check for split brain etcd by connecting with BOSH ssh into each etcd node:
```
$ curl localhost:4001/v2/members
```
Check if the members are consistent on all of etcd. If a node has only itself as a member, it has formed its own cluster and developed “split brain.” To fix this issue, SSH into the split brain VM and run the following commands:
1. ```
$ sudo su -
```
2. ```
# monit stop etcd
```
3. ```
# rm -r /var/vcap/store/etcd
```
4. ```
# monit start etcd
```
Check the logs to confirm the node rejoined the existing cluster.

Symptom

IPsec deployment fails with Error filling in template ‘pre-start.erb’:

Error 100: Unable to render instance groups for deployment. Errors are:
   - Unable to render jobs for instance group 'consul_server-partition-f9c4b18fd83cf3114d7f'. Errors are:
     - Unable to render templates for job 'ipsec'. Errors are:
       - Error filling in template 'pre-start.erb' (line 12: undefined method `each_with_index' for #
 
  )
   - Unable to render jobs for instance group 'nats-partition-f9c4b18fd83cf3114d7f'. Errors are:
     - Unable to render templates for job 'ipsec'. Errors are:
       - Error filling in template 'pre-start.erb' (line 12: undefined method `each_with_index' for #
  
   )

Explanation: Typographical or syntax error in deployment descritor YAML syntax

Solution

Check the deployment descriptor YAML syntax for the CA certificates entry:

releases:
- {name: ipsec, version: 1.9.9}

addons:
- name: ipsec-addon
  jobs:
  - name: ipsec
    release: ipsec
    properties:
      ipsec:
        ipsec_subnets:
        - 10.0.1.1/20
        no_ipsec_subnets:
        - 10.0.1.10/32  # bosh director
        instance_certificate: |
          -----BEGIN CERTIFICATE-----
          MIIEMDCCAhigAwIBAgIRAIvrBY2TttU/LeRhO+V1t0YwDQYJKoZIhvcNAQELBQAw
          ...
          -----END CERTIFICATE-----
        instance_private_key: |
          -----BEGIN EXAMPLE RSA PRIVATE KEY-----
          MIIEogIBAAKCAQEAtAkBjrzr5x9g0aWgyDEmLd7m9u/ZzpK7UScfANLaN7JiNz3c
          ...
          -----END EXAMPLE RSA PRIVATE KEY-----
        ca_certificates:
          - |
          -----BEGIN CERTIFICATE-----
          MIIEUDCCArigAwIBAgIJAJVLBeJ9Wm3TMA0GCSqGSIb3DQEBCwUAMB0xGzAZBgNV
          BAMMElBDRiBJUHNlYyBBZGRPbiBDQTAeFw0xNjA4MTUxNzQwNDVaFw0xOTA4MTUx
          ...
          -----END CERTIFICATE-----

In the previous example, the values that appear after the ca_certificates: key are contained within a list and are not just a single certificate. This entry must be followed by a line starting with -, and ending with |. The lines following this contain the PEM encoded certificate(s).

The error message previously shown indicates a problem with the each_with_index method provides a hint that the - | YAML syntax sequence is missing. Use this syntax even in situations where there is only one CA certificate, for example a list of one entry.

Symptom

Complete system outage with no warning.

Explanation: IPsec Certificates Might Have Expired

Expired IPsec certificates can cause a sudden system outage. For example, the self-signed certificates generated by the script provided in the installation instructions have a lifetime of 365 days. IPsec certificates expire if you do not rotate them within their lifetime.

Solution

Renew expired IPsec certificates. To avoid future downtime due to expired IPsec certificates, set a calendar reminder to rotate the certificates before they expire.

For how to renew certificates, see Renewing Expired IPsec Certificates. For how to rotate them, see Rotating IPsec Certificates.

Symptom

BOSH shows a VM in a failing state. On the failing VM, monit summary shows ipsec with a status of Does not exist.

Explanation: IPsec stopped/crashed and Monit Cannot Automatically Bring it Back Up.

IPsec is required to be the first process to start and the last process to stop. As a result, the start and stop scripts are located in pre-start and post-stop, which is a concept to BOSH, but not to Monit. Monit is not able to bring IPsec back up automatically because it does not know what pre-start is. After you run the pre-start manually, then Monit is able to detect IPsec as healthy.

Solution

From the failing VM, run the following command as root: /var/vcap/jobs/ipsec/bin/pre-start
Run monit summary. If the restart is successful, Monit shows ipsec with a status of Running.

IPsec Load Balancer Issues

Symptom

Smoke test errand fails with error:

Setting api endpoint to https://api.sys.YOUR-DOMAIN...
Request error: Get https://api.sys.YOUR-DOMAIN/v2/info: dial tcp https://api.sys.YOUR-DOMAIN:443: i/o timeout
TIP: If you are behind a firewall and require an HTTP proxy, verify the https_proxy environment variable is correctly set. Else, check your network connection.
FAILED

Failure [15.073 seconds]
[BeforeSuite] BeforeSuite
/var/vcap/packages/smoke_tests/src/github.com/cloudfoundry/cf-smoke-tests/smoke/logging/init_test.go:23

No future change is possible.  Bailing out early after 0.001s.

>>> [ cf api https://api.sys.YOUR-DOMAIN --skip-ssl-validation ] exited with an error

Explanation

Clock global is trying to call an API of an app hosted in VMware Tanzu Application Service for VMs. This API call is routed through your external load balancer to Gorouter, which fails if you are using Network Load Balancers (NLB).

This is because NLBs manipulate the packet so that the source IP in the IP header is clock global’s IP instead of the load balancer’s IP. When clock global communicates with the NLB it does not use IPsec because the NLB’s IP is outside the IPsec subnet. However, when Gorouter sees that the source IP of the packet is clock global, it expects the packet to be encrypted with IPsec. It then drops the packet because it is not encrypted with IPsec, resulting in a timeout.

Solution

Do not use Network Load Balancers as your external load balancer in your IaaS. Use Level 7 balancers. To set one up, follow the steps in the guide for your IaaS below.

For AWS, see the AWS documentation.
For GCP, see the Google Cloud documentation.
For Azure, see the Microsoft Azure documentation.

Verify that IPsec Works with your deployment

Troubleshoot IPsec

IPsec Installation Issues

Symptom

Explanation: Packet Loss

Solution

Symptom

Explanation: Network Degradation

Solution

Symptom

Explanation: Same Serial Number/Common Name

Solution

IPsec Runtime Issues

Symptom

Explanation: Split Brain consul

Solution

Symptom

Explanation: Error in Network Configuration

Solution

Symptom

Explanation: etcd Split Brain

Solution

Symptom

Explanation: Typographical or syntax error in deployment descritor YAML syntax

Solution

Symptom

Explanation: IPsec Certificates Might Have Expired

Solution

Symptom

Explanation: IPsec stopped/crashed and Monit Cannot Automatically Bring it Back Up.

Solution

IPsec Load Balancer Issues

Symptom

Explanation

Solution

Explanation: Split Brain `consul`

Explanation: `etcd` Split Brain