Refer to this section to troubleshoot NSX Advanced Load Balancer issues that you might encounter.

NSX Advanced Load Balancer Configuration Is Not Applied

When you deploy the Supervisor, the deployment does not complete and the NSX Advanced Load Balancer configuration is not applied.

Problem

The configuration of the NSX Advanced Load Balancer does not get applied if you provide a private Certificate Authority (CA) signed certificate.

You might see an "Unable to find certificate chain" error message in the log files of one of the NCP pods running on the Supervisor. To locate the error:

  1. Log in to the Supervisor VM.
  2. List all the pods with the kubectl get pods -A command.
  3. Get the logs from all the NCP pods on the Supervisor.

    kubectl -n vmware-system-nsx logs nsx-ncp-<id> | grep -i alb

Cause

The Java SDK is used to establish communication between NCP and the NSX Advanced Load Balancer Controller. This error occurs when the NSX trust store is not synchronized with the Java certificate trust store.
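Before applying the fix, you can confirm whether the CA certificate is already present in the Java trust store. The helper below is a minimal sketch that checks keytool -list output for an alias; the alias name startssl and the keystore path match the commands in the Solution section.

```shell
# Check whether an alias appears in `keytool -list` output.
# keytool prints each trusted entry as "<alias>, <date>, trustedCertEntry,".
alias_present() {
  grep -qi "^$1,"
}

# Example (run on an NSX Manager node; the keystore path can vary by version):
#   keytool -list -keystore /usr/lib/jvm/jre/lib/security/cacerts \
#     -storepass changeit | alias_present startssl
```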

Solution

  1. Export the root CA certificate from the NSX Advanced Load Balancer and save it on the NSX Manager.
  2. Log in to NSX Manager as a root user.
  3. Run the following commands sequentially on all the NSX Manager nodes:
    keytool -importcert -alias startssl -keystore /usr/lib/jvm/jre/lib/security/cacerts -storepass changeit -file <ca-file-path>

    If the path is not found, run keytool -importcert -alias startssl -keystore /usr/java/jre/lib/security/cacerts -storepass changeit -file <ca-file-path>

    sudo cp <ca-file-path> /usr/local/share/ca-certificates/
    sudo update-ca-certificates
    service proton restart
    Note: You can perform the same steps to assign an intermediate CA certificate.
  4. Wait for the Supervisor deployment to finish. If the deployment does not complete, redeploy the Supervisor.
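The steps above can be sketched as a single script. This is a sketch only: the two keystore locations are the paths from step 3, and the certificate path is whatever location you saved the exported root CA to.

```shell
#!/bin/sh
# Pick the first Java trust store that exists from the candidate paths.
# The location differs between versions, hence the fallback.
find_cacerts() {
  for p in "$@"; do
    if [ -f "$p" ]; then
      echo "$p"
      return 0
    fi
  done
  return 1
}

# On each NSX Manager node (as root), with CA set to the exported root CA path:
#   KS=$(find_cacerts /usr/lib/jvm/jre/lib/security/cacerts \
#                     /usr/java/jre/lib/security/cacerts)
#   keytool -importcert -alias startssl -keystore "$KS" \
#     -storepass changeit -file "$CA"
#   cp "$CA" /usr/local/share/ca-certificates/
#   update-ca-certificates
#   service proton restart
```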

ESXi Host Cannot Enter Maintenance Mode

You place an ESXi host in maintenance mode when you want to perform an upgrade.

Problem

The ESXi host cannot enter maintenance mode, which can impact ESXi and NSX upgrades.

Cause

This can occur if there is a Service Engine in a powered-on state on the ESXi host.

Solution

  • Power off the Service Engine so that the ESXi host can enter maintenance mode.
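On the ESXi host, you can identify the Service Engine VMs with vim-cmd. The helper below is a sketch that pulls VM IDs from vim-cmd vmsvc/getallvms output; it assumes Service Engine VM names contain "avi-se", which is a typical naming pattern but may differ in your deployment.

```shell
# Print the Vmid column for VMs whose name matches a pattern.
# `vim-cmd vmsvc/getallvms` output starts with a header line:
#   Vmid   Name   File   Guest OS   Version   Annotation
se_vmids() {
  awk -v pat="$1" 'NR > 1 && tolower($2) ~ pat { print $1 }'
}

# Example (run on the ESXi host):
#   vim-cmd vmsvc/getallvms | se_vmids 'avi-se'
#   vim-cmd vmsvc/power.off <vmid>    # for each returned id
```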

Troubleshooting IP Address Issues

Follow these troubleshooting tips if you encounter external IP assignment issues.

IP address issues can occur due to the following reasons:
  • Kubernetes resources, such as gateways and ingresses, do not get an external IP from AKO.
  • External IPs that are assigned to Kubernetes resources are not reachable.
  • External IPs are incorrectly assigned.

Kubernetes resources do not get an external IP from AKO

This error occurs when AKO cannot create the corresponding virtual service in the NSX Advanced Load Balancer Controller.

Check if the AKO pod is running. If the pod is running, check the AKO container logs for the error.
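The check above can be sketched with kubectl and a small log filter. This is a sketch only: it assumes AKO runs in the avi-system namespace with a pod named ako-0, which are common defaults but may differ in your environment.

```shell
# Surface error- and warning-level lines from AKO logs.
ako_errors() {
  grep -iE 'error|warn'
}

# Example (cluster access required; namespace and pod name are assumptions):
#   kubectl get pods -n avi-system
#   kubectl -n avi-system logs ako-0 | ako_errors
```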

External IPs assigned to Kubernetes resources are not reachable

This issue can occur for the following reasons:
  • The external IP is not available immediately but starts accepting traffic within a few minutes of creation. This occurs when a new service engine creation is triggered for virtual service placement.
  • The external IP is not available because the corresponding virtual service shows an error.

A virtual service can indicate an error or appear red if there are no servers in the pool. This can occur if the Kubernetes gateway or ingress resource does not point to an endpoint object.

To see the endpoints, run the kubectl get endpoints -n <service_namespace> command and fix any selector label issues.
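A common selector issue is that the service selector does not match the pod labels, which leaves the endpoints object empty. The helper below is a sketch that checks whether every key=value pair in a selector appears in a comma-separated label list; the example label strings are illustrative.

```shell
# Return success if every key=value pair in the selector (comma-separated)
# also appears in the pod's label list (comma-separated).
selector_matches() {
  sel=$1
  labels=$2
  for kv in $(printf '%s' "$sel" | tr ',' ' '); do
    case ",$labels," in
      *",$kv,"*) ;;      # pair found, keep checking
      *) return 1 ;;     # selector pair missing from the labels
    esac
  done
  return 0
}

# Example (cluster access required; my-svc and my-pod are placeholders):
#   kubectl get svc my-svc -o jsonpath='{.spec.selector}'   # service selector
#   kubectl get pod my-pod --show-labels                    # pod labels
#   selector_matches 'app=web' 'app=web,tier=frontend' && echo match
```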

The pool can appear in an error state when the Health Monitor shows the health of the pool servers as red.

Perform the following checks to resolve this issue:
  • Verify that the pool servers or Kubernetes pods are listening on the configured port.
  • Verify that there are no drop rules in the NSX distributed firewall (DFW) that are blocking ingress or egress traffic on the service engines.
  • Ensure that there are no network policies in the Kubernetes environment that are blocking ingress or egress traffic on the service engines.
Service engine issues include the following:
  1. Creation of Service Engines fails.
    Creation of Service Engines can fail due to the following reasons:
    • A license with insufficient resources is used in the NSX Advanced Load Balancer Controller.
    • The number of Service Engines created in a Service Engine Group reached the maximum limit.
    • The Service Engine Data NIC failed to acquire an IP address.
  2. Service Engine creation fails with an Insufficient licensable resources available error message.

    This error occurs if a license with insufficient resources was used to create the Service Engine.

    Obtain a license with a larger quota of resources and assign it to the NSX Advanced Load Balancer Controller.

  3. Service Engine creation fails with a Reached configuration maximum limit error message.

    This error occurs if the number of Service Engines created in a Service Engine Group reached the maximum limit.

    To resolve this error, perform the following steps:
    1. In the NSX Advanced Load Balancer Controller dashboard, select Infrastructure > Cloud Resources > Service Engine Group.
    2. Find the Service Engine group with the same name as the Supervisor in which the IP traffic failure is occurring and click the Edit icon.
    3. Configure a higher value for Number of Service Engines.
  4. The Service Engine Data NIC fails to acquire an IP address.
    This error might occur if the DHCP IP pool has been exhausted for one of the following reasons:
    • Too many Service Engines have been created for a large-scale deployment.
    • A Service Engine was deleted directly from the NSX Advanced Load Balancer UI or the vSphere Client. Such a deletion does not release the DHCP address from the DHCP pool and leads to a LEASE Allocation Failure.

External IPs are incorrectly assigned

This error occurs when two ingresses in different namespaces share the same hostname. Check your configuration and verify that the same hostname is not used by two ingresses in different namespaces.
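To find such conflicts quickly, you can list every namespace/hostname pair and flag hostnames that appear in more than one namespace. The pipeline below is a sketch; the kubectl command in the comment is one way to produce its input, assuming one host rule per ingress.

```shell
# Read "namespace hostname" pairs from stdin and print any hostname that
# is used in more than one namespace. Identical pairs are de-duplicated
# first, so a host repeated within a single namespace is not flagged.
find_shared_hosts() {
  sort -u | awk '{ print $2 }' | sort | uniq -d
}

# Example input source (cluster access required):
#   kubectl get ingress -A -o jsonpath='{range .items[*]}{.metadata.namespace} {.spec.rules[*].host}{"\n"}{end}' \
#     | find_shared_hosts
```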

Troubleshooting Traffic Failure Issues

After you configure the NSX Advanced Load Balancer, traffic failures occur.

Problem

Traffic failures might occur when the endpoint for the service of type LB is in a different namespace.

Cause

In vSphere IaaS control plane environments configured with NSX Advanced Load Balancer, each namespace has a dedicated tier-1 gateway, and each tier-1 gateway has a service engine segment with the same CIDR. Traffic failures might occur if the NSX Advanced Load Balancer service is in one namespace and the endpoints are in a different namespace. The NSX Advanced Load Balancer assigns an external IP to the service, but traffic to that external IP fails.

Solution

  • To allow north-south traffic, create a distributed firewall rule that allows ingress from the SNAT IP of the NSX Advanced Load Balancer service namespace.

Troubleshooting Issues Caused by NSX Backup and Restore

NSX backup and restore can lead to traffic failure for all the external IPs provided by the NSX Advanced Load Balancer.

Problem

When you perform a backup and restore of NSX, it can lead to traffic failure.

Cause

This failure occurs because the Service Engine NICs do not come back up after a restore, and as a result the IP pool shows as down.

Solution

  1. In the NSX Advanced Load Balancer Controller dashboard, select Infrastructure > Clouds.
  2. Select and save the cloud without making any changes, and wait for the status to become green.
  3. Deactivate all the virtual services.
    Wait for the NSX Advanced Load Balancer Controller to remove the stale NICs from all the Service Engines.
  4. Enable all the virtual services.
    The virtual service statuses show as green.
    If traffic failure persists, reconfigure the static routes on the NSX Manager.

Stale Tier-1 Segments after NSX Backup and Restore

NSX backup and restore can restore stale tier-1 segments.

Problem

After an NSX backup and restore procedure, stale tier-1 segments that have Service Engine NICs do not get cleaned up.

Cause

When a namespace is deleted after an NSX backup, the restore operation restores stale tier-1 segments that are associated with the NSX Advanced Load Balancer Controller Service Engine NICs.

Solution

  1. Log in to the NSX Manager.
  2. Select Networking > Segments.
  3. Find the stale segments that are associated with the deleted namespace.
  4. Delete the stale Service Engine NICs from the Ports/Interfaces section.