VMware NSX Container Plugin 4.1.1.5 | 18 June 2024 | Build 24019664

Check for additions and updates to these release notes.

What's New

NSX Container Plugin 4.1.1.5 is an update release that resolves issues found in earlier releases. For other details about this release, see the NSX Container Plugin 4.1.1 Release Notes and the release notes of previous 4.1.1.x releases.

Resolved Issues

  • Issue 3388531 - NCP fails at startup if NSX server certificate has no CN attribute

    If the NSX server certificate specified in NCP configuration is self-signed and does not have the Common Name (CN) attribute, NCP will crash at startup. NCP logs will have the error message "AttributeError: 'NoneType' object has no attribute 'strip'". For TAS and TKGI users, this error message will be found in ncp.stderr.log.

    Workaround: Ensure that the CN attribute is set in the NSX server certificate.
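
    If you need to regenerate the self-signed certificate, a minimal sketch with openssl is shown below; the file names and the CN value are placeholders for your environment:

    $ openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
        -keyout nsx-manager.key -out nsx-manager.crt \
        -subj "/CN=nsx-manager.example.com"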

  • Issue 3310167 - TAS in Policy mode does not work with SNAT disabled

    Starting with NCP 4.1.0, the default "domain group" used for TAS foundation isolation is built from the CIDRs of the IP blocks specified in the NSX tile, for better performance at scale. If SNAT is disabled in the tile configuration, NCP passes an empty CIDR list to NSX. NSX responds with an error because the API request is invalid, and NCP crashes.

    Workaround: Enable SNAT in the foundation, or, to keep SNAT disabled, perform the following steps:

    1. If referencing the NSX IP block by name, get its identifier. This may require using the NSX API as it might not be visible in the NSX Manager UI.

    2. On Diego database nodes, edit ncp.ini in /var/vcap/jobs/ncp/config. Comment out the container_ip_blocks entry. Immediately below that entry, add a no_snat_ip_blocks entry whose value is the ID (or IDs) of the IP blocks used in the TAS tile (see the example snippet after the note below).

    Note: This workaround is not persistent and will be overwritten the next time a deploy operation is performed for the NSX tile. After you upgrade to a version that contains the fix, no action is needed to remove the workaround.
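
    The edited ncp.ini might look like the following minimal sketch, assuming these options live under the [nsx_v3] section; the IP block ID shown is a placeholder and must be replaced with the actual ID from your NSX environment:

    [nsx_v3]
    # container_ip_blocks = <original IP block reference>
    no_snat_ip_blocks = <ip-block-id>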

  • Issue 3376407 - Node security concern in Distributed Firewall rules for pod liveness/readiness probe

    NCP creates Distributed Firewall rules for pod liveness/readiness probes to allow traffic from the node to the pod. In Manager API mode, the rule allows traffic from the node IP to any destination, for both ingress and egress, and is applied to both pod and node logical ports. This is a security concern for the node because it allows all node egress traffic.

    Workaround: Override the pod liveness/readiness probe Distributed Firewall rule and add the node IP in the destination.
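
    A minimal sketch of overriding the rule with the NSX Manager API is shown below; the manager address, credentials, section ID, and rule ID are placeholders, and the exact edit to the "destinations" field depends on your environment:

    # Fetch the current probe rule (placeholders: <nsx-manager>, <section-id>, <rule-id>).
    $ curl -k -u 'admin:<password>' \
        https://<nsx-manager>/api/v1/firewall/sections/<section-id>/rules/<rule-id> > rule.json
    # Edit rule.json to add the node IP in the "destinations" field, keeping the
    # existing "_revision" value, then update the rule:
    $ curl -k -u 'admin:<password>' -X PUT -H 'Content-Type: application/json' \
        -d @rule.json \
        https://<nsx-manager>/api/v1/firewall/sections/<section-id>/rules/<rule-id>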

  • Issue 3327390 - In an OCP environment, nsx-node-agent has high memory usage

    In some situations, the nsx-ovs container inside an nsx-node-agent pod may have high memory usage, and the memory usage keeps increasing. This is caused by the multicast snooping check in the nsx-ovs container.

    Workaround:

    For OpenShift 4.11 or later:

    Step 1. Set enable_ovs_mcast_snooping to False in the nsx-ncp-operator-config ConfigMap:

    [nsx_node_agent]
    enable_ovs_mcast_snooping = False

    Step 2. Disable the OVS liveness probe in the nsx-node-agent DaemonSet. Note that you must disable it again every time the operator restarts, because the NCP Operator reverts the DaemonSet to the default nsx-node-agent manifest.
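
    A minimal sketch of both steps with the oc CLI is shown below; it assumes the operator ConfigMap and the DaemonSet live in the nsx-system-operator and nsx-system namespaces respectively, and that the liveness probe belongs to the nsx-ovs container. Adjust the names to your deployment:

    # Step 1: edit the operator ConfigMap and set enable_ovs_mcast_snooping = False
    # under [nsx_node_agent] (the namespace shown is an assumption).
    $ oc edit configmap nsx-ncp-operator-config -n nsx-system-operator
    # Step 2: remove the livenessProbe block from the nsx-ovs container in the
    # nsx-node-agent DaemonSet (repeat after every operator restart).
    $ oc edit daemonset nsx-node-agent -n nsx-system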

    For OpenShift versions earlier than 4.11:

    Step 1. Run the following command to clear the cache.

    $ echo 2 > /proc/sys/vm/drop_caches
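
    (Writing 2 to /proc/sys/vm/drop_caches frees reclaimable slab objects such as dentries and inodes.)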

    Step 2. Disable the OVS liveness probe in the nsx-node-agent DaemonSet. Note that you must disable it again every time the operator restarts, because the NCP Operator reverts the DaemonSet to the default nsx-node-agent manifest.

  • Issue 3365509 - CLI server is not getting created in nsx-kube-proxy

    Sometimes it takes about 40 seconds for the privsep daemon to start up. In that case, the default "initialDelaySeconds" value of 10 seconds is not enough for the nsx-kube-proxy CLI server to be ready, which causes nsx-kube-proxy to be restarted repeatedly by kubelet due to liveness probe failures.

    Workaround: Increase the value of "initialDelaySeconds" to 60 for the nsx-kube-proxy container.
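
    A minimal sketch of the change in the DaemonSet manifest is shown below; only the initialDelaySeconds value is relevant, and the other probe fields are placeholders for what is already in your manifest:

    containers:
      - name: nsx-kube-proxy
        livenessProbe:
          initialDelaySeconds: 60   # raised from the default of 10 seconds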

  • Issue 3358491 - Application's SNAT rule may be deleted by NCP garbage collector on TAS

    In addition to the default SNAT rule for an Org on TAS, NCP can create an Application-specific SNAT rule. In Manager API mode, the NCP garbage collector may delete the Application's SNAT rule by mistake.

    Workaround: Manually recreate the Application's SNAT rule (see the sketch below). This rule must be manually deleted when the Application is deleted.
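
    A minimal sketch of recreating the rule with the NSX Manager API is shown below; the logical router ID, match CIDR, and translated IP are placeholders for the values used by the deleted rule:

    $ curl -k -u 'admin:<password>' -X POST -H 'Content-Type: application/json' \
        https://<nsx-manager>/api/v1/logical-routers/<router-id>/nat/rules \
        -d '{
              "action": "SNAT",
              "match_source_network": "<application CIDR>",
              "translated_network": "<external IP>",
              "enabled": true
            }'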

  • Issue 3368202 - Default isolation rules accidentally block legitimate traffic on TAS

    NCP in Policy mode uses a static IPSet containing the container CIDRs to enforce isolation for TAS foundations. An issue with the isolation rules prevents applications from reaching the cloud controller VMs. This impacts environments running NCP 4.1.0 through 4.1.1 in Policy mode, including those that migrated from Manager mode to Policy mode.

    Workaround: Update the two firewall rules in the default isolation section for the foundation.

    For the rule with source equal to the container CIDR and destination ANY, the rule's direction must be changed from IN_OUT to OUT.

    For the rule with destination equal to the container CIDR and source ANY, the rule's direction must be changed from IN_OUT to IN.

    If the TAS foundation is configured to use an NSX principal identity, this operation must be performed via the API, specifying the X-Allow-Overwrite: True header (see the sketch below).
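
    A minimal sketch of changing one rule's direction with the NSX Policy API is shown below; the domain, security policy ID, rule ID, and credentials are placeholders, and the X-Allow-Overwrite header is only required when the rules are owned by a principal identity:

    $ curl -k -u 'admin:<password>' -X PATCH \
        -H 'Content-Type: application/json' -H 'X-Allow-Overwrite: True' \
        https://<nsx-manager>/policy/api/v1/infra/domains/<domain-id>/security-policies/<policy-id>/rules/<rule-id> \
        -d '{"direction": "OUT"}'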

  • Issue 3376335 - The privsep helper process is not killed when running the command "monit stop"

    Because the privsep helper process's parent PID is 1, it is not terminated by monit along with the main process of nsx-node-agent or nsx-kube-proxy. Sometimes the hyperbus channel remains established with the orphaned process until the new nsx-node-agent starts running.

    Workaround: If the nsx-node-agent job is still running, the stale process has no impact because the hyperbus channel can be established with the new running process. If the nsx-node-agent job is already stopped, kill the orphaned privsep helper processes manually. Run the command "ps -ef | grep node_agent_pri | grep -v grep" to list all the stale privsep-helper processes, then use "kill -9 $pid" to terminate them one by one (see the one-liner sketch below).
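
    A minimal one-liner that combines those two steps is sketched below; verify the matched processes before running it, since it sends SIGKILL to every PID it finds:

    $ ps -ef | grep node_agent_pri | grep -v grep | awk '{print $2}' | xargs -r kill -9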
