VMware NSX Container Plugin 4.1.2.2 | 2 July 2024 | Build 24047052

Check for additions and updates to these release notes.

What's New

NSX Container Plugin 4.1.2.2 is an update release that resolves issues found in earlier releases. For other details about this release, see the previous 4.1.2.x release notes.

Compatibility Requirements

Product and supported versions:

  • NCP/NSX Tile for Tanzu Application Service (TAS): 4.1.2.2

  • NSX: 3.2.4, 4.1.1, 4.1.2.4

  • Kubernetes: 1.26, 1.27, 1.28

  • OpenShift 4: 4.12, 4.13, 4.14

  • Kubernetes Host VM OS (see notes below):
    Ubuntu 20.04
    Ubuntu 22.04 with kernel 5.15 (both the nsx-ovs kernel module and the upstream OVS kernel module are supported)
    Ubuntu 22.04 with a kernel later than 5.15 (only the upstream OVS kernel module is supported)
    RHEL 8.8, 8.9, 9.2, 9.3

  • Tanzu Application Service (TAS):
    Ops Manager 2.10 + TAS 2.13
    Ops Manager 3.0 + TAS 2.13
    Ops Manager 2.10 + TAS 4.0
    Ops Manager 3.0 + TAS 4.0
    Ops Manager 2.10 + TAS 5.0
    Ops Manager 3.0 + TAS 5.0

  • Tanzu Kubernetes Grid Integrated (TKGI): 1.18.2, 1.19.0

Notes:

For all supported integrations, use the Red Hat Universal Base Image (UBI). For more information, see https://www.redhat.com/en/blog/introducing-red-hat-universal-base-image.

Support for upgrading to this release:

  • 4.1.0, 4.1.1, and all 4.0.x releases.

Resolved Issues

  • Issue 3368202: Default isolation rules accidentally block legitimate traffic on TAS

    NCP in policy mode uses a static IPSet that includes the container CIDRs to enforce isolation for TAS foundations. An issue with the isolation rules prevents applications from reaching the cloud controller VMs. This impacts environments running NCP in policy mode in versions 4.1.0 through 4.1.1, including those that migrated from manager mode to policy mode.

    Workaround: Update the two firewall rules in the default isolation section for the foundation.

    For the rule with source equal to the container CIDR and destination ANY, the rule's direction must be changed from IN_OUT to OUT.

    For the rule with destination equal to the container CIDR and source ANY, the rule's direction must be changed from IN_OUT to IN.

    If the TAS foundation is configured to use an NSX principal identity, this operation must be performed via the API, specifying the X-Allow-Overwrite: True header.
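
    A minimal sketch of this change through the NSX Policy API, assuming the rules live in a security policy under the default domain; the manager address, credentials, <section-id>, and <rule-id> are placeholders for your environment:

    # Fetch the rule, change "direction" (IN_OUT to OUT or IN as described above), then upload it again.
    # The X-Allow-Overwrite header is needed only when the objects belong to a principal identity.
    $ curl -k -u admin -H 'X-Allow-Overwrite: True' \
        https://<nsx-manager>/policy/api/v1/infra/domains/default/security-policies/<section-id>/rules/<rule-id> > rule.json
    $ curl -k -u admin -X PUT -d @rule.json -H 'Content-Type: application/json' -H 'X-Allow-Overwrite: True' \
        https://<nsx-manager>/policy/api/v1/infra/domains/default/security-policies/<section-id>/rules/<rule-id>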

  • Issue 3388531: NCP fails at startup if NSX server certificate has no CN attribute

    If the NSX server certificate specified in NCP configuration is self-signed and does not have the Common Name (CN) attribute, NCP will crash at startup. NCP logs will have the error message "AttributeError: 'NoneType' object has no attribute 'strip'". For TAS and TKGI users, this error message will be found in ncp.stderr.log.

    Workaround: Ensure that the CN attribute is set in the NSX server certificate.
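
    One way to verify the certificate before configuring it in NCP (the file name below is an example):

    # The printed subject should contain a "CN =" entry; if it does not, reissue the certificate with a CN.
    $ openssl x509 -in nsx-server-cert.pem -noout -subject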

  • Issue 3365509: CLI server is not getting created in nsx-kube-proxy

    Sometimes it takes about 40 seconds for the privsep daemon to start up. Therefore, the "initialDelaySeconds" value of 10 seconds is not enough for the nsx-kube-proxy CLI server to be ready, and kubelet repeatedly restarts nsx-kube-proxy due to liveness probe failures.

    Workaround: Increase the value of "initialDelaySeconds" to 60 for the nsx-kube-proxy container.
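
    A hedged example of applying this on Kubernetes; the namespace, DaemonSet name, and container index below are typical for NCP deployments and may differ in your environment:

    # Index 1 is assumed to be the nsx-kube-proxy container; verify with
    # "kubectl -n nsx-system get ds nsx-node-agent -o yaml" before patching.
    $ kubectl -n nsx-system patch daemonset nsx-node-agent --type=json \
        -p='[{"op": "replace", "path": "/spec/template/spec/containers/1/livenessProbe/initialDelaySeconds", "value": 60}]'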

  • Issue 3358491: Application's SNAT rule may be deleted by NCP garbage collector on TAS

    Besides the default SNAT rule for an Org on TAS, NCP can create a specific SNAT rule for an Application. In manager API mode, the NCP garbage collector may delete the Application's SNAT rule by mistake.

    Workaround: Manually recreate the Application's SNAT rule. This rule must be manually deleted when the Application is deleted.
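
    A minimal sketch of recreating the rule through the NSX manager API (manager mode); the tier-1 logical router ID, source network, and translated IP are placeholders and must match the rule that NCP originally created:

    $ curl -k -u admin -X POST -H 'Content-Type: application/json' \
        https://<nsx-manager>/api/v1/logical-routers/<tier-1-router-id>/nat/rules \
        -d '{"action": "SNAT", "enabled": true, "match_source_network": "<application instance IP/CIDR>", "translated_network": "<application SNAT IP>"}'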

  • Issue 3327390: In an OCP environment, nsx-node-agent has high memory usage

    In some situations, the nsx-ovs container inside an nsx-node-agent pod may have high memory usage, and the memory usage keeps increasing. This is caused by the multicast snooping check in the nsx-ovs container.

    Workaround:

    For OpenShift 4.11 or later:

    Step 1. Set enable_ovs_mcast_snooping to False in the nsx-ncp-operator-config ConfigMap:

    [nsx_node_agent]
    enable_ovs_mcast_snooping = False

    Step 2. Disable the OVS liveness probe in the nsx-node-agent DaemonSet (see the sketch at the end of this workaround). Note that you must disable it again every time the operator restarts, because the NCP operator reverts to the default nsx-node-agent DaemonSet manifest.

    For OpenShift versions earlier than 4.11:

    Step 1. Run the following command to clear the cache.

    $ echo 2 > /proc/sys/vm/drop_caches

    Step 2. Disable the OVS liveness probe in the nsx-node-agent DaemonSet. Note that you must disable it again every time the operator restarts, because the NCP operator reverts to the default nsx-node-agent DaemonSet manifest.
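
    A hedged sketch of Step 2 for either OpenShift version; the namespace, DaemonSet name, and container index below are the usual defaults for NCP and may differ in your environment:

    # Index 0 is assumed to be the nsx-ovs container; verify with
    # "kubectl -n nsx-system get ds nsx-node-agent -o yaml" before patching.
    $ kubectl -n nsx-system patch daemonset nsx-node-agent --type=json \
        -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'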

  • Issue 3376407: Node security concern in Distributed Firewall rules for pod liveness/readiness probe

    NCP creates Distributed Firewall rules for pod liveness/readiness probes to allow traffic from the node to the pod. In manager API mode, the rule allows traffic from the node IP to any destination for both ingress and egress, and is applied to both pod and node logical ports. This is a security concern for the node because it allows all node egress traffic.

    Workaround: Override the Distributed Firewall rule for the pod liveness/readiness probe and add the node IP to the destination.
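
    A minimal sketch of this override in manager API mode; <section-id> and <rule-id> identify the probe rule that NCP created and are placeholders:

    # Fetch the rule, add the node IP to its "destinations" list in probe-rule.json, then upload the edited rule.
    $ curl -k -u admin https://<nsx-manager>/api/v1/firewall/sections/<section-id>/rules/<rule-id> > probe-rule.json
    $ curl -k -u admin -X PUT -d @probe-rule.json -H 'Content-Type: application/json' \
        https://<nsx-manager>/api/v1/firewall/sections/<section-id>/rules/<rule-id>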

  • Issue 3376335: The privsep helper process is not killed when running the command "monit stop"

    Because the privsep helper process's parent PID is 1, it is not terminated by monit along with the main process of nsx-node-agent or nsx-kube-proxy. Sometimes the hyperbus channel remains established with the orphan process until the new nsx-node-agent starts running.

    Workaround: If the nsx-node-agent job is still running, the stale process has no impact because the hyperbus channel can be established with the new running process. If the nsx-node-agent job is already stopped, kill the orphan privsep helper processes manually. Run the command "ps -ef | grep node_agent_pri | grep -v grep" to list all the stale privsep-helper processes, then use "kill -9 $pid" to terminate them one by one.
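
    An equivalent one-shot variant of the same commands:

    # Terminate all stale privsep helper processes in a single pass.
    $ for pid in $(ps -ef | grep node_agent_pri | grep -v grep | awk '{print $2}'); do kill -9 "$pid"; done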

  • Issue 3377195: Garbage collector of LRP Controller goes into a bad state when detecting stale LRP in runtime store

    The garbage collector of LRP Controller will compare its local runtime store with the actual LRP list from BBS. If there is any stale LRP in the local store, the garbage collector will delete it from the store and also delete the related port on the NSX side. In some cases, the garbage collector will crash in the next iteration after it deletes one item from the store. The NCP log shows an error such as 'RuntimeError: dictionary changed size during iteration'.

    Workaround: Restart NCP.
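
    One way to restart NCP on Kubernetes or OpenShift, assuming the usual deployment name and namespace; on TAS or TKGI, restart the NCP job (for example, with monit) on the node where it runs:

    $ kubectl -n nsx-system rollout restart deployment nsx-ncp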

  • Issue 3380448: Policy migration rollback causes rollback of IP allocations for already migrated TKGI clusters or TAS foundations

    When multiple TKGI clusters or TAS foundations use the same external IP pool, a Policy migration rollback will cause rollback of IP allocations from the external pool for all TKGI clusters or TAS foundations that have already been migrated. The rollback impacts only the policy intent: the IP address will still be allocated in NSX and will not be available for use by other clusters. However, there might be unexpected issues while creating services of type LoadBalancer or configuring namespaces, as IP allocation could fail.

    Workaround: None.

  • Issue 3392730: NAT rules' firewall match behavior changes after migration to policy

    An issue during startup will cause an update of every SNAT rule configured by NCP immediately after completing migration to policy. As a result, the behavior of these SNAT rules with respect to gateway firewall changes from BYPASS to MATCH_INTERNAL_ADDRESS. This will cause gateway firewall rules to be evaluated on NAT traffic using the address before SNAT occurs. This issue impacts TAS deployments and TKGI clusters with dedicated per-namespace tier-1 topologies. Note that in any case, for TAS app and Kubernetes service SNAT rules, the firewall match behavior will be updated as soon as the set of containers/services for the app/service changes.

    Workaround: Add firewall rules on the tier-0 gateway firewall section to allow traffic coming from containers' IP ranges.
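
    A hedged sketch of such a rule through the NSX Policy API; the gateway policy ID, rule ID, tier-0 path, and container IP range are placeholders, and rule placement should follow your existing gateway firewall layout:

    $ curl -k -u admin -X PATCH -H 'Content-Type: application/json' \
        https://<nsx-manager>/policy/api/v1/infra/domains/default/gateway-policies/<policy-id>/rules/<rule-id> \
        -d '{"action": "ALLOW", "source_groups": ["<container CIDR>"], "destination_groups": ["ANY"], "services": ["ANY"], "scope": ["/infra/tier-0s/<tier-0-id>"]}'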

  • Issue 3396348: SNAT IP addresses for namespaces or orgs changed after policy migration

    In rare circumstances, there might be a change in the SNAT IP configured in NSX SNAT rules for Kubernetes namespaces or TAS orgs. This might cause a disruption when upstream firewalls are configured to allow only certain IP addresses.

    Workaround: Restore the previous configuration via the NSX API after shutting down all NCP instances (a hedged API sketch follows the steps below):

    1. Use IP pool APIs to allocate the desired IP address.

    2. Identify the SNAT rule via tags, note the current IP address, and update the translated_network attribute to use the desired IP address.

    3. Remove the IP allocation for the current IP address using IP Pool APIs.

    4. Restart the NCP instances.
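
    A condensed sketch of steps 1 through 3 using the NSX Policy API; the pool, allocation, tier-1, and rule IDs, as well as the IP addresses, are placeholders:

    # 1. Allocate the desired IP address from the external pool.
    $ curl -k -u admin -X PUT -H 'Content-Type: application/json' \
        https://<nsx-manager>/policy/api/v1/infra/ip-pools/<pool-id>/ip-allocations/<allocation-id> \
        -d '{"allocation_ip": "<desired-ip>"}'
    # 2. Update the SNAT rule (identified via its tags) to translate to the desired IP address.
    $ curl -k -u admin -X PATCH -H 'Content-Type: application/json' \
        https://<nsx-manager>/policy/api/v1/infra/tier-1s/<tier-1-id>/nat/USER/nat-rules/<rule-id> \
        -d '{"translated_network": "<desired-ip>"}'
    # 3. Release the allocation that holds the previous IP address.
    $ curl -k -u admin -X DELETE \
        https://<nsx-manager>/policy/api/v1/infra/ip-pools/<pool-id>/ip-allocations/<old-allocation-id>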

  • Issue 3370820: NAT firewall match option not available in TAS for NSX Policy Integration

    TAS users have no control over how the NSX gateway firewall behaves with respect to the SNAT rules that NCP creates. All SNAT rules created for orgs and apps will match internal addresses, meaning that gateway firewall rules will be evaluated on the source IP address before SNAT occurs. TAS users cannot choose "match external address" (evaluate rules after SNAT) or "bypass" (do not evaluate rules).

    Workaround: None.
