VMware NSX Container Plugin 3.2.1.8 Release Notes

VMware NSX Container Plugin 3.2.1.8 | 30 NOV 2023 | Build 22785276

Check for additions and updates to these release notes.

What's New

This is an update release that resolves issues found in earlier releases.

Deprecation Notice

The annotation ncp/whitelist-source-range will be deprecated in NCP 4.0. Starting with NCP 3.1.1, you can use the annotation "ncp/allowed-source-range" instead.

The feature that allows access via NAT to Ingress controller pods using the ncp/ingress_controller annotation is deprecated and will be removed in 2023. The recommended way to expose Ingress controller pods is to use services of type LoadBalancer.

Compatibility Requirements

Product	Version
NCP/NSX-T Tile for Tanzu Application Service (TAS)	3.2.1
NSX-T	3.1.3, 3.2, 3.2.1 (see notes below)
vSphere	6.7, 7.0
Kubernetes	1.21, 1.22, 1.23
OpenShift 4	4.7, 4.8, 4.9
OpenShift Host VM OS	RHCOS 4.7, 4.8
Kubernetes Host VM OS	Ubuntu 18.04, 20.04; CentOS 8.2; RHEL 8.4, 8.5 (see notes below)
Tanzu Application Service	Ops Manager 2.10 + TAS 2.11; Ops Manager 2.10 + TAS 2.12 (End of support date: 31 March 2023)
Tanzu Kubernetes Grid Integrated (TKGI)	1.13.6, 1.14

Notes:

The installation of the nsx-ovs kernel module on CentOS/RHEL requires a specific kernel version. The supported CentOS/RHEL kernel versions are 193, 305, and 348, regardless of the CentOS/RHEL version. Note that the default kernel version is 193 for RHEL 8.2, 305 for RHEL 8.4, and 348 for RHEL 8.5. If you are running a different kernel version, you can do one of the following: (1) Change your kernel version to one that is supported. When you change the kernel version and then restart the VM, make sure that the IP address and static routes are persisted on the uplink interface (specified by ovs_uplink_port) so that connectivity to the Kubernetes API server is not lost. (2) Skip the installation of the nsx-ovs kernel module by setting "use_nsx_ovs_kernel_module" to "False" under the "nsx_node_agent" section in the nsx-node-agent config map.
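
The kernel check described in this note can be sketched in Python. This is a minimal illustration, not part of NCP: the helper names are hypothetical, and it assumes that the versions 193, 305, and 348 refer to the build number that follows the hyphen in a release string such as 4.18.0-305.el8.x86_64.

```python
import re
from typing import Optional

# Build numbers listed as supported in the note above.
SUPPORTED_KERNEL_BUILDS = {193, 305, 348}


def kernel_build(release: str) -> Optional[int]:
    """Return the build number from a release string, or None if it cannot be parsed.

    Assumption: the string looks like '4.18.0-305.el8.x86_64'.
    """
    match = re.match(r"^\d+\.\d+\.\d+-(\d+)", release)
    return int(match.group(1)) if match else None


def nsx_ovs_module_supported(release: str) -> bool:
    """Return True if the release string carries one of the supported build numbers."""
    return kernel_build(release) in SUPPORTED_KERNEL_BUILDS
```

On a host, the release string could be obtained with platform.release() from the Python standard library. If the function returns False, the two options in the note above apply.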

To run the nsx-ovs kernel module on RHEL/CentOS, you must disable the "UEFI secure boot" option under "Boot Options" in the VM's settings in vCenter Server.

Starting with NCP 3.1.2, the RHEL image will not be distributed. For all supported integrations, use the Red Hat Universal Base Image (UBI). For more information, see https://www.redhat.com/en/blog/introducing-red-hat-universal-base-image.

TKGI 1.14.0 shipped with NCP 3.2.1.0, which does not support NSX-T 3.2.1.

TKGI 1.13.x and TKGI 1.14.x are not compatible with NSX-T 3.2.0.x.

Support for upgrading to this release:

All 3.1.x releases
All previous 3.2.x releases

Limitations

The "baseline policy" feature for NCP creates a dynamic group which selects all members in the cluster. NSX-T has a limit of 8,000 effective members of a dynamic group (for details, see Configuration Maximums). Therefore, this feature should not be enabled for clusters that are expected to grow beyond 8,000 pods. Exceeding this limit can cause delays in the creation of resources for the pods.

Resolved Issues

Issue 3283283: NCP does not create firewall rules for TAS Network Policy if there is a duplicate NsGroup found for an Application
In some cases, immediately after an NSX upgrade to version 3.2.x, changes in TAS network policies are not applied on NSX. Network Policy synchronization fails with a message such as the following:
```
nsx_ujo.common.controller PolicyController worker 0 failed to sync <policy_id> due to
 multiple object exception: Multiple AppNsGroup objects were found for 
{'app_id': <app_id>'}: [<nsg_id> (app_id: <app_id>)', '<nsg_id> (app_id: <app_id>)']
```
Workaround: Based on the error message, use NSX API to delete the NsGroup that is not associated with any firewall rule.
Issue 3218438: New container does not run because gateway is missing, logical router port is not created for a new logical switch

In Manager mode, a TAS application instance will not start for a specific org (TKGI pods are stuck in ContainerCreating for a specific namespace). One or more logical switches for the org (namespace) will not have an uplink to any tier-1 router. In addition, there may be several logical switches with the same name for a given org (namespace).

Workaround: Use NSX API to delete all the logical switches for the org (namespace) which do not have any logical router port.

Issue 3089803: NCP constantly updates NSX firewall rule for TAS ASG

If there is a '/32' suffix in the address of an ASG rule, NCP will update the firewall rule with the same format on NSX, but NSX does not keep '/32' in the address. NCP then detects a discrepancy between the ASG rule and the NSX firewall configuration, and repeatedly updates the NSX firewall. This might cause delays in processing updates for other TAS resources. The NCP log shows messages such as the following:

2023-01-12T08:53:09.929Z 0b666fbb-b54e-493b-a3ae-4a8d4a0bfa35 NSX 15378 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="INFO"] nsx_ujo.ncp.nsx.manager.firewall_service Detected changes between FW rules ('07098149-8ea4-4de2-95e2-868277654a91', 'nsgroup') (asg_id: 5316e97f-a0c1-409b-b5be-bdbaf0d93bf4, fws_id: 07098149-8ea4-4de2-95e2-868277654a91, rules: ...

2023-01-12T08:53:43.528Z 0b666fbb-b54e-493b-a3ae-4a8d4a0bfa35 NSX 15378 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="INFO"] nsx_ujo.ncp.nsx.manager.firewall_service Updating firewall section 07098149-8ea4-4de2-95e2-868277654a91 for ASG 5316e97f-a0c1-409b-b5be-bdbaf0d93bf4

Workaround: Remove the '/32' suffix from the address in the ASG rule.
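
The workaround can be applied mechanically before an ASG rule is submitted. The following Python sketch is an illustration only: strip_host_prefix is a hypothetical helper, not an NCP or TAS function, and it assumes that addresses are plain strings such as '10.0.0.5/32' or '10.0.0.0/24'.

```python
import ipaddress


def strip_host_prefix(address: str) -> str:
    """Drop a redundant '/32' suffix from a single IPv4 address.

    Values without a prefix, and networks with any other prefix length,
    are returned unchanged.
    """
    if "/" not in address:
        return address
    network = ipaddress.ip_network(address, strict=False)
    if network.version == 4 and network.prefixlen == 32:
        return str(network.network_address)
    return address
```

With this normalization, '10.0.0.5/32' becomes '10.0.0.5', which matches the form that NSX stores, so the two configurations no longer differ.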

Issue 3242478: nsx-node-agent cannot enter its own network namespace to establish hyperbus channel

On rare occasions, when nsx-node-agent starts, it cannot enter the network namespace, so the hyperbus channel cannot be established. The nsx-node-agent log shows messages such as the following:

2023-06-29T11:48:04.645Z cba56c49-2eed-4cf7-af2f-f34cf619f00e NSX 5506 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="WARNING"] oslo.privsep.daemon privsep log:   File "/usr/local/lib/python3.8/dist-packages/nsx_ujo/agent/nsxrpc_client.py", line 49, in accept
2023-06-29T11:48:04.645Z cba56c49-2eed-4cf7-af2f-f34cf619f00e NSX 5506 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="WARNING"] oslo.privsep.daemon privsep log:     netns.setns(agent_ns, os.O_EXCL)
2023-06-29T11:48:04.645Z cba56c49-2eed-4cf7-af2f-f34cf619f00e NSX 5506 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="WARNING"] oslo.privsep.daemon privsep log:   File "/usr/local/lib/python3.8/dist-packages/pyroute2/netns/__init__.py", line 338, in setns
2023-06-29T11:48:04.645Z cba56c49-2eed-4cf7-af2f-f34cf619f00e NSX 5506 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="WARNING"] oslo.privsep.daemon privsep log:     raise OSError(ctypes.get_errno(), 'failed to open netns', netns)
2023-06-29T11:48:04.645Z cba56c49-2eed-4cf7-af2f-f34cf619f00e NSX 5506 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="WARNING"] oslo.privsep.daemon privsep log: OSError: [Errno 22] failed to open netns: '/var/run/netns/nsx-node-agent'

Workaround: Restart nsx-node-agent.

Issue 3256562: NCP updates NSX IPSet with empty IP address after restart or leader election
In Manager mode, when there is a leadership change, connectivity between pods may be broken. Running the following CLI command on the node where the NCP master instance is running shows that there is an empty IP address.
```
$ /var/vcap/jobs/ncp/bin/nsxcli -c "get ncp-store ip_set_store"
```
Workaround: Restart NCP. To avoid this issue when leader election happens, increase the HA master_timeout value in the NCP ConfigMap.
```
[ha]   
enable = True   
master_timeout = 60 
```

Known Issues

Issue 3396034: Manager-to-policy migration fails because the migration process cannot infer the Service UUID needed to migrate NCP-created load balancer pools

During a manager-to-policy migration, tags on certain NSX resources created by NCP need to be updated. This operation may require specific Kubernetes resources to exist. If these resources do not exist, migration fails and all NSX resources are rolled back to manager mode. In this case, manager-to-policy migration fails because an Ingress has rules that use a Service that does not exist in Kubernetes.

Workaround: Remove the Ingress rules that are using Kubernetes Services that no longer exist.
Issue 3239352: In a TAS environment, when a Task cannot be allocated, retry may not work

In an NCP TAS environment, when a Task cannot be allocated the Auctioneer rejects the task and the BBS retries placement of the task up to the number of times specified by the setting task.max_retries. When task.max_retries is reached, the BBS updates the Task from the PENDING state to the COMPLETED state, marking it as Failed and including a FailureReason that explains that the cluster has no capacity for the task.

During retry, the task may be scheduled to a new cell which notifies NCP with a task_changed event. Since NCP does not handle the task_changed event the task cannot be assigned a new port in the new cell. The task cannot run properly.

Workaround: Disable the retry and set the task.max_retries value to 0.
Issue 2131494: NGINX Kubernetes Ingress still works after changing the Ingress class from nginx to nsx

When you create an NGINX Kubernetes Ingress, NGINX creates traffic forwarding rules. If you change the Ingress class to any other value, NGINX does not delete the rules and continues to apply them, even if you delete the Kubernetes Ingress after changing the class. This is a limitation of NGINX.

Workaround: To delete the rules created by NGINX, delete the Kubernetes Ingress when the class value is nginx. Then re-create the Kubernetes Ingress.
Issue 3033821: After manager-to-policy migration, distributed firewall rules not enforced correctly

After a manager-to-policy migration, newly created network policy-related distributed firewall (DFW) rules will have higher priority than the migrated DFW rules.

Workaround: Use the policy API to change the sequence of DFW rules as needed.
For a Kubernetes service of type ClusterIP, the hairpin-mode flag is not supported

NCP does not support the hairpin-mode flag for a Kubernetes service of type ClusterIP.

Workaround: None
Issue 2224218: After a service or app is deleted, it takes 2 minutes to release the SNAT IP back to the IP pool

If you delete a service or app and recreate it within 2 minutes, it will get a new SNAT IP from the IP pool.

Workaround: After deleting a service or app, wait 2 minutes before recreating it if you want to reuse the same IP.
Issue 2404302: If multiple load balancer application profiles for the same resource type (for example, HTTP) exist on NSX-T, NCP will choose any one of them to attach to the Virtual Servers.

If multiple HTTP load balancer application profiles exist on NSX-T, NCP will choose any one of them with the appropriate x_forwarded_for configuration to attach to the HTTP and HTTPS Virtual Server. If multiple FastTCP and UDP application profiles exist on NSX-T, NCP will choose any one of them to attach to the TCP and UDP Virtual Servers, respectively. The load balancer application profiles might have been created by different applications with different settings. If NCP chooses to attach one of these load balancer application profiles to the NCP-created Virtual Servers, it might break the workflow of other applications.

Workaround: None
Issue 2518111: NCP fails to delete NSX-T resources that have been updated from NSX-T

NCP creates NSX-T resources based on the configurations that you specify. If you make any updates to those NSX-T resources through NSX Manager or the NSX-T API, NCP might fail to delete those resources and re-create them when it is necessary to do so.

Workaround: Do not update NSX-T resources created by NCP through NSX Manager or the NSX-T API.
Issue 2416376: NCP fails to process a TAS ASG (App Security Group) that binds to more than 128 Spaces

Because of a limit in NSX-T distributed firewall, NCP cannot process a TAS ASG that binds to more than 128 Spaces.

Workaround: Create multiple ASGs and bind each of them to no more than 128 Spaces.
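
The workaround amounts to partitioning the list of Spaces into groups of at most 128 and binding one ASG to each group. A minimal Python sketch, assuming the Spaces are identified by strings; split_spaces is a hypothetical helper and does not call any TAS API.

```python
from typing import List, Sequence

# Limit stated in the issue description above.
MAX_SPACES_PER_ASG = 128


def split_spaces(space_ids: Sequence[str], limit: int = MAX_SPACES_PER_ASG) -> List[List[str]]:
    """Partition space identifiers into consecutive groups of at most `limit` entries."""
    return [list(space_ids[i:i + limit]) for i in range(0, len(space_ids), limit)]
```

For example, 300 Spaces would be split into three groups, so three ASGs with identical rules would be needed.
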
Issue 2537221: After upgrading NSX-T to 3.0, the networking status of container-related objects in the NSX Manager UI is shown as Unknown

In NSX Manager UI, the tab Inventory > Containers shows container-related objects and their status. In a TKGI environment, after upgrading NSX-T to 3.0, the networking status of the container-related objects is shown as Unknown. The issue is caused by the fact that TKGI does not detect the version change of NSX-T. This issue does not occur if NCP is running as a pod and the liveness probe is active.

Workaround: After the NSX-T upgrade, restart the NCP instances gradually (no more than 10 at the same time) so as not to overload NSX Manager.
Issue 2552564: In an OpenShift 4.3 environment, DNS forwarder might stop working if overlapping address found

In an OpenShift 4.3 environment, cluster installation requires that a DNS server be configured. If you use NSX-T to configure a DNS forwarder and there is IP address overlap with the DNS service, the DNS forwarder will stop working and cluster installation will fail.

Workaround: Configure an external DNS service, delete the cluster that failed to install and recreate the cluster.
Issue 2597423: When importing manager objects to policy, a rollback will cause the tags of some resources to be lost
When importing manager objects to policy, if a rollback is necessary, the tags of the following objects will not be restored:
- Spoofguard profiles (part of shared and cluster resources)
- BgpneighbourConfig (part of shared resources)
- BgpRoutingConfig (part of shared resources)
- StaticRoute BfdPeer (part of shared resources)
Workaround: For resources that are part of the shared resources, manually restore the tags. Use the backup and restore feature to restore resources that are part of cluster resources.
Issue 2579968: When changes are made to Kubernetes services of type LoadBalancer at a high frequency, some virtual servers and server pools are not deleted as expected

When changes are made to Kubernetes services of type LoadBalancer at a high frequency, some virtual servers and server pools might remain in the NSX-T environment when they should be deleted.

Workaround: Restart NCP. Alternatively, manually remove stale virtual servers and their associated resources. A virtual server is stale if no Kubernetes service of type LoadBalancer has the virtual server's identifier in the external_id tag.
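
The staleness rule can be expressed as a simple filter. The Python sketch below rests on two assumptions that are not confirmed by these notes: that each virtual server has been reduced to a dictionary with an "external_id" key holding the value of its external_id tag, and that this value is compared against the identifiers of the current Kubernetes services of type LoadBalancer. The function name and data layout are hypothetical.

```python
from typing import Dict, Iterable, List


def find_stale_virtual_servers(
    virtual_servers: Iterable[Dict[str, str]],
    lb_service_ids: Iterable[str],
) -> List[Dict[str, str]]:
    """Return the virtual servers whose external_id matches no current LoadBalancer service."""
    live = set(lb_service_ids)
    return [vs for vs in virtual_servers if vs.get("external_id") not in live]
```

The result is only a candidate list. Review it before deleting anything from NSX-T.
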
NCP fails to start when "logging to file" is enabled during Kubernetes installation

This issue happens when uid:gid=1000:1000 on the container host does not have permission to access the log folder.
Workaround: Do one of the following:
- Change the mode of the log folder to 777 on the container hosts.
- Grant “rwx” permission of the log folder to uid:gid=1000:1000 on the container hosts.
- Disable the “logging to file” feature.
Issue 2653214: Error while searching the segment port for a node after the node's IP address was changed

After changing a node's IP address, if you upgrade NCP or if the NCP operator pod is restarted, checking the NCP operator status with the command "oc describe co nsx-ncp" will show the error message "Error while searching segment port for node ..."

Workaround: None. Adding a static IP address on a node interface which also has DHCP configuration is not supported.
Issue 2672677: In a highly stressed OpenShift 4 environment, a node can become unresponsive

In an OpenShift 4 environment with a high level of pod density per node and a high frequency of pods getting deleted and created, a RHCOS node might go into a "Not Ready" state. Pods running on the affected node, with the exception of daemonset members, will be evicted and recreated on other nodes in the environment.

Workaround: Reboot the impacted node.
Issue 2707174: A Pod that is deleted and recreated with the same namespace and name has no network connectivity

If a Pod is deleted and recreated with the same namespace and name when NCP is not running and nsx-ncp-agents are running, the Pod might get wrong network configurations and not be able to access the network.

Workaround: Delete the Pod and recreate it when NCP is running.
Issue 2745904: The feature "Use IPSet for default running ASG" does not support removing or replacing an existing container IP block

If you enable "Use IPSet for default running ASG" on an NCP tile, NCP will create a dedicated NSGroup for all the container IP blocks configured by "IP Blocks of Container Networks" on the same NCP tile. This NSGroup will be used in the firewall rules created for global running ASGs to allow traffic for all the containers. If you later remove or replace an existing container IP block, it will be removed or replaced in the NSGroup. All the existing containers in the original IP block will no longer be associated with the global running ASGs. Their traffic might no longer work.

Workaround: Only append new IP blocks to "IP Blocks of Container Networks".
Issue 2745907: "monit" commands return incorrect status information for nsx-node-agent

On a diego_cell VM, when monit restarts nsx-node-agent, if it takes more than 30 seconds for nsx-node-agent to fully start, monit will show the status of nsx-node-agent as "Execution failed" and will not update its status to "running" even when nsx-node-agent is fully functional later.

Workaround: None.
Issue 2735244: nsx-node-agent and nsx-kube-proxy crash because of liveness probe failure

nsx-node-agent and nsx-kube-proxy use sudo to run some commands. If there are many entries in /etc/resolv.conf about DNS server and search domains, sudo can take a long time to resolve hostnames. This will cause nsx-node-agent and nsx-kube-proxy to be blocked by the sudo command for a long time, and liveness probe will fail.
Workaround: Perform one of the two following actions:
- Add hostname entries to /etc/hosts. For example, if hostname is 'host1', add the entry '127.0.0.1 host1'.
- Set a larger value for the nsx-node-agent liveness probe timeout. Run the command 'kubectl edit ds nsx-node-agent -n nsx-system' to update the timeout value for both the nsx-node-agent and nsx-kube-proxy containers.
Issue 2736412: Parameter members_per_small_lbs is ignored if max_allowed_virtual_servers is set

If both max_allowed_virtual_servers and members_per_small_lbs are set, virtual servers may fail to attach to an available load balancer because only max_allowed_virtual_servers is taken into account.

Workaround: Relax the scale constraints instead of enabling auto scaling.
Issue 2740552: When deleting a static pod using api-server, nsx-node-agent does not remove the pod's OVS bridge port, and the network of the static pod which is re-created automatically by Kubernetes is unavailable

Kubernetes does not allow removing a static pod through api-server. Kubernetes creates a mirror pod of the static pod so that the static pod can be found through api-server. When the pod is deleted through api-server, only the mirror pod is deleted, and NCP receives and handles the delete request to remove all NSX resources allocated for the pod. However, the static pod still exists, and nsx-node-agent does not get the delete request from CNI to remove the OVS bridge port of the static pod.

Workaround: Remove the static pod by deleting the manifest file instead of removing the static pod by api-server.
Issue 2795482: Running pod stuck in ContainerCreating state after node/hypervisor reboot or any other operation

If the wait_for_security_policy_sync flag is true, a pod can go to ContainerCreating state after being in running state for more than one hour because of a worker node hard reboot, hypervisor reboot, or some other reason. The pod will be in the creating state forever.

Workaround: Delete and recreate the pod.
Issue 2860091: DNS traffic fails if baseline_policy_type is set to allow_namespace

In an OpenShift or Kubernetes environment, if baseline_policy_type is set to allow_namespace, it will block pods (hostNetwork: False) in other namespaces from accessing the DNS service.

Workaround: Add a network policy rule to allow traffic from other pods to the DNS pods.
Issue 2841030: With Kubernetes 1.22, the status of nsx-node-agent is always 'AppArmor'

With Kubernetes 1.22, when the nsx-node-agent pods are "Ready", their status is not updated from "AppArmor" to "Running". This does not impact the functionality of NCP or nsx-node-agent.

Workaround: Restart the nsx-node-agent pods.
Issue 2824129: A node has the status network-unavailable equal to true for more than 3 minutes after a restart

If you use NCP operator to manage NCP's lifecycle, when an nsx-node-agent daemonset recovers from a non-running state, its node will have the status network-unavailable equal to true until it has been running for 3 minutes. This is expected behavior.

Workaround: Wait for at least 3 minutes after nsx-node-agent restarts.
Issue 2867361: nsx-node-agent and hyperbus alarms not removed after NCP cleanup

If nsx-node-agent and hyperbus alarms appear for some reason (such as stopping all NSX node agents), and you stop NCP and run the cleanup script, the alarms will remain after the cleanup.

Workaround: None
Issue 2868572: Open vSwitch (OVS) must be disabled on host VM before running NCP
To deploy NCP on a host VM, you must first stop OVS-related processes and delete some files on the host using the following commands:
1. sudo systemctl disable openvswitch-switch.service
2. sudo systemctl stop openvswitch-switch.service
3. rm -rf /var/run/openvswitch
If you have already deployed NCP on a host VM, and OVS is not running correctly, perform the following steps to recover:
1. Perform the above 3 steps.
2. Delete nsx-node-agent pods on the nodes having the issue to restart the node agent pods with the command "kubectl delete pod $agent-pod -n nsx-system".
Workaround: See above.
Issue 2832480: For a Kubernetes service of type ClusterIP, sessionAffinityConfig.clientIP.timeoutSeconds cannot exceed 65535

For a Kubernetes service of type ClusterIP, if you set sessionAffinityConfig.clientIP.timeoutSeconds to a value greater than 65535, the actual value will be 65535.

Workaround: None
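
The capping behavior described above is equivalent to the following Python sketch. The function name is hypothetical and the code only models the documented outcome; it does not reflect how NCP implements it.

```python
# Upper bound stated in the issue description above.
MAX_EFFECTIVE_TIMEOUT_SECONDS = 65535


def effective_timeout(requested_seconds: int) -> int:
    """Return the timeout that takes effect for a requested timeoutSeconds value."""
    return min(requested_seconds, MAX_EFFECTIVE_TIMEOUT_SECONDS)
```

A requested value of 10800 is unchanged, while a requested value of 86400 takes effect as 65535.
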
Issue 2940772: Migrating NCP resources from Manager to Policy results in failure with NSX-T 3.2.0

Migrating NCP resources from Manager to Policy is supported with NSX-T 3.1.3 and NSX-T 3.2.1, but not NSX-T 3.2.0.

Workaround: None
Issue 2934195: Some types of NSX groups are not supported for distributed firewall rules

An NSX group of type "IP Addresses Only" is not supported for distributed firewall (DFW) rules. An NSX group of type "Generic" with manually added IP addresses as members is also not supported.

Workaround: None
Issue 2936436: NSX Manager UI does not show the NCP version on the container cluster page

When NSX Manager UI displays the container clusters in the inventory tab, the NCP version is not displayed.

Workaround: The NCP version is available by calling the API /policy/api/v1/fabric/container-clusters.
Issue 2939886: Migrating objects from Manager Mode to Policy Mode fails

Migrating objects from Manager Mode to Policy Mode fails if, in the network policy specification, egress and ingress have the same selector.

Workaround: None
Issue 2923436: Long Kubernetes resource name causes failure
If a Kubernetes resource name is too long, the corresponding NSX resource cannot be created because the NSX resource name will exceed the limits for display names in NSX. The log will show an error message such as "Field level validation errors: {display_name ipp-k8scl-two-aaaaaa... has exceeded its maximum valid length 255 characters}". NSX has the following limits:
- segment display name: 80 characters
- group name + domain name: 245 characters
- other NSX resources display name: 255 characters
Workaround: Make the Kubernetes resource name shorter.
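
The limits listed above can be checked before a resource is created. The Python sketch below is an illustration with hypothetical names. It assumes that you already know the final NSX display name, which may differ from the Kubernetes resource name because NCP derives it; these notes do not describe how.

```python
# Limits listed in the issue description above.
SEGMENT_NAME_LIMIT = 80
GROUP_PLUS_DOMAIN_LIMIT = 245
DEFAULT_NAME_LIMIT = 255


def display_name_fits(kind: str, name: str, domain: str = "") -> bool:
    """Return True if an NSX display name is within the limit for its resource kind.

    kind is "segment", "group", or any other string for the remaining resources.
    For groups, the combined length of the group name and domain name is checked.
    """
    if kind == "segment":
        return len(name) <= SEGMENT_NAME_LIMIT
    if kind == "group":
        return len(name) + len(domain) <= GROUP_PLUS_DOMAIN_LIMIT
    return len(name) <= DEFAULT_NAME_LIMIT
```

A name that fails this check would be expected to trigger a field level validation error like the one quoted above.
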
Issue 2961789: After migrating manager objects to policy, some of the health-check pod's related resources cannot be deleted

After migrating manager objects to policy, when you delete the health-check pod, the pod's related segment port and the distributed firewall rule's target group are not deleted.

Workaround: Manually delete those resources.
Issue 2966586: After migrating manager objects to policy, namespace creation fails

If an IP block is created in manager mode, after manager objects are migrated to policy, namespace creation fails because NCP cannot allocate subnets from this IP block.

Workaround: Create new IP blocks in policy mode and configure NCP to use these new IP blocks.
Issue 2972811: In a large-scale environment, the hyperbus connection to some worker nodes is down
In a large-scale environment, pod creation can get stuck for 10-15 minutes because of an RPC channel timeout. The following issues may occur:
- In a Kubernetes cluster, some pods will have the status ContainerCreating for 10-15 minutes.
- In cfgAgent, the tunnel will have the status COMMUNICATION_ERROR for 10-15 minutes.
- In the NSX UI, an alarm may be generated that indicates the hyperbus connection is down.
Workaround: None needed. This issue will automatically recover after 10-15 minutes.
Issue 2960121: For services of type LoadBalancer, connectivity to pods on Windows worker nodes fails if not configured correctly

For services of type LoadBalancer, connectivity to pods on Windows worker nodes will fail if NCP is configured to use the default LB segment subnet. The default subnet 169.254.128.0/22 belongs to the IPv4 link-local space and is not forwarded on a Windows node.

Workaround: Configure NCP to use a non-default LB segment subnet. To do this, set the parameter lb_segment_subnet in the nsx_v3 section. Note that this will only have effect on newly created NSX load balancers.
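
Before you set lb_segment_subnet, you can confirm that a candidate subnet is outside the link-local space. A minimal Python sketch using the standard ipaddress module; the function name is hypothetical and 192.168.100.0/22 is only an example value.

```python
import ipaddress


def is_link_local_subnet(cidr: str) -> bool:
    """Return True if the whole subnet lies in a link-local range."""
    return ipaddress.ip_network(cidr).is_link_local
```

The default subnet 169.254.128.0/22 is reported as link-local, while a subnet such as 192.168.100.0/22 is not.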