VMware NSX Container Plugin 4.1.0 | 28 FEB 2023 | Build 21302845
Check for additions and updates to these release notes.
Support for OpenShift 4.11 and various bug fixes.
The feature that allows access via NAT to Ingress controller pods using the "ncp/ingress_controller" annotation is deprecated and will be removed in 2023. The recommended way to expose Ingress controller pods is to use services of type LoadBalancer.
NCP/NSX-T Tile for Tanzu Application Service (TAS)
NSX-T 3.2.2, 3.2.3. NSX 126.96.36.199, 188.8.131.52, 4.1. (See notes below.)
6.7, 7.0, 184.108.40.206
1.24, 1.25, 1.26
4.8, 4.9, 4.10, 4.11
Kubernetes Host VM OS
Ubuntu 20.04 (only kernel version 5.15 or earlier supported)
Ubuntu 22.04 (only kernel version 5.15 or earlier supported, only upstream OVS kernel module supported)
RHEL 8.4, 8.5, 8.6
See notes below.
Tanzu Application Service
Ops Manager 2.10 + TAS 2.11 (LTS)
Ops Manager 3.0 + TAS 2.11 (LTS)
Ops Manager 2.10 + TAS 2.13
Ops Manager 3.0 + TAS 2.13
Ops Manager 2.10 + TAS 3.0 (End of support date: 31 October 2023)
Ops Manager 3.0 + TAS 3.0 (End of support date: 31 October 2023)
Tanzu Kubernetes Grid Integrated (TKGI)
NSX-T 3.2.3 is supported with basic sanity testing coverage. See Product Interoperability Matrix.
The installation of the nsx-ovs kernel module on RHEL requires a specific kernel version. The supported RHEL kernel versions are 193, 305, 348, and 372, regardless of the RHEL version. Note that the default kernel version is 193 for RHEL 8.2, 305 for RHEL 8.4, 348 for RHEL 8.5, and 372 for RHEL 8.6. If you are running a different kernel version, you can (1) Modify your kernel version to one that is supported. When modifying the kernel version and then restarting the VM, make sure that the IP and static routes are persisted on the uplink interface (specified by ovs_uplink_port) to guarantee that connectivity to the Kubernetes API server is not lost. Or (2) Skip the installation of the nsx-ovs kernel module by setting "use_nsx_ovs_kernel_module" to "False" under the "nsx_node_agent" section in the nsx-node-agent config map. For information about switching between NSX-OVS and upstream OVS kernel modules, see https://docs.vmware.com/en/VMware-NSX-Container-Plugin/4.1/ncp-kubernetes/GUID-7225DDCB-88CB-4A2D-83A3-74BB9ED7DCFF.html.
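As a rough pre-check, the kernel rule above can be sketched in Python. The sketch assumes that the kernel release string has the usual RHEL 8 form, such as 4.18.0-305.el8.x86_64, and that the number right after the first hyphen is the "kernel version" these notes refer to. The function name is hypothetical and is not part of NCP.

```python
import re

# Kernel build numbers listed in these notes as supported for nsx-ovs on RHEL.
SUPPORTED_RHEL_KERNEL_BUILDS = {193, 305, 348, 372}


def nsx_ovs_kernel_supported(kernel_release: str) -> bool:
    """Return True if the release string carries a supported build number.

    Assumes a RHEL-style release such as '4.18.0-305.el8.x86_64', where the
    digits right after the first hyphen are the build number.
    """
    match = re.match(r"^\d+\.\d+\.\d+-(\d+)", kernel_release)
    if match is None:
        # Not a release string this sketch understands; treat as unsupported.
        return False
    return int(match.group(1)) in SUPPORTED_RHEL_KERNEL_BUILDS
```

On a node, the release string could be taken from platform.release(). If the check returns False, the two options described above apply: change the kernel version, or set use_nsx_ovs_kernel_module to False.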
To run the nsx-ovs kernel module on RHEL, you must disable the "UEFI secure boot" option under "Boot Options" in the VM's settings in vCenter Server.
For all supported integrations, use the Red Hat Universal Base Image (UBI). For more information, see https://www.redhat.com/en/blog/introducing-red-hat-universal-base-image.
Support for upgrading to this release:
All previous 4.0.x releases
The "baseline policy" feature for NCP creates a dynamic group which selects all members in the cluster. NSX-T has a limit of 8,000 effective members of a dynamic group (for details, see Configuration Maximums). Therefore, this feature should not be enabled for clusters that are expected to grow beyond 8,000 pods. Exceeding this limit can cause delays in the creation of resources for the pods.
Transparent mode load balancer
Only north-south traffic for a Kubernetes cluster is supported. Intra-cluster traffic is not supported.
Not supported for services attached to a LoadBalancer CRD or when auto scaling is enabled. Auto scaling must be disabled for this feature to work.
It is recommended to use this feature only on newly deployed clusters.
It is not possible to migrate a Kubernetes cluster if a previous migration failed and the cluster is rolled back. This is a limitation with NSX 220.127.116.11 or earlier releases only.
Issue 3158230: nsx-ncp-bootstrap container fails to initialize while loading AppArmor profiles on Ubuntu 20.04
The nsx-ncp-bootstrap container in nsx-ncp-bootstrap DaemonSet fails to initialize because of different package versions of AppArmor on the host OS and the container image. The logs of the container show messages such as "Failed to load policy-features from '/etc/apparmor.d/abi/2.13': No such file or directory".
Workaround: Update AppArmor to version 2.13.3-7ubuntu5.2 or the latest available from focal-updates on the host OS.
Issue 3239352: In a TAS environment, when a Task cannot be allocated, retry may not work
In an NCP TAS environment, when a Task cannot be allocated, the Auctioneer rejects the task and the BBS retries placement of the task up to the number of times specified by the setting task.max_retries. When task.max_retries is reached, the BBS updates the Task from the PENDING state to the COMPLETED state, marking it as Failed and including a FailureReason that explains that the cluster has no capacity for the task.
During retry, the task may be scheduled to a new cell, which notifies NCP with a task_changed event. Because NCP does not handle the task_changed event, the task cannot be assigned a new port in the new cell. As a result, the task cannot run properly.
Workaround: Disable the retry and set the task.max_retries value to 0.
Issue 3161931: nsx-ncp-bootstrap pod fails to run on Ubuntu 18.04 and Ubuntu 20.04 host VMs
The nsx-ncp-bootstrap container in the nsx-ncp-bootstrap pod fails to reload "AppArmor" with the following log message: "Failed to load policy-features from '/etc/apparmor.d/abi/2.13': No such file or directory." The issue occurs because the version of the "AppArmor" package in the image used to run the nsx-ncp-bootstrap pod differs from the version installed on the host OS. This issue does not exist on Ubuntu 22.04 host VMs.
Workaround: Ubuntu 18.04 is not supported. On Ubuntu 20.04, update "AppArmor" to the minimum version 2.13.3-7ubuntu5.2. The package is available via focal-updates.
Issue 3113985: When migrating a single-tier1 topology, not all static routes are migrated
In a single-tier1 topology with multiple custom resources of type loadbalancers.vmware.com, some static routes created by NCP in Manager mode for the load balancers are not migrated.
Workaround: After deleting the custom resource of type loadbalancers.vmware.com from Kubernetes, manually delete the static route with Manager API. The static route will have the UID of the custom resource in its tags with scope "ncp/crd_lb_uid".
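To help find the leftover route, the tag match described in this workaround can be sketched as follows. The sketch assumes that the static routes have already been retrieved with the Manager API and parsed into Python dictionaries, each with a "tags" list of {"scope": ..., "tag": ...} entries. The function name and the sample values in the usage note are hypothetical.

```python
CRD_LB_UID_SCOPE = "ncp/crd_lb_uid"  # tag scope named in the workaround


def routes_for_deleted_lb(static_routes, crd_uid):
    """Return the static routes tagged with the given custom resource UID.

    static_routes: list of dicts already fetched from NSX (assumed shape).
    crd_uid: UID of the deleted loadbalancers.vmware.com custom resource.
    """
    matches = []
    for route in static_routes:
        for tag in route.get("tags", []):
            if tag.get("scope") == CRD_LB_UID_SCOPE and tag.get("tag") == crd_uid:
                matches.append(route)
                break  # one matching tag is enough for this route
    return matches
```

For example, routes_for_deleted_lb(routes, "1234-abcd") returns only the routes whose "ncp/crd_lb_uid" tag equals "1234-abcd". Deleting them remains a manual step.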
Issue 3049209: After manager-to-policy migration, deleting clusters does not delete mp_default_LR_xxx_user_rules resource
After performing a manager-to-policy migration, and then deleting clusters, some "GatewayPolicy" resources named mp_default_LR_xxxx_user_rules may not get deleted.
Workaround: Delete the resources manually.
Issue 3043496: NCP stops running if Manager-to-Policy migration fails
NCP provides the migrate-mp2p job to migrate NSX resources used by NCP and TKGI. If migration fails, all migrated resources are rolled back but NCP is not restarted in Manager mode.
Workaround: Make sure that all resources were rolled back. This can be done by checking the logs of the migrate-mp2p job. The logs must end with the line "All imported MP resources to Policy completely rolled back."
If all resources were rolled back, ssh into each master node and run the command "sudo /var/vcap/bosh/bin/monit start ncp".
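The log check in the steps above can be sketched as a small helper. It assumes that the migrate-mp2p job log has already been captured as text, and that each log line may carry a prefix such as a timestamp, so it looks for the message inside the last non-empty line. The function name is hypothetical.

```python
ROLLBACK_MARKER = "All imported MP resources to Policy completely rolled back."


def rollback_completed(log_text: str) -> bool:
    """Return True if the last non-empty log line contains the rollback marker."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    return bool(lines) and ROLLBACK_MARKER in lines[-1]
```

Only when this check passes does it make sense to continue with the monit command on each master node.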
Issue 3055618: When creating multiple Windows pods on a node simultaneously, some pods do not have a network adapter
When applying a yaml file to create multiple Windows pods on the same node, some pods do not have a network adapter.
Workaround: Restart the pods.
Issue 2131494: NGINX Kubernetes Ingress still works after changing the Ingress class from nginx to nsx
When you create an NGINX Kubernetes Ingress, NGINX creates traffic forwarding rules. If you change the Ingress class to any other value, NGINX does not delete the rules and continues to apply them, even if you delete the Kubernetes Ingress after changing the class. This is a limitation of NGINX.
Workaround: To delete the rules created by NGINX, delete the Kubernetes Ingress when the class value is nginx. Then re-create the Kubernetes Ingress.
Issue 2999131: ClusterIP services not reachable from the pods
In a large-scale TKGI environment, ClusterIP services are not reachable from the pods. Related symptoms are: (1) nsx-kube-proxy stops writing its logs; and (2) the OVS flows are not created on the node.
Workaround: Restart nsx-kube-proxy.
Issue 2984240: The "NotIn" operator in matchExpressions does not work in namespaceSelector for a network policy's rule
When specifying a rule for a network policy, if you specify namespaceSelector, matchExpressions and the "NotIn" operator, the rule does not work. The NCP log has the error message "NotIn operator is not supported in NS selectors."
Workaround: Rewrite matchExpressions to avoid using the "NotIn" operator.
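To locate the rules that need rewriting, a network policy manifest can be scanned for the unsupported combination. The sketch below assumes the manifest has already been loaded into a Python dictionary (for example, from YAML) and follows the standard NetworkPolicy layout, where ingress peers are under "from" and egress peers are under "to". The function name is hypothetical.

```python
def find_notin_namespace_selectors(network_policy: dict) -> list:
    """Return (direction, rule_index) for each NotIn used in a namespaceSelector."""
    findings = []
    spec = network_policy.get("spec", {})
    for direction, peer_key in (("ingress", "from"), ("egress", "to")):
        for index, rule in enumerate(spec.get(direction) or []):
            for peer in rule.get(peer_key) or []:
                selector = peer.get("namespaceSelector") or {}
                for expr in selector.get("matchExpressions") or []:
                    if expr.get("operator") == "NotIn":
                        findings.append((direction, index))
    return findings
```

An empty result means that no namespaceSelector in the policy uses the "NotIn" operator. Each reported pair points at a rule to rewrite.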
Issue 2997828: Migration of cluster from Manager mode to Policy mode fails if Ingress has more than 255 rules
In Policy mode, an NSX load balancer can support a maximum of 255 rules. If a cluster has an Ingress resource that has more than 255 rules, migrating the cluster from Manager mode to Policy mode will fail.
Workaround: Create LoadBalancer CRDs to distribute the rules across multiple NSX load balancers.
Issue 3033821: After manager-to-policy migration, distributed firewall rules not enforced correctly
After a manager-to-policy migration, newly created network policy-related distributed firewall (DFW) rules will have higher priority than the migrated DFW rules.
Workaround: Use the policy API to change the sequence of DFW rules as needed.
For a Kubernetes service of type ClusterIP, the hairpin-mode flag is not supported
NCP does not support the hairpin-mode flag for a Kubernetes service of type ClusterIP.
Issue 2224218: After a service or app is deleted, it takes 2 minutes to release the SNAT IP back to the IP pool
If you delete a service or app and recreate it within 2 minutes, it will get a new SNAT IP from the IP pool.
Workaround: After deleting a service or app, wait 2 minutes before recreating it if you want to reuse the same IP.
Issue 2404302: If multiple load balancer application profiles for the same resource type (for example, HTTP) exist on NSX-T, NCP will choose any one of them to attach to the Virtual Servers.
If multiple HTTP load balancer application profiles exist on NSX-T, NCP will choose any one of them with the appropriate x_forwarded_for configuration to attach to the HTTP and HTTPS Virtual Server. If multiple FastTCP and UDP application profiles exist on NSX-T, NCP will choose any one of them to attach to the TCP and UDP Virtual Servers, respectively. The load balancer application profiles might have been created by different applications with different settings. If NCP chooses to attach one of these load balancer application profiles to the NCP-created Virtual Servers, it might break the workflow of other applications.
Issue 2518111: NCP fails to delete NSX-T resources that have been updated from NSX-T
NCP creates NSX-T resources based on the configurations that you specify. If you make any updates to those NSX-T resources through NSX Manager or the NSX-T API, NCP might fail to delete those resources and re-create them when it is necessary to do so.
Workaround: Do not update NSX-T resources created by NCP through NSX Manager or the NSX-T API.
Issue 2416376: NCP fails to process a TAS ASG (App Security Group) that binds to more than 128 Spaces
Because of a limit in NSX-T distributed firewall, NCP cannot process a TAS ASG that binds to more than 128 Spaces.
Workaround: Create multiple ASGs and bind each of them to no more than 128 Spaces.
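The split described in this workaround is plain chunking. A minimal sketch, assuming the Spaces are available as a Python list of identifiers; the function name is hypothetical and the sketch does not create any ASGs itself.

```python
MAX_SPACES_PER_ASG = 128  # limit stated in these notes


def split_spaces(spaces, limit=MAX_SPACES_PER_ASG):
    """Split a list of Space identifiers into groups of at most `limit`."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [spaces[i:i + limit] for i in range(0, len(spaces), limit)]
```

Each returned group can then be bound to its own ASG. For example, 300 Spaces produce three groups.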
Issue 2537221: After upgrading NSX-T to 3.0, the networking status of container-related objects in the NSX Manager UI is shown as Unknown
In NSX Manager UI, the tab Inventory > Containers shows container-related objects and their status. In a TKGI environment, after upgrading NSX-T to 3.0, the networking status of the container-related objects is shown as Unknown. The issue is caused by the fact that TKGI does not detect the version change of NSX-T. This issue does not occur if NCP is running as a pod and the liveness probe is active.
Workaround: After the NSX-T upgrade, restart the NCP instances gradually (no more than 10 at the same time) so as not to overload NSX Manager.
Issue 2552564: In an OpenShift 4.3 environment, DNS forwarder might stop working if overlapping address found
In an OpenShift 4.3 environment, cluster installation requires that a DNS server be configured. If you use NSX-T to configure a DNS forwarder and there is IP address overlap with the DNS service, the DNS forwarder will stop working and cluster installation will fail.
Workaround: Configure an external DNS service, delete the cluster that failed to install and recreate the cluster.
Issue 2597423: When importing manager objects to policy, a rollback will cause the tags of some resources to be lost
When importing manager objects to policy, if a rollback is necessary, the tags of the following objects will not be restored:
Spoofguard profiles (part of shared and cluster resources)
BgpneighbourConfig (part of shared resources)
BgpRoutingConfig (part of shared resources)
StaticRoute BfdPeer (part of shared resources)
Workaround: For resources that are part of the shared resources, manually restore the tags. Use the backup and restore feature to restore resources that are part of cluster resources.
Issue 2579968: When changes are made to Kubernetes services of type LoadBalancer at a high frequency, some virtual servers and server pools are not deleted as expected
When changes are made to Kubernetes services of type LoadBalancer at a high frequency, some virtual servers and server pools might remain in the NSX-T environment when they should be deleted.
Workaround: Restart NCP. Alternatively, manually remove stale virtual servers and their associated resources. A virtual server is stale if no Kubernetes service of type LoadBalancer has the virtual server's identifier in the external_id tag.
NCP fails to start when "logging to file" is enabled during Kubernetes installation
This issue happens when uid:gid=1000:1000 on the container host does not have permission to the log folder.
Workaround: Do one of the following:
Change the mode of the log folder to 777 on the container hosts.
Grant “rwx” permission of the log folder to uid:gid=1000:1000 on the container hosts.
Disable the “logging to file” feature.
Issue 2653214: Error while searching the segment port for a node after the node's IP address was changed
After changing a node's IP address, if you upgrade NCP or if the NCP operator pod is restarted, checking the NCP operator status with the command "oc describe co nsx-ncp" will show the error message "Error while searching segment port for node ..."
Workaround: None. Adding a static IP address on a node interface which also has DHCP configuration is not supported.
Issue 2672677: In a highly stressed OpenShift 4 environment, a node can become unresponsive
In an OpenShift 4 environment with a high level of pod density per node and a high frequency of pods getting deleted and created, a RHCOS node might go into a "Not Ready" state. Pods running on the affected node, with the exception of daemonset members, will be evicted and recreated on other nodes in the environment.
Workaround: Reboot the impacted node.
Issue 2707174: A Pod that is deleted and recreated with the same namespace and name has no network connectivity
If a Pod is deleted and recreated with the same namespace and name when NCP is not running and nsx-ncp-agents are running, the Pod might get wrong network configurations and not be able to access the network.
Workaround: Delete the Pod and recreate it when NCP is running.
Issue 2745904: The feature "Use IPSet for default running ASG" does not support removing or replacing an existing container IP block
If you enable "Use IPSet for default running ASG" on an NCP tile, NCP will create a dedicated NSGroup for all the container IP blocks configured by "IP Blocks of Container Networks" on the same NCP tile. This NSGroup will be used in the firewall rules created for global running ASGs to allow traffic for all the containers. If you later remove or replace an existing container IP block, it will be removed or replaced in the NSGroup. All the existing containers in the original IP block will no longer be associated with the global running ASGs. Their traffic might no longer work.
Workaround: Only append new IP blocks to "IP Blocks of Container Networks".
Issue 2745907: "monit" commands return incorrect status information for nsx-node-agent
On a diego_cell VM, when monit restarts nsx-node-agent, if it takes more than 30 seconds for nsx-node-agent to fully start, monit will show the status of nsx-node-agent as "Execution failed" and will not update its status to "running" even when nsx-node-agent is fully functional later.
Issue 2735244: nsx-node-agent and nsx-kube-proxy crash because of liveness probe failure
nsx-node-agent and nsx-kube-proxy use sudo to run some commands. If there are many entries in /etc/resolv.conf about DNS server and search domains, sudo can take a long time to resolve hostnames. This will cause nsx-node-agent and nsx-kube-proxy to be blocked by the sudo command for a long time, and liveness probe will fail.
Workaround: Perform one of the two following actions:
Add hostname entries to /etc/hosts. For example, if hostname is 'host1', add the entry '127.0.0.1 host1'.
Set a larger value for the nsx-node-agent liveness probe timeout. Run the command 'kubectl edit ds nsx-node-agent -n nsx-system' to update the timeout value for both the nsx-node-agent and nsx-kube-proxy containers.
Issue 2736412: Parameter members_per_small_lbs is ignored if max_allowed_virtual_servers is set
If both max_allowed_virtual_servers and members_per_small_lbs are set, virtual servers may fail to attach to an available load balancer because only max_allowed_virtual_servers is taken into account.
Workaround: Relax the scale constraints instead of enabling auto scaling.
Issue 2740552: When deleting a static pod using api-server, nsx-node-agent does not remove the pod's OVS bridge port, and the network of the static pod which is re-created automatically by Kubernetes is unavailable
Kubernetes does not allow a static pod to be removed through the api-server. Kubernetes creates a mirror pod for each static pod so that the static pod is visible through the api-server. When you delete the pod through the api-server, only the mirror pod is deleted, and NCP receives and handles the delete request by removing all NSX resources allocated for the pod. However, the static pod still exists, and nsx-node-agent does not get a delete request from CNI to remove the static pod's OVS bridge port.
Workaround: Remove the static pod by deleting the manifest file instead of removing the static pod by api-server.
Issue 2795482: Running pod stuck in ContainerCreating state after node/hypervisor reboot or any other operation
If the wait_for_security_policy_sync flag is true, a pod can go to ContainerCreating state after being in running state for more than one hour because of a worker node hard reboot, hypervisor reboot, or some other reason. The pod will be in the creating state forever.
Workaround: Delete and recreate the pod.
Issue 2841030: With Kubernetes 1.22, the status of nsx-node-agent is always 'AppArmor'
With Kubernetes 1.22, when the nsx-node-agent pods are "Ready", their status is not updated from "AppArmor" to "Running". This does not impact the functionality of NCP or nsx-node-agent.
Workaround: Restart the nsx-node-agent pods.
Issue 2824129: A node has the status network-unavailable equal to true for more than 3 minutes after a restart
If you use NCP operator to manage NCP's lifecycle, when an nsx-node-agent daemonset recovers from a non-running state, its node will have the status network-unavailable equal to true until it has been running for 3 minutes. This is expected behavior.
Workaround: Wait for at least 3 minutes after nsx-node-agent restarts.
Issue 2868572: Open vSwitch (OVS) must be disabled on host VM before running NCP
To deploy NCP on a host VM, you must first stop OVS-related processes and delete some files on the host using the following commands:
sudo systemctl disable openvswitch-switch.service
sudo systemctl stop openvswitch-switch.service
rm -rf /var/run/openvswitch
If you have already deployed NCP on a host VM, and OVS is not running correctly, perform the following steps to recover:
Perform the above 3 steps.
Delete nsx-node-agent pods on the nodes having the issue to restart the node agent pods with the command "kubectl delete pod $agent-pod -n nsx-system".
Workaround: See above.
Issue 2832480: For a Kubernetes service of type ClusterIP, sessionAffinityConfig.clientIP.timeoutSeconds cannot exceed 65535
For a Kubernetes service of type ClusterIP, if you set sessionAffinityConfig.clientIP.timeoutSeconds to a value greater than 65535, the actual value will be 65535.
Issue 2940772: Migrating NCP resources from Manager to Policy results in failure with NSX-T 3.2.0
Migrating NCP resources from Manager to Policy is supported with NSX-T 3.1.3 and NSX-T 3.2.1, but not NSX-T 3.2.0.
Issue 2934195: Some types of NSX groups are not supported for distributed firewall rules
An NSX group of type "IP Addresses Only" is not supported for distributed firewall (DFW) rules. An NSX group of type "Generic" with manually added IP addresses as members is also not supported.
Issue 2936436: NSX Manager UI does not show the NCP version on the container cluster page
When NSX Manager UI displays the container clusters in the inventory tab, the NCP version is not displayed.
Workaround: The NCP version is available by calling the API /policy/api/v1/fabric/container-clusters.
Issue 2939886: Migrating objects from Manager Mode to Policy Mode fails
Migrating objects from Manager Mode to Policy Mode fails if, in the network policy specification, egress and ingress have the same selector.
Issue 2961789: After migrating manager objects to policy, some of the health-check pod's related resources cannot be deleted
After migrating manager objects to policy, when you delete the health-check pod, the pod's related segment port and the distributed firewall rule's target group are not deleted.
Workaround: Manually delete those resources.
Issue 2966586: After migrating manager objects to policy, namespace creation fails
If an IP block is created in manager mode, after manager objects are migrated to policy, namespace creation fails because NCP cannot allocate subnets from this IP block.
Workaround: Create new IP blocks in policy mode and configure NCP to use these new IP blocks.
Issue 2972811: In a large-scale environment, the hyperbus connection to some worker nodes is down
In a large-scale environment, pod creation can get stuck for 10-15 minutes due to rpc channel timeout. The following issues may occur:
In a Kubernetes cluster, some pods will have the status ContainerCreating for 10-15 minutes.
In cfgAgent, the tunnel will have the status COMMUNICATION_ERROR for 10-15 minutes.
In NSX UI, an alarm may be generated indicating that the hyperbus connection is down.
Workaround: None needed. This issue will automatically recover after 10-15 minutes.
Issue 2960121: For services of type LoadBalancer connectivity to pods on windows worker nodes fails if not configured correctly
For services of type LoadBalancer connectivity to pods on Windows worker nodes will fail if NCP is configured to use the default LB segment subnet. The default subnet 169.254.128.0/22 belongs to the IPv4 link-local space and is not forwarded on a Windows node.
Workaround: Configure NCP to use a non-default LB segment subnet. To do this, set the parameter lb_segment_subnet in the nsx_v3 section. Note that this will only have effect on newly created NSX load balancers.
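Before setting lb_segment_subnet, a candidate value can be checked against the IPv4 link-local range with the Python standard library. This is only a sketch: the function name is hypothetical, and the subnet in the usage note is an arbitrary illustration, not a recommended value.

```python
import ipaddress

# IPv4 link-local range; the default LB segment subnet 169.254.128.0/22 is inside it.
LINK_LOCAL_V4 = ipaddress.ip_network("169.254.0.0/16")


def is_outside_link_local(lb_segment_subnet: str) -> bool:
    """Return True if the IPv4 subnet does not overlap the link-local range."""
    subnet = ipaddress.ip_network(lb_segment_subnet)
    if subnet.version != 4:
        raise ValueError("this sketch only covers IPv4 subnets")
    return not subnet.overlaps(LINK_LOCAL_V4)
```

The default value fails this check, while a subnet such as 100.64.128.0/22 passes it.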
Issue 3088138: After setting log_file in nsx-node-agent-config configmap, nsx-node-agent pods fail to start
If you set the log_file option in nsx-node-agent-config configmap and restart the nsx-ncp-bootstrap pods before the nsx-node-agent pods, the nsx-node-agent pods will fail to start and be in the CrashLoopBackOff state.
Workaround: Restart the nsx-node-agent pods before restarting the nsx-ncp-bootstrap pods after setting the log_file option in nsx-node-agent-config configmap.
Issue 3091318: Pod creation fails after updating a namespace's static subnet when NCP is down
If you create a namespace with ncp/subnets set, for example, to 18.104.22.168/29, and no pods have been created in the namespace yet, and you stop NCP and update ncp/subnets to, for example, 22.214.171.124/29, after NCP is restarted, creating a pod in the namespace may fail, with the pod stuck in the "ContainerCreating" state.
Workaround: Recreate the pod.
Issue 3082030: Deployment of application instances on a Diego Cell fails after an NSX cfgAgent service restart
In some cases, after an NSX cfgAgent service restart on the hypervisor, the nsx-node-agent instance running in a Diego Cell might reject hyperbus connections. As a result, deployment of application instances on the Diego Cell will fail. The Diego Cell will still be in HEALTHY state. NSX-node-agent logs on the Diego cell will have a log message such as "nsx_ujo.agent.agent Agent is exiting as connection is unavailable." NSX syslog on the ESXi host will have a message such as "[nsx@6876 comp="nsx-controller" subcomp="cfgAgent" s2comp="nsx-net" tid="XXXXXX" level="warn"] StreamConnection[YYY Connecting to tcp://169.254.1.XX:2345 sid:YYY] Couldn't connect to 'tcp://169.254.1.XX:2345' (error: 111-Connection refused)."
Open an SSH session to the affected Diego Cell and restart the nsx-node-agent service with the command "monit restart nsx-node-agent".
Rebuild the affected Diego Cell.
Issue 3066449: Namespace subnets are not always allocated from the first available IP block when use_ip_blocks_in_order is set to True
When creating multiple namespaces with use_ip_blocks_in_order set to True, the first namespace's subnet is sometimes not allocated from the first available IP block. For example, assume that container_ip_blocks = '126.96.36.199/28,188.8.131.52/28', the subnet prefix length is 29, and subnet 184.108.40.206/29 is already allocated. If you create 2 namespaces ns-1 and ns-2, the subnet allocation could be (1) ns-1: 220.127.116.11/29, ns-2: 18.104.22.168/29, or (2) ns-1: 22.214.171.124/29, ns-2: 126.96.36.199/29.
The use_ip_blocks_in_order parameter only ensures that different IP blocks are used in the order they appear in the container_ip_blocks parameter. When creating multiple namespaces at the same time, any namespace may request a subnet through an API call before another namespace. Therefore, there is no guarantee that a specific namespace will be allocated a subnet from a specific IP block.
Workaround: Create the namespaces separately, that is, create the first namespace, make sure its subnet has been allocated, and then create the next namespace.
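The ordering behavior described for issue 3066449 can be illustrated with a simplified model. This is not NCP's allocator. It only assumes that each request takes the first free subnet when the blocks are scanned in order. The block addresses come from a documentation range and are hypothetical, as is the function name.

```python
import ipaddress


def next_free_subnet(ip_blocks, prefix_len, allocated):
    """Return the first unallocated subnet, scanning blocks in list order."""
    for block in ip_blocks:
        for subnet in ipaddress.ip_network(block).subnets(new_prefix=prefix_len):
            if subnet not in allocated:
                return subnet
    return None  # every block is exhausted


blocks = ["192.0.2.0/28", "192.0.2.16/28"]          # hypothetical example blocks
allocated = {ipaddress.ip_network("192.0.2.0/29")}  # one subnet already taken

# Requests are served in arrival order, not in namespace creation order.
assignments = {}
for namespace in ["ns-2", "ns-1"]:  # here the request for ns-2 arrives first
    subnet = next_free_subnet(blocks, 29, allocated)
    allocated.add(subnet)
    assignments[namespace] = str(subnet)
```

In this run, ns-2 receives the remaining subnet of the first block and ns-1 falls through to the second block. The blocks are still used in order, but which namespace lands in which block depends on arrival order, which is why the workaround creates the namespaces one at a time.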
Issue 3110833: Pods on TKGI Windows worker node cannot start, status is "ContainerCreating"
Every pod on the Windows worker node fails to start. The nsx-node-agent log on the node continuously reports "Failed to process cif config request with error [...]. Restart node agent service to recover."
Workaround: Restart the nsx-node-agent service on the node.