VMware NSX Container Plugin 4.0.1 | 08 NOV 2022 | Build 20665035
Check for additions and updates to these release notes.
Manager-to-Policy migration is now supported for TKGI clusters. This will be available with the first TKGI release that bundles NCP 4.0.1. This feature requires NSX 4.0.1 or later.
The feature that allows access via NAT to Ingress controller pods using the "ncp/ingress_controller" annotation is deprecated and will be removed in 2023. The recommended way to expose Ingress controller pods is to use services of type LoadBalancer.
NCP/NSX-T Tile for Tanzu Application Service (TAS)
NSX-T / NSX
NSX-T 3.2.2. NSX 188.8.131.52, 184.108.40.206. (See notes below.)
220.127.116.11 (with NSX 4.0.1 only)
1.23, 1.24, 1.25
4.8, 4.9, 4.10
Kubernetes Host VM OS
Ubuntu 18.04, 20.04
RHEL 8.4, 8.5, 8.6 (only upstream OVS)
See notes below.
Tanzu Application Service
Ops Manager 2.10 + TAS 2.11 (LTS)
Ops Manager 2.10 + TAS 2.13
Ops Manager 2.10 + TAS 3.0
Tanzu Kubernetes Grid Integrated (TKGI)
The installation of the nsx-ovs kernel module on RHEL requires a specific kernel version. The supported RHEL kernel versions are 193, 305, and 348, regardless of the RHEL version. Note that the default kernel version is 193 for RHEL 8.2, 305 for RHEL 8.4, and 348 for RHEL 8.5. If you are running a different kernel version, you can do one of the following:
(1) Modify your kernel version to one that is supported. When modifying the kernel version and then restarting the VM, make sure that the IP and static routes are persisted on the uplink interface (specified by ovs_uplink_port) to guarantee that connectivity to the Kubernetes API server is not lost.
(2) Skip the installation of the nsx-ovs kernel module by setting "use_nsx_ovs_kernel_module" to "False" under the "nsx_node_agent" section in the nsx-node-agent config map.
For information about switching between NSX-OVS and upstream OVS kernel modules, see https://docs.vmware.com/en/VMware-NSX-Container-Plugin/4.0/ncp-kubernetes/GUID-7225DDCB-88CB-4A2D-83A3-74BB9ED7DCFF.html.
Note: This NCP release does not support the RHEL 8.6 kernel 372 nsx-ovs kernel module.
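A node's kernel can be compared against the supported build numbers before the nsx-ovs kernel module is installed. The following is a minimal Python sketch, assuming the release string comes from "uname -r" and has the usual RHEL form (for example, 4.18.0-305.el8.x86_64). The function names are illustrative and are not part of NCP.

```python
import re
from typing import Optional

# Kernel build numbers listed in these notes as supported for nsx-ovs on RHEL.
SUPPORTED_BUILDS = {193, 305, 348}


def kernel_build(release: str) -> Optional[int]:
    """Extract the build number from a release string such as '4.18.0-305.el8.x86_64'.

    Assumes the '<major>.<minor>.<patch>-<build>' layout; returns None otherwise.
    """
    match = re.match(r"^\d+\.\d+\.\d+-(\d+)", release)
    return int(match.group(1)) if match else None


def nsx_ovs_module_supported(release: str) -> bool:
    """Return True if the kernel build is one of the supported builds."""
    return kernel_build(release) in SUPPORTED_BUILDS
```

If the check returns False, either change the kernel version or set "use_nsx_ovs_kernel_module" to "False", as described above.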
To run the nsx-ovs kernel module on RHEL, you must disable the "UEFI secure boot" option under "Boot Options" in the VM's settings in vCenter Server.
For all supported integrations, use the Red Hat Universal Base Image (UBI). For more information, see https://www.redhat.com/en/blog/introducing-red-hat-universal-base-image.
Support for upgrading to this release:
All previous 3.2.x releases
The "baseline policy" feature for NCP creates a dynamic group which selects all members in the cluster. NSX-T has a limit of 8,000 effective members of a dynamic group (for details, see Configuration Maximums). Therefore, this feature should not be enabled for clusters that are expected to grow beyond 8,000 pods. Exceeding this limit can cause delays in the creation of resources for the pods.
Transparent mode load balancer
Only north-south traffic for a Kubernetes cluster is supported. Intra-cluster traffic is not supported.
Not supported for services attached to a LoadBalancer CRD or when auto scaling is enabled. Auto scaling must be disabled for this feature to work.
It is recommended to use this feature only on newly deployed clusters.
It is not possible to migrate a Kubernetes cluster if a previous migration failed and the cluster is rolled back. This is a limitation with NSX 18.104.22.168 or earlier releases only.
Issue 2131494: NGINX Kubernetes Ingress still works after changing the Ingress class from nginx to nsx
When you create an NGINX Kubernetes Ingress, NGINX creates traffic forwarding rules. If you change the Ingress class to any other value, NGINX does not delete the rules and continues to apply them, even if you delete the Kubernetes Ingress after changing the class. This is a limitation of NGINX.
Workaround: To delete the rules created by NGINX, delete the Kubernetes Ingress when the class value is nginx. Then re-create the Kubernetes Ingress.
Issue 3055618: When creating multiple Windows pods on a node simultaneously, some pods do not have a network adapter
When applying a yaml file to create multiple Windows pods on the same node, some pods do not have a network adapter.
Workaround: Restart the pods.
Issue 3043496: NCP stops running if Manager-to-Policy migration fails
NCP provides the migrate-mp2p job to migrate NSX resources used by NCP and TKGI. If migration fails, all migrated resources are rolled back but NCP is not restarted in Manager mode.
Make sure that all resources were rolled back. This can be done by checking the logs of the migrate-mp2p job. The logs must end with the line "All imported MP resources to Policy completely rolled back."
If all resources were rolled back, ssh into each master node and run the command "sudo /var/vcap/bosh/bin/monit start ncp".
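The log check can be scripted. The following is a minimal Python sketch, assuming the migrate-mp2p job logs have already been saved as text and that log lines may carry a timestamp or other prefix before the message. The function name is illustrative.

```python
# Final log message that indicates a complete rollback, as quoted in these notes.
ROLLBACK_MARKER = "All imported MP resources to Policy completely rolled back."


def rollback_completed(log_text: str) -> bool:
    """Return True if the last non-empty log line ends with the rollback marker."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    return bool(lines) and lines[-1].endswith(ROLLBACK_MARKER)
```

Start NCP with monit only when this check returns True.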
Issue 2999131: ClusterIP services not reachable from the pods
In a large-scale TKGI environment, ClusterIP services are not reachable from the pods. Other related issues are: (1) nsx-kube-proxy stops writing logs; and (2) the OVS flows are not created on the node.
Workaround: Restart nsx-kube-proxy.
Issue 2984240: The "NotIn" operator in matchExpressions does not work in namespaceSelector for a network policy's rule
When specifying a rule for a network policy, if you specify namespaceSelector, matchExpressions and the "NotIn" operator, the rule does not work. The NCP log has the error message "NotIn operator is not supported in NS selectors."
Workaround: Rewrite matchExpressions to avoid using the "NotIn" operator.
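Affected policies can be found before they are applied. The following is a minimal Python sketch, assuming the network policy manifest has already been loaded into a dictionary (for example, from YAML or JSON). The function name is illustrative and is not part of NCP.

```python
from typing import Any, Dict, List


def find_notin_namespace_selectors(policy: Dict[str, Any]) -> List[str]:
    """Return the paths of namespaceSelector blocks that use the NotIn operator."""
    hits: List[str] = []
    spec = policy.get("spec") or {}
    # Ingress rules list their peers under "from"; egress rules use "to".
    for direction, peer_key in (("ingress", "from"), ("egress", "to")):
        for r_idx, rule in enumerate(spec.get(direction) or []):
            for p_idx, peer in enumerate(rule.get(peer_key) or []):
                selector = peer.get("namespaceSelector") or {}
                for expr in selector.get("matchExpressions") or []:
                    if expr.get("operator") == "NotIn":
                        hits.append(
                            f"spec.{direction}[{r_idx}].{peer_key}[{p_idx}].namespaceSelector"
                        )
                        break  # one hit per selector is enough
    return hits
```

Each returned path points to a selector that must be rewritten without "NotIn".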
Issue 2997828: Migration of cluster from Manager mode to Policy mode fails if Ingress has more than 255 rules
In Policy mode, an NSX load balancer can support a maximum of 255 rules. If a cluster has an Ingress resource that has more than 255 rules, migrating the cluster from Manager mode to Policy mode will fail.
Workaround: Create LoadBalancer CRDs to distribute the rules across multiple NSX load balancers.
Issue 3033821: After manager-to-policy migration, distributed firewall rules not enforced correctly
After a manager-to-policy migration, newly created network policy-related distributed firewall (DFW) rules will have higher priority than the migrated DFW rules.
Workaround: Use the policy API to change the sequence of DFW rules as needed.
For a Kubernetes service of type ClusterIP, the hairpin-mode flag is not supported
NCP does not support the hairpin-mode flag for a Kubernetes service of type ClusterIP.
Issue 2224218: After a service or app is deleted, it takes 2 minutes to release the SNAT IP back to the IP pool
If you delete a service or app and recreate it within 2 minutes, it will get a new SNAT IP from the IP pool.
Workaround: After deleting a service or app, wait 2 minutes before recreating it if you want to reuse the same IP.
Issue 2404302: If multiple load balancer application profiles for the same resource type (for example, HTTP) exist on NSX-T, NCP will choose any one of them to attach to the Virtual Servers.
If multiple HTTP load balancer application profiles exist on NSX-T, NCP will choose any one of them with the appropriate x_forwarded_for configuration to attach to the HTTP and HTTPS Virtual Server. If multiple FastTCP and UDP application profiles exist on NSX-T, NCP will choose any one of them to attach to the TCP and UDP Virtual Servers, respectively. The load balancer application profiles might have been created by different applications with different settings. If NCP chooses to attach one of these load balancer application profiles to the NCP-created Virtual Servers, it might break the workflow of other applications.
Issue 2518111: NCP fails to delete NSX-T resources that have been updated from NSX-T
NCP creates NSX-T resources based on the configurations that you specify. If you make any updates to those NSX-T resources through NSX Manager or the NSX-T API, NCP might fail to delete those resources and re-create them when it is necessary to do so.
Workaround: Do not update NSX-T resources created by NCP through NSX Manager or the NSX-T API.
Issue 2416376: NCP fails to process a TAS ASG (App Security Group) that binds to more than 128 Spaces
Because of a limit in NSX-T distributed firewall, NCP cannot process a TAS ASG that binds to more than 128 Spaces.
Workaround: Create multiple ASGs and bind each of them to no more than 128 Spaces.
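The split can be planned ahead of time. The following is a minimal Python sketch that divides a list of Space names into groups of at most 128, one group per ASG. The function name is illustrative; creating and binding the ASGs is not shown.

```python
from typing import List, Sequence

# Maximum number of Spaces one ASG can bind to, per the limit described above.
MAX_SPACES_PER_ASG = 128


def split_spaces(spaces: Sequence[str], limit: int = MAX_SPACES_PER_ASG) -> List[List[str]]:
    """Split the Spaces into consecutive groups of at most `limit` entries."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [list(spaces[i:i + limit]) for i in range(0, len(spaces), limit)]
```

For example, 300 Spaces yield three groups, so three ASGs are needed.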
Issue 2537221: After upgrading NSX-T to 3.0, the networking status of container-related objects in the NSX Manager UI is shown as Unknown
In NSX Manager UI, the tab Inventory > Containers shows container-related objects and their status. In a TKGI environment, after upgrading NSX-T to 3.0, the networking status of the container-related objects is shown as Unknown. The issue is caused by the fact that TKGI does not detect the version change of NSX-T. This issue does not occur if NCP is running as a pod and the liveness probe is active.
Workaround: After the NSX-T upgrade, restart the NCP instances gradually (no more than 10 at the same time) so as not to overload NSX Manager.
Issue 2552564: In an OpenShift 4.3 environment, DNS forwarder might stop working if overlapping address found
In an OpenShift 4.3 environment, cluster installation requires that a DNS server be configured. If you use NSX-T to configure a DNS forwarder and there is IP address overlap with the DNS service, the DNS forwarder will stop working and cluster installation will fail.
Workaround: Configure an external DNS service, delete the cluster that failed to install and recreate the cluster.
Issue 2554357: Load balancer auto scaling does not work for IPv6
In an IPv6 environment, a Kubernetes service of type LoadBalancer will not be active when the existing load balancer scale is reached.
Workaround: Set nsx_v3.lb_segment_subnet = FE80::/10 in /var/vcap/jobs/ncp/config/ncp.ini for TKGI deployments and in nsx-ncp-configmap for others. Then restart NCP.
Issue 2597423: When importing manager objects to policy, a rollback will cause the tags of some resources to be lost
When importing manager objects to policy, if a rollback is necessary, the tags of the following objects will not be restored:
Spoofguard profiles (part of shared and cluster resources)
BgpneighbourConfig (part of shared resources)
BgpRoutingConfig (part of shared resources)
StaticRoute BfdPeer (part of shared resources)
Workaround: For resources that are part of the shared resources, manually restore the tags. Use the backup and restore feature to restore resources that are part of cluster resources.
Issue 2579968: When changes are made to Kubernetes services of type LoadBalancer at a high frequency, some virtual servers and server pools are not deleted as expected
When changes are made to Kubernetes services of type LoadBalancer at a high frequency, some virtual servers and server pools might remain in the NSX-T environment when they should be deleted.
Workaround: Restart NCP. Alternatively, manually remove stale virtual servers and their associated resources. A virtual server is stale if no Kubernetes service of type LoadBalancer has the virtual server's identifier in the external_id tag.
NCP fails to start when "logging to file" is enabled during Kubernetes installation
This issue happens when uid:gid=1000:1000 on the container host does not have permission to the log folder.
Workaround: Do one of the following:
Change the mode of the log folder to 777 on the container hosts.
Grant “rwx” permission of the log folder to uid:gid=1000:1000 on the container hosts.
Disable the “logging to file” feature.
Issue 2653214: Error while searching the segment port for a node after the node's IP address was changed
After changing a node's IP address, if you upgrade NCP or if the NCP operator pod is restarted, checking the NCP operator status with the command "oc describe co nsx-ncp" will show the error message "Error while searching segment port for node ..."
Workaround: None. Adding a static IP address on a node interface which also has DHCP configuration is not supported.
Issue 2672677: In a highly stressed OpenShift 4 environment, a node can become unresponsive
In an OpenShift 4 environment with a high level of pod density per node and a high frequency of pods getting deleted and created, a RHCOS node might go into a "Not Ready" state. Pods running on the affected node, with the exception of daemonset members, will be evicted and recreated on other nodes in the environment.
Workaround: Reboot the impacted node.
Issue 2707174: A Pod that is deleted and recreated with the same namespace and name has no network connectivity
If a Pod is deleted and recreated with the same namespace and name when NCP is not running and nsx-ncp-agents are running, the Pod might get wrong network configurations and not be able to access the network.
Workaround: Delete the Pod and recreate it when NCP is running.
Issue 2745904: The feature "Use IPSet for default running ASG" does not support removing or replacing an existing container IP block
If you enable "Use IPSet for default running ASG" on an NCP tile, NCP will create a dedicated NSGroup for all the container IP blocks configured by "IP Blocks of Container Networks" on the same NCP tile. This NSGroup will be used in the firewall rules created for global running ASGs to allow traffic for all the containers. If you later remove or replace an existing container IP block, it will be removed or replaced in the NSGroup. All the existing containers in the original IP block will no longer be associated with the global running ASGs. Their traffic might no longer work.
Workaround: Only append new IP blocks to "IP Blocks of Container Networks".
Issue 2745907: "monit" commands return incorrect status information for nsx-node-agent
On a diego_cell VM, when monit restarts nsx-node-agent, if it takes more than 30 seconds for nsx-node-agent to fully start, monit will show the status of nsx-node-agent as "Execution failed" and will not update its status to "running" even when nsx-node-agent is fully functional later.
Issue 2735244: nsx-node-agent and nsx-kube-proxy crash because of liveness probe failure
nsx-node-agent and nsx-kube-proxy use sudo to run some commands. If /etc/resolv.conf contains many entries for DNS servers and search domains, sudo can take a long time to resolve hostnames. This causes nsx-node-agent and nsx-kube-proxy to be blocked by the sudo command for a long time, and the liveness probe fails.
Workaround: Perform one of the two following actions:
Add hostname entries to /etc/hosts. For example, if hostname is 'host1', add the entry '127.0.0.1 host1'.
Set a larger value for the nsx-node-agent liveness probe timeout. Run the command 'kubectl edit ds nsx-node-agent -n nsx-system' to update the timeout value for both the nsx-node-agent and nsx-kube-proxy containers.
Issue 2736412: Parameter members_per_small_lbs is ignored if max_allowed_virtual_servers is set
If both max_allowed_virtual_servers and members_per_small_lbs are set, virtual servers may fail to attach to an available load balancer because only max_allowed_virtual_servers is taken into account.
Workaround: Relax the scale constraints instead of enabling auto scaling.
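A configuration can be checked for this combination. The following is a minimal Python sketch, assuming the NCP configuration is available as INI text (for example, the contents of ncp.ini). Because these notes do not say which section holds the two parameters, the sketch searches every section. The function name is illustrative.

```python
import configparser

# The two parameters named in the issue description.
LIMIT_KEYS = ("max_allowed_virtual_servers", "members_per_small_lbs")


def both_lb_limits_set(ini_text: str) -> bool:
    """Return True if both load balancer limit parameters are set in the INI text."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    found = set()
    for section in parser.sections():
        for key in LIMIT_KEYS:
            if parser.has_option(section, key):
                found.add(key)
    return found == set(LIMIT_KEYS)
```

If the check returns True, members_per_small_lbs is ignored, as described above.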
Issue 2740552: When deleting a static pod using api-server, nsx-node-agent does not remove the pod's OVS bridge port, and the network of the static pod which is re-created automatically by Kubernetes is unavailable
Kubernetes does not allow removing a static pod through the api-server. Kubernetes creates a mirror pod for each static pod so that the static pod is visible to the api-server. When you delete the pod through the api-server, only the mirror pod is deleted, and NCP receives and handles the delete request by removing all NSX resources allocated for the pod. However, the static pod still exists, and nsx-node-agent does not receive a delete request from CNI to remove the static pod's OVS bridge port.
Workaround: Remove the static pod by deleting the manifest file instead of removing the static pod by api-server.
Issue 2795482: Running pod stuck in ContainerCreating state after node/hypervisor reboot or any other operation
If the wait_for_security_policy_sync flag is true, a pod can go to the ContainerCreating state after being in the Running state for more than one hour because of a worker node hard reboot, hypervisor reboot, or some other reason. The pod remains in the ContainerCreating state indefinitely.
Workaround: Delete and recreate the pod.
Issue 2841030: With Kubernetes 1.22, the status of nsx-node-agent is always 'AppArmor'
With Kubernetes 1.22, when the nsx-node-agent pods are "Ready", their status is not updated from "AppArmor" to "Running". This does not impact the functionality of NCP or nsx-node-agent.
Workaround: Restart the nsx-node-agent pods.
Issue 2824129: A node has the status network-unavailable equal to true for more than 3 minutes after a restart
If you use NCP operator to manage NCP's lifecycle, when an nsx-node-agent daemonset recovers from a non-running state, its node will have the status network-unavailable equal to true until it has been running for 3 minutes. This is expected behavior.
Workaround: Wait for at least 3 minutes after nsx-node-agent restarts.
Issue 2868572: Open vSwitch (OVS) must be disabled on host VM before running NCP
To deploy NCP on a host VM, you must first stop OVS-related processes and delete some files on the host using the following commands:
sudo systemctl disable openvswitch-switch.service
sudo systemctl stop openvswitch-switch.service
sudo rm -rf /var/run/openvswitch
If you have already deployed NCP on a host VM, and OVS is not running correctly, perform the following steps to recover:
Perform the above 3 steps.
Delete nsx-node-agent pods on the nodes having the issue to restart the node agent pods with the command "kubectl delete pod $agent-pod -n nsx-system".
Workaround: See above.
Issue 2832480: For a Kubernetes service of type ClusterIP, sessionAffinityConfig.clientIP.timeoutSeconds cannot exceed 65535
For a Kubernetes service of type ClusterIP, if you set sessionAffinityConfig.clientIP.timeoutSeconds to a value greater than 65535, the actual value will be 65535.
Issue 2940772: Migrating NCP resources from Manager to Policy results in failure with NSX-T 3.2.0
Migrating NCP resources from Manager to Policy is supported with NSX-T 3.1.3 and NSX-T 3.2.1, but not NSX-T 3.2.0.
Issue 2934195: Some types of NSX groups are not supported for distributed firewall rules
An NSX group of type "IP Addresses Only" is not supported for distributed firewall (DFW) rules. An NSX group of type "Generic" with manually added IP addresses as members is also not supported.
Issue 2936436: NSX Manager UI does not show the NCP version on the container cluster page
When NSX Manager UI displays the container clusters in the inventory tab, the NCP version is not displayed.
Workaround: The NCP version is available by calling the API /policy/api/v1/fabric/container-clusters.
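The API call can be made from a script. The following is a minimal Python sketch, assuming HTTP basic authentication and a hypothetical NSX Manager host name. The response is returned as raw text because these notes do not describe its fields. TLS certificate handling is omitted. The function names are illustrative.

```python
import base64
import urllib.request

# API path given in the workaround above.
CONTAINER_CLUSTERS_PATH = "/policy/api/v1/fabric/container-clusters"


def build_request(manager_host: str, username: str, password: str) -> urllib.request.Request:
    """Build a GET request for the container clusters API using basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        f"https://{manager_host}{CONTAINER_CLUSTERS_PATH}",
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )


def fetch_container_clusters(request: urllib.request.Request) -> str:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode()
```

For example, fetch_container_clusters(build_request("nsx-manager.example.com", "admin", "password")) returns the response body, where the host name and credentials are placeholders.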
Issue 2939886: Migrating objects from Manager Mode to Policy Mode fails
Migrating objects from Manager Mode to Policy Mode fails if, in the network policy specification, egress and ingress have the same selector.
Issue 2961789: After migrating manager objects to policy, some of the health-check pod's related resources cannot be deleted
After migrating manager objects to policy, when you delete the health-check pod, the pod's related segment port and the distributed firewall rule's target group are not deleted.
Workaround: Manually delete those resources.
Issue 2966586: After migrating manager objects to policy, namespace creation fails
If an IP block is created in manager mode, after manager objects are migrated to policy, namespace creation fails because NCP cannot allocate subnets from this IP block.
Workaround: Create new IP blocks in policy mode and configure NCP to use these new IP blocks.
Issue 2972811: In a large-scale environment, the hyperbus connection to some worker nodes is down
In a large-scale environment, pod creation can get stuck for 10-15 minutes due to rpc channel timeout. The following issues may occur:
In a Kubernetes cluster, some pods will have the status ContainerCreating for 10-15 minutes.
In cfgAgent, the tunnel will have the status COMMUNICATION_ERROR for 10-15 minutes.
In the NSX UI, an alarm might be generated indicating that the hyperbus connection is down.
Workaround: None needed. This issue will automatically recover after 10-15 minutes.
Issue 2960121: For services of type LoadBalancer connectivity to pods on windows worker nodes fails if not configured correctly
For services of type LoadBalancer connectivity to pods on Windows worker nodes will fail if NCP is configured to use the default LB segment subnet. The default subnet 169.254.128.0/22 belongs to the IPv4 link-local space and is not forwarded on a Windows node.
Workaround: Configure NCP to use a non-default LB segment subnet. To do this, set the parameter lb_segment_subnet in the nsx_v3 section. Note that this will only have effect on newly created NSX load balancers.
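Whether a configuration is affected can be checked offline. The following is a minimal Python sketch, assuming the NCP configuration is available as INI text and that an unset lb_segment_subnet means the default 169.254.128.0/22 is in use. The function name is illustrative.

```python
import configparser
import ipaddress

# IPv4 link-local range and the default LB segment subnet stated above.
IPV4_LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")
DEFAULT_LB_SEGMENT_SUBNET = "169.254.128.0/22"


def lb_subnet_is_link_local(ini_text: str) -> bool:
    """Return True if the effective LB segment subnet is IPv4 link-local."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    value = parser.get("nsx_v3", "lb_segment_subnet", fallback=DEFAULT_LB_SEGMENT_SUBNET)
    subnet = ipaddress.ip_network(value.strip(), strict=False)
    # subnet_of() requires both networks to share an IP version, so check that first.
    return subnet.version == 4 and subnet.subnet_of(IPV4_LINK_LOCAL)
```

A result of True means that pods on Windows worker nodes will not be reachable through services of type LoadBalancer until a different subnet is configured.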