VMware NSX Container Plugin 4.1.1 | 17 August 2023 | Build 22071564
Check for additions and updates to these release notes.
Starting with this release, new TAS deployments will only be allowed with NSX Policy. For existing foundations, NCP settings will not be changed upon upgrades.
NSX resources created by TKGi for deployment network are now migrated to Policy during MP-to-Policy migration.
NCP supports up to 500 OCP routes on a given NSX Load Balancer Virtual Server.
You can back up and later restore NSX to a previous state. NCP will ensure that resources in an OpenShift cluster and NSX are in a consistent state.
OpenShift Operator can simplify NCP deployment by automating certain option configurations. Validation of options that are input from the configmap is also enhanced.
The feature that allows access via NAT to Ingress controller pods using the "ncp/ingress_controller" annotation is deprecated and will be removed in 2023. The recommended way to expose Ingress controller pods is to use services of type LoadBalancer.
The nsx-ovs kernel module is deprecated. Only the upstream OVS kernel module is supported, which is the default behavior. The configmap option "use_nsx_ovs_kernel_module" under the "nsx_node_agent" section in the nsx-node-agent configmap is removed.
NCP/NSX Tile for Tanzu Application Service (TAS)
3.2.2, 3.2.3, 188.8.131.52, 4.1.0, 4.1.1
1.24, 1.25, 1.26
4.10, 4.11, 4.12
Kubernetes Host VM OS
Ubuntu 22.04 with kernel 5.15 (both nsx-ovs kernel module and upstream OVS kernel module supported)
Ubuntu 22.04 with kernel later than 5.15 (only upstream OVS kernel module supported)
RHEL 8.4, 8.5, 8.6
See notes below.
Tanzu Application Service (TAS)
Ops Manager 2.10 + TAS 2.13
Ops Manager 3.0 + TAS 2.13
Ops Manager 2.10 + TAS 3.0 (End of support date: 31 October 2023)
Ops Manager 3.0 + TAS 3.0 (End of support date: 31 October 2023)
Ops Manager 2.10 + TAS 4.0
Ops Manager 3.0 + TAS 4.0
Tanzu Kubernetes Grid Integrated (TKGI)
For all supported integrations, use the Red Hat Universal Base Image (UBI). For more information, see https://www.redhat.com/en/blog/introducing-red-hat-universal-base-image.
For a new deployment on TAS, only Policy mode is supported.
Support for upgrading to this release:
4.1.0, 4.1.1, and all 4.0.x releases.
The "baseline policy" feature for NCP creates a dynamic group which selects all members in the cluster. NSX-T has a limit of 8,000 effective members of a dynamic group (for details, see Configuration Maximums). Therefore, this feature should not be enabled for clusters that are expected to grow beyond 8,000 pods. Exceeding this limit can cause delays in the creation of resources for the pods.
Transparent mode load balancer
Only north-south traffic for a Kubernetes cluster is supported. Intra-cluster traffic is not supported.
Not supported for services attached to a LoadBalancer CRD or when auto scaling is enabled. Auto scaling must be disabled for this feature to work.
It is recommended to use this feature only on newly deployed clusters.
It is not possible to migrate a Kubernetes cluster if a previous migration failed and the cluster is rolled back. This is a limitation with NSX 184.108.40.206 or earlier releases only.
There is a risk of significant performance degradation in the actual group member calculation, with impact on network traffic, when implementing Network Policies that use multi-selectors criteria in Ingress/Egress rules. To address this limitation, there is a new configuration option, enable_mixed_expression_groups, which affects Kubernetes Network Policies using multi-selectors in Policy mode. Clusters in Manager mode are not affected. The default value of this option is False. We recommend the following values in your cluster:
New clusters, Policy mode: False
Existing clusters (Policy-based): True
After Manager-to-Policy migration: True
OC: Set to True to ensure Kubernetes Network Policy conformance
New Clusters (Policy-based): False
Existing clusters (Policy-based): True
After Manager-to-Policy migration: True
This limitation applies when enable_mixed_expression_groups is set to True. This affects installations that use NCP version 3.2.0 and later, and NSX-T version 3.2.0 and later. There is no limitation on the number of namespaces that the Network Policy affects. If this option is set to True and NCP is restarted, NCP will sync all Network Policies again to implement this behavior.
When enable_mixed_expression_groups is set to False, Network Policies that use multi-selectors criteria in Ingress/Egress rules are realized with dynamic NSX groups that are not affected by any performance degradation in calculating the actual members. However, the rules can be enforced on only up to 5 namespaces, depending on the other criteria defined in the Network Policy. If the Network Policy affects more than 5 namespaces at any point in time, it will be annotated with "ncp/error: NETWORK_POLICY_VALIDATION_FAILED" and not enforced in NSX. Note that this can happen when a new namespace is created that satisfies the multi-selector conditions or an existing namespace is updated. If this option is set to False and NCP is restarted, NCP will sync all Network Policies again to implement this behavior.
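To estimate ahead of time whether a policy risks hitting the 5-namespace limit, you can count the namespaces that a selector would match. The following is a minimal sketch, not NCP's own validation logic. The function names are hypothetical, only matchLabels selectors are handled, and the namespace labels are passed in as a plain dictionary instead of being read from the cluster.

```python
# Illustrative sketch only; this is not how NCP validates Network Policies.
# MAX_NAMESPACES reflects the 5-namespace limit described above.
MAX_NAMESPACES = 5


def matching_namespaces(match_labels, namespace_labels):
    """Return sorted names of namespaces whose labels contain every pair in match_labels."""
    return sorted(
        name
        for name, labels in namespace_labels.items()
        if all(labels.get(key) == value for key, value in match_labels.items())
    )


def exceeds_namespace_limit(match_labels, namespace_labels, limit=MAX_NAMESPACES):
    """Return True if the selector matches more namespaces than the limit."""
    return len(matching_namespaces(match_labels, namespace_labels)) > limit


if __name__ == "__main__":
    # Made-up namespaces: six labeled env=prod and one labeled env=dev.
    namespaces = {f"team-{i}": {"env": "prod"} for i in range(6)}
    namespaces["sandbox"] = {"env": "dev"}
    print(matching_namespaces({"env": "prod"}, namespaces))
    print(exceeds_namespace_limit({"env": "prod"}, namespaces))
```

Because a newly created or relabeled namespace can push a policy over the limit, a check like this is only a snapshot of the cluster at the time it runs.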
Issue 3049209: After manager-to-policy migration, deleting clusters does not delete mp_default_LR_xxx_user_rules resource
After performing a manager-to-policy migration, and then deleting clusters, some "GatewayPolicy" resources named mp_default_LR_xxxx_user_rules may not get deleted.
Workaround: Delete the resources manually.
Issue 3113985: When migrating a single-tier1 topology, not all static routes are migrated
In a single-tier1 topology with multiple custom resources of type loadbalancers.vmware.com, some static routes created by NCP in Manager mode for the load balancers are not migrated.
Workaround: After deleting the custom resource of type loadbalancers.vmware.com from Kubernetes, manually delete the static route with Manager API. The static route will have the UID of the custom resource in its tags with scope "ncp/crd_lb_uid".
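If you have already retrieved the static routes, the ones left behind can be picked out by their tags. The sketch below assumes each route is a dictionary with a "tags" list of scope/tag pairs. The sample routes and the helper name are hypothetical. Only the scope "ncp/crd_lb_uid" comes from the workaround above.

```python
# Illustrative sketch: the route dictionaries used with this helper are sample data.
# Only the tag scope "ncp/crd_lb_uid" is taken from the workaround text.
CRD_LB_SCOPE = "ncp/crd_lb_uid"


def routes_for_crd(static_routes, crd_uid):
    """Return the static routes tagged with the given load balancer CRD UID."""
    return [
        route
        for route in static_routes
        if any(
            tag.get("scope") == CRD_LB_SCOPE and tag.get("tag") == crd_uid
            for tag in route.get("tags", [])
        )
    ]


if __name__ == "__main__":
    sample = [
        {"id": "r1", "tags": [{"scope": CRD_LB_SCOPE, "tag": "uid-a"}]},
        {"id": "r2", "tags": [{"scope": "example/other", "tag": "uid-a"}]},
    ]
    print([route["id"] for route in routes_for_crd(sample, "uid-a")])
```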
Issue 3055618: When creating multiple Windows pods on a node simultaneously, some pods do not have a network adapter
When applying a yaml file to create multiple Windows pods on the same node, some pods do not have a network adapter.
Workaround: Restart the pods.
Issue 3088138: After setting log_file in nsx-node-agent-config configmap, nsx-node-agent pods fail to start
If you set the log_file option in nsx-node-agent-config configmap and restart the nsx-ncp-bootstrap pods before the nsx-node-agent pods, the nsx-node-agent pods will fail to start and be in the CrashLoopBackOff state.
Workaround: Restart the nsx-node-agent pods before restarting the nsx-ncp-bootstrap pods after setting the log_file option in nsx-node-agent-config configmap.
Issue 3091318: Pod creation fails after updating a namespace's static subnet when NCP is down
If you create a namespace with ncp/subnets set, for example, to 220.127.116.11/29, and no pods have been created in the namespace yet, and you stop NCP and update ncp/subnets to, for example, 18.104.22.168/29, after NCP is restarted, creating a pod in the namespace may fail, with the pod stuck in the "ContainerCreating" state.
Workaround: Recreate the pod.
Issue 3110833: Pods on TKGI Windows worker node cannot start, status is "ContainerCreating"
Every pod on the Windows worker node fails to start. The nsx-node-agent log on the node continuously reports "Failed to process cif config request with error [...]. Restart node agent service to recover."
Workaround: Restart the nsx-node-agent service on the node.
Issue 3306543: After deleting a pod, a new pod created with the same name has incorrect networking configuration
On rare occasions, if you delete a pod and create a new pod with the same name, the new pod will have the old pod's network configuration. The new pod's network configuration will be incorrect.
Workaround: Delete the new pod, restart nsx-node-agent and then re-create the pod.
Issue 3239352: In a TAS environment, when a Task cannot be allocated, retry may not work
In an NCP TAS environment, when a Task cannot be allocated the Auctioneer rejects the task and the BBS retries placement of the task up to the number of times specified by the setting task.max_retries. When task.max_retries is reached, the BBS updates the Task from the PENDING state to the COMPLETED state, marking it as Failed and including a FailureReason that explains that the cluster has no capacity for the task.
During retry, the task may be scheduled to a new cell which notifies NCP with a task_changed event. Since NCP does not handle the task_changed event the task cannot be assigned a new port in the new cell. The task cannot run properly.
Workaround: Disable the retry and set the task.max_retries value to 0.
Issue 3252571: Manager-to-Policy migration never completes if NSX Manager becomes unavailable
If NSX Manager becomes unavailable during Manager-to-Policy migration, the migration may never complete. One indication is that the logs will have no updates about the migration.
Workaround: Re-establish the connection to NSX Manager and restart the migration.
Issue 3248662: Worker node fails to access a service. The OVS flow for the service is not created on the node.
The nsx-kube-proxy log has the error message "greenlet.error: cannot switch to a different thread."
Workaround: Restart nsx-kube-proxy on the node.
Issue 3241693: Layer-7 routes take more than 10 minutes to start working when the number of routes created exceeds some limits
In an OpenShift environment, you can deploy more than 1000 routes by setting the flags 'relax_scale_validation' to True and 'l4_lb_auto_scaling' to False in the ConfigMap. However, routes will take more than 10 minutes to start working when the number of routes created exceeds the limits. The limits are 500 HTTPS routes and 2000 HTTP routes.
Workaround: Do not exceed the limits for the number of routes. If you create 500 HTTPS plus 2000 HTTP routes, you must deploy the routes using a large-size edge VM.
Issue 3158230: nsx-ncp-bootstrap container fails to initialize while loading AppArmor profiles on Ubuntu 20.04
The nsx-ncp-bootstrap container in nsx-ncp-bootstrap DaemonSet fails to initialize because of different package versions of AppArmor on the host OS and the container image. The logs of the container show messages such as "Failed to load policy-features from '/etc/apparmor.d/abi/2.13': No such file or directory".
Workaround: Update AppArmor to version 2.13.3-7ubuntu5.2 or the latest available from focal-updates on the host OS.
Issue 3179549: Changing the NAT mode for an existing namespace is not supported
For a namespace with existing pods, if you change the NAT mode from SNAT to NO_SNAT, the pods will still use IP addresses allocated from the IP blocks specified in container_ip_blocks. If the segment subnet in the namespace still has available IP addresses, newly created pods will still use the IP addresses of the existing segment subnet. For a newly created segment, the subnet is allocated from no_snat_ip_block. But on the namespace, the SNAT rule will be deleted.
Issue 3218243: Security Policy in NSX created for Kubernetes Network Policy that uses multi-selector criteria gets removed after upgrading NCP to version 4.1.1 or when user creates/updates namespace
Verify that the option "enable_mixed_expression_groups" is set to False in NCP (default value is False). If that is the case, the Network Policy is leading to the creation of more than 5 group criteria on NSX, which is not supported.
Workaround: Set enable_mixed_expression_groups to True in NCP config map and restart NCP. Note that there is a risk of significant performance degradation in the actual group member calculation with impact on network traffic in this case.
Issue 3235394: The baseline policy with namespace setting does not work in a TKGI setup
In a TKGI environment, if you set baseline_policy_type to allow_namespace or allow_namespace_strict, NCP will create an explicit baseline policy to allow only pods within the same namespace to communicate with each other and deny ingress from other namespaces. This baseline policy will also block a system namespace, such as kube-system, from accessing pods in different namespaces.
Workaround: None. NCP does not support this feature in a TKGI setup.
Issue 3179960: Application instance not reachable after vMotion and has the same IP address as another application instance
When bulk vMotion happens, for example, during NSX host upgrade, hosts go into maintenance mode one by one and Diego Cells migrate between hosts. After the vMotion, some segment ports might be missing, some application instances might be unreachable, and two application instances might have the same IP address. This issue is more likely to happen with TAS 2.13.18.
Workaround: Re-create the application instances affected by this issue.
Issue 3108579: Deleting LB CRD and recreating it immediately with the same secret fails
In Manager mode, if you delete Ingress on an LB CRD, delete the LB CRD, and immediately recreate the Ingress and LB CRD with the same certificate, you may see the error "Attempted to import a certificate which has already been imported." This is caused by a timing issue because the deletion of LB CRD must wait for the deletion of Ingress to be completed.
Workaround: Do one of the following:
- Run the following command to wait for the deletion of Ingress to be completed and then delete the LB CRD.
- Wait for at least 2 minutes before recreating the Ingress and LB CRD.
Issue 3161931: nsx-ncp-bootstrap pod fails to run on Ubuntu 18.04 and Ubuntu 20.04 host VMs
The nsx-ncp-bootstrap container in the nsx-ncp-bootstrap pod fails to reload "AppArmor" with the following log messages: "Failed to load policy-features from '/etc/apparmor.d/abi/2.13': No such file or directory." The issue is caused by different versions of the "AppArmor" package installed in the image used to run the nsx-ncp-bootstrap pod and host OS. This issue does not exist on Ubuntu 22.04 host VMs.
Workaround: Ubuntu 18.04 is not supported with NCP 4.1.1. On Ubuntu 20.04, update "AppArmor" to the minimum version 2.13.3-7ubuntu5.2. The package is available via focal-updates.
Issue 3221191: Creation of domain group fails when cluster has more than 4000 pods
If the NCP option k8s.baseline_policy_type is set to allow_cluster, allow_namespace, or allow_namespace_strict, and the cluster has more than 4000 pods, the domain group (with a name such as dg-k8sclustername), which contains all the IP addresses of the pods, will fail to be created. This is caused by a limitation on NSX.
Workaround: Do not set the option k8s.baseline_policy_type or ensure that there are fewer than 4000 pods in the cluster.
Issue 3043496: NCP stops running if Manager-to-Policy migration fails
NCP provides the migrate-mp2p job to migrate NSX resources used by NCP and TKGI. If migration fails, all migrated resources are rolled back but NCP is not restarted in Manager mode.
Make sure that all resources were rolled back. This can be done by checking the logs of the migrate-mp2p job. The logs must end with the line "All imported MP resources to Policy completely rolled back."
If all resources were rolled back, ssh into each master node and run the command "sudo /var/vcap/bosh/bin/monit start ncp".
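The log check can be scripted once the migrate-mp2p job logs have been saved as text. This is a minimal sketch: the function name is hypothetical, and it assumes the final log line may carry a prefix such as a timestamp, so it looks for the message inside the last non-empty line.

```python
# Illustrative sketch: checks saved migrate-mp2p log text for the rollback message.
# The message text is quoted from the release note; its surrounding log format is assumed.
ROLLBACK_DONE = "All imported MP resources to Policy completely rolled back"


def rollback_completed(log_text):
    """Return True if the last non-empty log line contains the rollback message."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    return bool(lines) and ROLLBACK_DONE in lines[-1]


if __name__ == "__main__":
    sample_log = "migration failed\n" + ROLLBACK_DONE + ".\n"
    print(rollback_completed(sample_log))
```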
Issue 2131494: NGINX Kubernetes Ingress still works after changing the Ingress class from nginx to nsx
When you create an NGINX Kubernetes Ingress, NGINX creates traffic forwarding rules. If you change the Ingress class to any other value, NGINX does not delete the rules and continues to apply them, even if you delete the Kubernetes Ingress after changing the class. This is a limitation of NGINX.
Workaround: To delete the rules created by NGINX, delete the Kubernetes Ingress when the class value is nginx. Then re-create the Kubernetes Ingress.
Issue 2999131: ClusterIP services not reachable from the pods
In a large-scale TKGi environment, ClusterIP services are not reachable from the pods. Other related issues are: (1) The nsx-kube-proxy stops outputting the logs of nsx-kube-proxy; and (2) The OVS flows are not created on the node.
Workaround: Restart nsx-kube-proxy.
Issue 2984240: The "NotIn" operator in matchExpressions does not work in namespaceSelector for a network policy's rule
When specifying a rule for a network policy, if you specify namespaceSelector, matchExpressions and the "NotIn" operator, the rule does not work. The NCP log has the error message "NotIn operator is not supported in NS selectors."
Workaround: Rewrite matchExpressions to avoid using the "NotIn" operator.
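To find the rules that need rewriting, you can scan a network policy for the unsupported operator. The sketch below takes a NetworkPolicy already loaded as a Python dictionary and reports which ingress or egress rules use "NotIn" inside a namespaceSelector. The function name is hypothetical, and the sketch only detects the operator; it does not rewrite the rule.

```python
# Illustrative sketch: finds rules whose namespaceSelector uses the NotIn operator.
# Field names follow the Kubernetes NetworkPolicy spec (ingress[].from, egress[].to).
def namespace_selectors_using_notin(network_policy):
    """Return (direction, rule_index) pairs whose namespaceSelector uses NotIn."""
    hits = []
    spec = network_policy.get("spec", {})
    for direction, peer_key in (("ingress", "from"), ("egress", "to")):
        for index, rule in enumerate(spec.get(direction) or []):
            for peer in rule.get(peer_key) or []:
                selector = peer.get("namespaceSelector") or {}
                expressions = selector.get("matchExpressions") or []
                if any(expr.get("operator") == "NotIn" for expr in expressions):
                    hits.append((direction, index))
                    break  # one hit per rule is enough
    return hits


if __name__ == "__main__":
    policy = {
        "spec": {
            "ingress": [
                {"from": [{"namespaceSelector": {"matchExpressions": [
                    {"key": "env", "operator": "NotIn", "values": ["dev"]}
                ]}}]}
            ]
        }
    }
    print(namespace_selectors_using_notin(policy))
```

One possible rewrite, depending on your labels, is an "In" expression that lists the allowed values explicitly.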
Issue 3033821: After manager-to-policy migration, distributed firewall rules not enforced correctly
After a manager-to-policy migration, newly created network policy-related distributed firewall (DFW) rules will have higher priority than the migrated DFW rules.
Workaround: Use the policy API to change the sequence of DFW rules as needed.
For a Kubernetes service of type ClusterIP, the hairpin-mode flag is not supported
NCP does not support the hairpin-mode flag for a Kubernetes service of type ClusterIP.
Issue 2224218: After a service or app is deleted, it takes 2 minutes to release the SNAT IP back to the IP pool
If you delete a service or app and recreate it within 2 minutes, it will get a new SNAT IP from the IP pool.
Workaround: After deleting a service or app, wait 2 minutes before recreating it if you want to reuse the same IP.
Issue 2404302: If multiple load balancer application profiles for the same resource type (for example, HTTP) exist on NSX-T, NCP will choose any one of them to attach to the Virtual Servers.
If multiple HTTP load balancer application profiles exist on NSX-T, NCP will choose any one of them with the appropriate x_forwarded_for configuration to attach to the HTTP and HTTPS Virtual Server. If multiple FastTCP and UDP application profiles exist on NSX-T, NCP will choose any one of them to attach to the TCP and UDP Virtual Servers, respectively. The load balancer application profiles might have been created by different applications with different settings. If NCP chooses to attach one of these load balancer application profiles to the NCP-created Virtual Servers, it might break the workflow of other applications.
Issue 2518111: NCP fails to delete NSX-T resources that have been updated from NSX-T
NCP creates NSX-T resources based on the configurations that you specify. If you make any updates to those NSX-T resources through NSX Manager or the NSX-T API, NCP might fail to delete those resources and re-create them when it is necessary to do so.
Workaround: Do not update NSX-T resources created by NCP through NSX Manager or the NSX-T API.
Issue 2416376: NCP fails to process a TAS ASG (App Security Group) that binds to more than 128 Spaces
Because of a limit in NSX-T distributed firewall, NCP cannot process a TAS ASG that binds to more than 128 Spaces.
Workaround: Create multiple ASGs and bind each of them to no more than 128 Spaces.
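The split described in the workaround amounts to chunking the list of Spaces into groups of at most 128. A minimal sketch follows, with a hypothetical helper name and made-up Space identifiers. Creating and binding the ASGs themselves is not shown.

```python
# Illustrative sketch: splits a list of Space identifiers into groups of at most 128,
# one group per ASG. The identifiers in the demo are made up.
MAX_SPACES_PER_ASG = 128


def split_spaces(space_ids, limit=MAX_SPACES_PER_ASG):
    """Return consecutive chunks of space_ids, each with at most `limit` entries."""
    return [space_ids[i:i + limit] for i in range(0, len(space_ids), limit)]


if __name__ == "__main__":
    spaces = [f"space-{n}" for n in range(300)]
    print([len(group) for group in split_spaces(spaces)])
```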
NCP fails to start when "logging to file" is enabled during Kubernetes installation
This issue happens when uid:gid=1000:1000 on the container host does not have permission to the log folder.
Workaround: Do one of the following:
- Change the mode of the log folder to 777 on the container hosts.
- Grant “rwx” permission of the log folder to uid:gid=1000:1000 on the container hosts.
- Disable the “logging to file” feature.
Issue 2653214: Error while searching the segment port for a node after the node's IP address was changed
After changing a node's IP address, if you upgrade NCP or if the NCP operator pod is restarted, checking the NCP operator status with the command "oc describe co nsx-ncp" will show the error message "Error while searching segment port for node ..."
Workaround: None. Adding a static IP address on a node interface which also has DHCP configuration is not supported.
Issue 2672677: In a highly stressed OpenShift 4 environment, a node can become unresponsive
In an OpenShift 4 environment with a high level of pod density per node and a high frequency of pods getting deleted and created, a RHCOS node might go into a "Not Ready" state. Pods running on the affected node, with the exception of daemonset members, will be evicted and recreated on other nodes in the environment.
Workaround: Reboot the impacted node.
Issue 2707174: A Pod that is deleted and recreated with the same namespace and name has no network connectivity
If a Pod is deleted and recreated with the same namespace and name when NCP is not running and nsx-ncp-agents are running, the Pod might get wrong network configurations and not be able to access the network.
Workaround: Delete the Pod and recreate it when NCP is running.
Issue 2745907: "monit" commands return incorrect status information for nsx-node-agent
On a diego_cell VM, when monit restarts nsx-node-agent, if it takes more than 30 seconds for nsx-node-agent to fully start, monit will show the status of nsx-node-agent as "Execution failed" and will not update its status to "running" even when nsx-node-agent is fully functional later.
Issue 2735244: nsx-node-agent and nsx-kube-proxy crash because of liveness probe failure
nsx-node-agent and nsx-kube-proxy use sudo to run some commands. If there are many entries in /etc/resolv.conf about DNS server and search domains, sudo can take a long time to resolve hostnames. This will cause nsx-node-agent and nsx-kube-proxy to be blocked by the sudo command for a long time, and liveness probe will fail.
Workaround: Perform one of the two following actions:
- Add hostname entries to /etc/hosts. For example, if hostname is 'host1', add the entry '127.0.0.1 host1'.
- Set a larger value for the nsx-node-agent liveness probe timeout. Run the command 'kubectl edit ds nsx-node-agent -n nsx-system' to update the timeout value for both the nsx-node-agent and nsx-kube-proxy containers.
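For the first option, you can check whether a hostname already has an entry before editing the file. The sketch below works on the text of a hosts file that you supply. The function name is hypothetical, and the parser only handles the usual "address name [aliases]" layout with "#" comments.

```python
# Illustrative sketch: checks hosts-file text for an entry naming the given host.
# It parses text passed in by the caller and does not modify any file.
def has_hosts_entry(hosts_text, hostname):
    """Return True if any non-comment line lists hostname after the address field."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and hostname in fields[1:]:
            return True
    return False


if __name__ == "__main__":
    sample = "127.0.0.1 localhost\n# 127.0.0.1 host1\n"
    print(has_hosts_entry(sample, "host1"))
```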
Issue 2736412: Parameter members_per_small_lbs is ignored if max_allowed_virtual_servers is set
If both max_allowed_virtual_servers and members_per_small_lbs are set, virtual servers may fail to attach to an available load balancer because only max_allowed_virtual_servers is taken into account.
Workaround: Relax the scale constraints instead of enabling auto scaling.
Issue 2740552: When deleting a static pod using api-server, nsx-node-agent does not remove the pod's OVS bridge port, and the network of the static pod which is re-created automatically by Kubernetes is unavailable
Kubernetes does not allow a static pod to be removed through the api-server. Kubernetes creates a mirror pod for the static pod so that the static pod can be found through the api-server. When you delete the pod through the api-server, only the mirror pod is deleted, and NCP receives and handles the delete request, removing all NSX resources allocated for the pod. However, the static pod still exists, and nsx-node-agent does not get a delete request from CNI to remove the static pod's OVS bridge port.
Workaround: Remove the static pod by deleting the manifest file instead of removing the static pod by api-server.
Issue 2824129: A node has the status network-unavailable equal to true for more than 3 minutes after a restart
If you use NCP operator to manage NCP's lifecycle, when an nsx-node-agent daemonset recovers from a non-running state, its node will have the status network-unavailable equal to true until it has been running for 3 minutes. This is expected behavior.
Workaround: Wait for at least 3 minutes after nsx-node-agent restarts.
Issue 2832480: For a Kubernetes service of type ClusterIP, sessionAffinityConfig.clientIP.timeoutSeconds cannot exceed 65535
For a Kubernetes service of type ClusterIP, if you set sessionAffinityConfig.clientIP.timeoutSeconds to a value greater than 65535, the actual value will be 65535.
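In other words, the configured value is capped at 65535. A short sketch of that arithmetic, with a hypothetical function name:

```python
# Illustrative sketch of the cap described above; not NCP code.
MAX_TIMEOUT_SECONDS = 65535


def effective_timeout(requested_seconds):
    """Return the timeout that takes effect for a requested timeoutSeconds value."""
    return min(requested_seconds, MAX_TIMEOUT_SECONDS)


if __name__ == "__main__":
    print(effective_timeout(10800))
    print(effective_timeout(86400))
```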
Issue 2940772: Migrating NCP resources from Manager to Policy results in failure with NSX-T 3.2.0
Migrating NCP resources from Manager to Policy is supported with NSX-T 3.1.3 and NSX-T 3.2.1, but not NSX-T 3.2.0.
Issue 2934195: Some types of NSX groups are not supported for distributed firewall rules
An NSX group of type "IP Addresses Only" is not supported for distributed firewall (DFW) rules. An NSX group of type "Generic" with manually added IP addresses as members is also not supported.
Issue 2939886: Migrating objects from Manager Mode to Policy Mode fails
Migrating objects from Manager Mode to Policy Mode fails if, in the network policy specification, egress and ingress have the same selector.
Issue 3066449: Namespace subnets are not always allocated from the first available IP block when use_ip_blocks_in_order is set to True
When creating multiple namespaces with use_ip_blocks_in_order set to True, the first namespace's subnet is sometimes not allocated from the first available IP block. For example, assume that container_ip_blocks = '22.214.171.124/28,126.96.36.199/28', the subnet prefix length is 29, and subnet 188.8.131.52/29 is already allocated. If you create 2 namespaces ns-1 and ns-2, the subnet allocation could be (1) ns-1: 184.108.40.206/29, ns-2: 220.127.116.11/29, or (2) ns-1: 18.104.22.168/29, ns-2: 22.214.171.124/29.
The use_ip_blocks_in_order parameter only ensures that different IP blocks are used in the order they appear in the container_ip_blocks parameter. When creating multiple namespaces at the same time, any namespace may request a subnet through an API call before another namespace. Therefore, there is no guarantee that a specific namespace will be allocated a subnet from a specific IP block.
Workaround: Create the namespaces separately, that is, create the first namespace, make sure its subnet has been allocated, and then create the next namespace.
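The effect of request order can be illustrated with a simplified model. The sketch below is not NCP's allocator: the function names are hypothetical, and the IP blocks are documentation ranges chosen for illustration. It carves each block into /29 subnets in block order and hands the first free subnet to whichever namespace asks first.

```python
# Illustrative model only; NCP's real allocation logic is not shown here.
# The IP blocks are documentation ranges picked for the example.
import ipaddress
from itertools import permutations


def candidate_subnets(ip_blocks, prefix_len):
    """List subnets of the given prefix length, block by block, in order."""
    subnets = []
    for block in ip_blocks:
        subnets.extend(ipaddress.ip_network(block).subnets(new_prefix=prefix_len))
    return subnets


def allocate(request_order, ip_blocks, prefix_len, already_allocated=()):
    """Give each namespace the first free subnet, in the order requests arrive."""
    taken = {ipaddress.ip_network(subnet) for subnet in already_allocated}
    free = [s for s in candidate_subnets(ip_blocks, prefix_len) if s not in taken]
    return {ns: str(subnet) for ns, subnet in zip(request_order, free)}


if __name__ == "__main__":
    blocks = ["192.0.2.0/28", "198.51.100.0/28"]
    used = ["192.0.2.0/29"]
    # Both arrival orders are possible when namespaces are created together.
    for order in permutations(["ns-1", "ns-2"]):
        print(order, allocate(order, blocks, 29, used))
```

In both orders the blocks are consumed in sequence, which is all that use_ip_blocks_in_order ensures. Which namespace receives the subnet from the first block depends only on which request arrives first.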