VMware NSX Container Plugin 4.2.1 | 23 SEP 2024 | Build 24284046 Check for additions and updates to these release notes. |
Support for the rollback of external IP allocations that were migrated during the migration of a Kubernetes cluster or TAS foundation. This feature requires NSX 4.2.1.
Product | Version
NCP/NSX-T Tile for Tanzu Application Service (TAS) | 4.2.1
NSX-T/NSX | NSX-T 3.2.4, 3.2.5. NSX 4.1.0.3, 4.1.2.4, 4.1.2.5, 4.2.0, 4.2.1. (See notes below.)
vSphere | 6.7, 7.0, 8.0.0.1
Kubernetes | 1.28, 1.29, 1.30
OpenShift 4 | 4.14, 4.15, 4.16
Kubernetes Host VM OS | Ubuntu 20.04 (only kernel version 5.15 or earlier supported); Ubuntu 22.04 (only kernel version 5.15 or earlier supported, only upstream OVS kernel module supported); Ubuntu 24.04; RHEL 8.8, 8.9, 8.10, 9.3, 9.4. See notes below.
Tanzu Application Service (TAS) | Ops Manager 3.0 + TAS 4.0; Ops Manager 3.0 + TAS 5.0; Ops Manager 3.0 + TAS 6.0
Tanzu Kubernetes Grid Integrated (TKGI) | 1.18, 1.19, 1.20
Notes:
NSX-T 3.2.4 and 3.2.5 are supported with basic sanity testing coverage. See Product Interoperability Matrix.
The installation of the nsx-ovs kernel module on RHEL requires a specific kernel version. The supported RHEL kernel versions are 193, 305, 348, and 372, regardless of the RHEL version. If you are running a different kernel version, do one of the following: (1) Change your kernel version to one that is supported. When you change the kernel version and then restart the VM, make sure that the IP and static routes are persisted on the uplink interface (specified by ovs_uplink_port) so that connectivity to the Kubernetes API server is not lost. (2) Skip the installation of the nsx-ovs kernel module by setting "use_nsx_ovs_kernel_module" to "False" under the "nsx_node_agent" section in the nsx-node-agent config map. For information about switching between NSX-OVS and upstream OVS kernel modules, see https://docs.vmware.com/en/VMware-NSX-Container-Plugin/4.1/ncp-kubernetes/GUID-7225DDCB-88CB-4A2D-83A3-74BB9ED7DCFF.html.
To run the nsx-ovs kernel module on RHEL, you must disable the "UEFI secure boot" option under "Boot Options" in the VM's settings in vCenter Server.
For all supported integrations, use the Red Hat Universal Base Image (UBI). For more information, see https://www.redhat.com/en/blog/introducing-red-hat-universal-base-image.
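The RHEL kernel requirement in the notes above can be checked before deployment. The following is a minimal sketch, not part of NCP. It assumes that the version numbers 193, 305, 348, and 372 refer to the first integer after the hyphen in the kernel release string (for example, the output of uname -r). The function names are hypothetical.

```python
import re
from typing import Optional

# Kernel builds listed in the notes as supported for the nsx-ovs kernel module.
SUPPORTED_RHEL_KERNEL_BUILDS = {193, 305, 348, 372}


def rhel_kernel_build(release: str) -> Optional[int]:
    """Extract the build number from a release string such as
    '4.18.0-372.9.1.el8.x86_64'.

    Assumption: the build number is the first integer after the hyphen.
    Returns None if the string does not have that shape.
    """
    match = re.match(r"^\d+\.\d+\.\d+-(\d+)", release)
    return int(match.group(1)) if match else None


def nsx_ovs_module_supported(release: str) -> bool:
    """Return True if the kernel build is one of the supported builds."""
    return rhel_kernel_build(release) in SUPPORTED_RHEL_KERNEL_BUILDS
```

On a host VM, the release string can be obtained with platform.release() from the Python standard library. If the check returns False, the two options described in the notes apply: change the kernel version, or set use_nsx_ovs_kernel_module to False.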
Support for upgrading to this release:
All previous 4.1.x releases
The "baseline policy" feature for NCP creates a dynamic group which selects all members in the cluster. NSX-T has a limit of 8,000 effective members of a dynamic group (for details, see Configuration Maximums). Therefore, this feature should not be enabled for clusters that are expected to grow beyond 8,000 pods. Exceeding this limit can cause delays in the creation of resources for the pods.
Transparent mode load balancer
Only north-south traffic for a Kubernetes cluster is supported. Intra-cluster traffic is not supported.
Not supported for services attached to a LoadBalancer CRD or when auto scaling is enabled. Auto scaling must be disabled for this feature to work.
It is recommended to use this feature only on newly deployed clusters.
Manager-to-policy migration
It is not possible to migrate a Kubernetes cluster if a previous migration failed and the cluster is rolled back. This is a limitation with NSX 4.0.0.1 or earlier releases only.
Issue 3398126: Deploying a new cluster fails in the pks-nsx-t-prepare-master-vm job
In the TKGI master VM, the job log file /var/vcap/data/sys/log/pks-nsx-t-prepare-master-vm/pre-start.stdout.log has an error message such as the following:
"create loadbalancer: update lb service: [PUT /infra/lb-services/{lb-service-id}][400] updateLBServiceBadRequest &{RelatedAPIError:{Details: ErrorCode:502001 ErrorData:<nil> ErrorMessage:Errors validating path=[/infra/lb-services/lb-pks-b5ef6df4-cd11-4461-8861-893533940ecb]. ModuleName:policy} RelatedErrors:[0xc0001b25a0]}"
Workarounds: Perform one of the following actions:
Use a different edge cluster with load balancer capacity on most nodes.
Add nodes to the current edge clusters (in pairs to allow deployment of both active and standby service router).
Reconfigure allocation pools for existing TKGI cluster's tier-1 router (this does not apply to routers created for namespaces in a dedicated tier-1 topology).
Issue 3396034: Manager-to-policy migration fails because the migration process cannot infer the Service UUID needed to migrate NCP-created load balancer pools
During a manager-to-policy migration, tags on certain NSX resources created by NCP need to be updated. This operation may require specific Kubernetes resources to exist. If these resources do not exist, migration fails and all NSX resources are rolled back to manager mode. In this case, manager-to-policy migration fails because an Ingress has rules that use a Service that does not exist in Kubernetes.
Workaround: Remove the Ingress rules that are using Kubernetes Services that no longer exist.
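To find the Ingress rules that need to be removed before migration, you can compare the backends of each Ingress with the Services that exist in the same namespace. The following is a minimal sketch that works on an Ingress manifest already loaded as a Python dictionary in the networking.k8s.io/v1 layout. The function name and the sample data are hypothetical, and the sketch does not check spec.defaultBackend.

```python
def missing_service_backends(ingress: dict, existing_services: set) -> list:
    """Return (host, path, service) tuples for every Ingress path whose
    backend Service name is not in existing_services.

    existing_services is the set of Service names in the Ingress's namespace.
    """
    missing = []
    for rule in ingress.get("spec", {}).get("rules", []):
        host = rule.get("host", "")
        for path in (rule.get("http") or {}).get("paths", []):
            backend = path.get("backend", {})
            service = (backend.get("service") or {}).get("name")
            if service and service not in existing_services:
                missing.append((host, path.get("path", ""), service))
    return missing


# Hypothetical Ingress with one backend that no longer exists.
sample_ingress = {
    "spec": {
        "rules": [
            {
                "host": "shop.example.com",
                "http": {
                    "paths": [
                        {"path": "/cart",
                         "backend": {"service": {"name": "cart", "port": {"number": 80}}}},
                        {"path": "/old",
                         "backend": {"service": {"name": "legacy", "port": {"number": 80}}}},
                    ]
                },
            }
        ]
    }
}

print(missing_service_backends(sample_ingress, {"cart"}))
```

Any tuple in the result identifies a rule that references a Service that is not present and is therefore a candidate for removal.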
Issue 3382101: After a Kubernetes service of type LoadBalancer was removed, its Load Balancer Virtual Server and Load Balancer pool are still configured on NSX
During the initial synchronization between NCP and NSX, NCP did not receive the full list of load balancer objects from NSX. The problem can also be caused by an unexpected error during garbage collection. Note that this issue will be resolved automatically if there is an NCP HA failover because a synchronization will be triggered.
Workaround: Restart the NCP master instance to re-trigger the synchronization.
Issue 3327390: In an OCP environment, nsx-node-agent has high memory usage
In some situations, the nsx-ovs container inside an nsx-node-agent pod may have high memory usage, and the memory usage keeps increasing. This is caused by the multicast snooping check in the nsx-ovs container.
Workaround:
For OpenShift 4.11 or later:
Step 1. Set enable_ovs_mcast_snooping to False in the nsx-ncp-operator-config ConfigMap:
[nsx_node_agent]
enable_ovs_mcast_snooping = False
Step 2. Disable the OVS liveness probe in the nsx-node-agent DaemonSet. Note that you must disable it again every time the operator restarts because the NCP operator reverts to the default nsx-node-agent DaemonSet manifest.
For OpenShift versions earlier than 4.11:
Step 1. Run the following command to clear the cache.
$ echo 2 > /proc/sys/vm/drop_caches
Step 2. Disable the OVS liveness probe in the nsx-node-agent DaemonSet. Note that you must disable it again every time the operator restarts because the NCP operator reverts to the default nsx-node-agent DaemonSet manifest.
Issue 3158230: nsx-ncp-bootstrap container fails to initialize while loading AppArmor profiles on Ubuntu 20.04
The nsx-ncp-bootstrap container in nsx-ncp-bootstrap DaemonSet fails to initialize because of different package versions of AppArmor on the host OS and the container image. The logs of the container show messages such as "Failed to load policy-features from '/etc/apparmor.d/abi/2.13': No such file or directory".
Workaround: Update AppArmor to version 2.13.3-7ubuntu5.2 or the latest available from focal-updates on the host OS.
Issue 3161931: nsx-ncp-bootstrap pod fails to run on Ubuntu 18.04 and Ubuntu 20.04 host VMs
The nsx-ncp-bootstrap container in the nsx-ncp-bootstrap pod fails to reload "AppArmor" with the following log messages: "Failed to load policy-features from '/etc/apparmor.d/abi/2.13': No such file or directory." The issue is caused by different versions of the "AppArmor" package installed in the image used to run the nsx-ncp-bootstrap pod and host OS. This issue does not exist on Ubuntu 22.04 host VMs.
Workaround: Ubuntu 18.04 is not supported. On Ubuntu 20.04, update "AppArmor" to the minimum version 2.13.3-7ubuntu5.2. The package is available via focal-updates.
Issue 2131494: NGINX Kubernetes Ingress still works after changing the Ingress class from nginx to nsx
When you create an NGINX Kubernetes Ingress, NGINX creates traffic forwarding rules. If you change the Ingress class to any other value, NGINX does not delete the rules and continues to apply them, even if you delete the Kubernetes Ingress after changing the class. This is a limitation of NGINX.
Workaround: To delete the rules created by NGINX, delete the Kubernetes Ingress when the class value is nginx. Then re-create the Kubernetes Ingress.
Issue 2999131: ClusterIP services not reachable from the pods
In a large-scale TKGI environment, ClusterIP services are not reachable from the pods. Other related issues are: (1) nsx-kube-proxy stops writing its logs; and (2) the OVS flows are not created on the node.
Workaround: Restart nsx-kube-proxy.
Issue 2984240: The "NotIn" operator in matchExpressions does not work in namespaceSelector for a network policy's rule
When specifying a rule for a network policy, if you specify namespaceSelector, matchExpressions and the "NotIn" operator, the rule does not work. The NCP log has the error message "NotIn operator is not supported in NS selectors."
Workaround: Rewrite matchExpressions to avoid using the "NotIn" operator.
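Before rewriting, it helps to locate the affected rules. The following is a minimal sketch that scans a NetworkPolicy manifest, already loaded as a Python dictionary, for peers whose namespaceSelector uses the "NotIn" operator. The function name and sample policy are hypothetical. The sketch only reports locations and does not rewrite anything.

```python
def find_notin_namespace_selectors(policy: dict) -> list:
    """Return locations such as 'ingress[0].from[1]' for every peer whose
    namespaceSelector has a matchExpressions entry with operator 'NotIn'."""
    hits = []
    spec = policy.get("spec", {})
    # Ingress rules list their peers under 'from', egress rules under 'to'.
    for section, peer_key in (("ingress", "from"), ("egress", "to")):
        for i, rule in enumerate(spec.get(section, [])):
            for j, peer in enumerate(rule.get(peer_key, [])):
                selector = peer.get("namespaceSelector") or {}
                expressions = selector.get("matchExpressions", [])
                if any(e.get("operator") == "NotIn" for e in expressions):
                    hits.append(f"{section}[{i}].{peer_key}[{j}]")
    return hits


# Hypothetical policy that allows traffic from every namespace except 'dev'.
sample_policy = {
    "spec": {
        "ingress": [
            {"from": [
                {"namespaceSelector": {"matchExpressions": [
                    {"key": "env", "operator": "NotIn", "values": ["dev"]}
                ]}}
            ]}
        ]
    }
}

print(find_notin_namespace_selectors(sample_policy))
```

One possible rewrite, if the set of allowed label values is known in advance, is to list those values with the "In" operator instead.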
Issue 2997828: Migration of cluster from Manager mode to Policy mode fails if Ingress has more than 255 rules
In Policy mode, an NSX load balancer can support a maximum of 255 rules. If a cluster has an Ingress resource that has more than 255 rules, migrating the cluster from Manager mode to Policy mode will fail.
Workaround: Create LoadBalancer CRDs to distribute the rules across multiple NSX load balancers.
Issue 3033821: After manager-to-policy migration, distributed firewall rules not enforced correctly
After a manager-to-policy migration, newly created network policy-related distributed firewall (DFW) rules will have higher priority than the migrated DFW rules.
Workaround: Use the policy API to change the sequence of DFW rules as needed.
For a Kubernetes service of type ClusterIP, the hairpin-mode flag is not supported
NCP does not support the hairpin-mode flag for a Kubernetes service of type ClusterIP.
Workaround: None
Issue 2224218: After a service or app is deleted, it takes 2 minutes to release the SNAT IP back to the IP pool
If you delete a service or app and recreate it within 2 minutes, it will get a new SNAT IP from the IP pool.
Workaround: After deleting a service or app, wait 2 minutes before recreating it if you want to reuse the same IP.
Issue 2404302: If multiple load balancer application profiles for the same resource type (for example, HTTP) exist on NSX-T, NCP will choose any one of them to attach to the Virtual Servers.
If multiple HTTP load balancer application profiles exist on NSX-T, NCP will choose any one of them with the appropriate x_forwarded_for configuration to attach to the HTTP and HTTPS Virtual Server. If multiple FastTCP and UDP application profiles exist on NSX-T, NCP will choose any one of them to attach to the TCP and UDP Virtual Servers, respectively. The load balancer application profiles might have been created by different applications with different settings. If NCP chooses to attach one of these load balancer application profiles to the NCP-created Virtual Servers, it might break the workflow of other applications.
Workaround: None
Issue 2518111: NCP fails to delete NSX-T resources that have been updated from NSX-T
NCP creates NSX-T resources based on the configurations that you specify. If you make any updates to those NSX-T resources through NSX Manager or the NSX-T API, NCP might fail to delete those resources and re-create them when it is necessary to do so.
Workaround: Do not update NSX-T resources created by NCP through NSX Manager or the NSX-T API.
Issue 2416376: NCP fails to process a TAS ASG (App Security Group) that binds to more than 128 Spaces
Because of a limit in NSX-T distributed firewall, NCP cannot process a TAS ASG that binds to more than 128 Spaces.
Workaround: Create multiple ASGs and bind each of them to no more than 128 Spaces.
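The split described in the workaround amounts to partitioning the list of Spaces into groups of at most 128. The following is a minimal sketch. The function name is hypothetical, and creating and binding the ASGs themselves is outside its scope.

```python
# Limit stated in the issue description above.
MAX_SPACES_PER_ASG = 128


def split_spaces(spaces: list, limit: int = MAX_SPACES_PER_ASG) -> list:
    """Partition spaces into consecutive groups of at most `limit` entries.

    Each group is intended to be bound to its own copy of the ASG.
    """
    return [spaces[i:i + limit] for i in range(0, len(spaces), limit)]


# Hypothetical example: 300 Spaces need three ASGs.
groups = split_spaces([f"space-{n}" for n in range(300)])
print([len(g) for g in groups])
```

With 300 Spaces, the sketch produces groups of 128, 128, and 44 Spaces.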
Issue 2537221: After upgrading NSX-T to 3.0, the networking status of container-related objects in the NSX Manager UI is shown as Unknown
In NSX Manager UI, the tab Inventory > Containers shows container-related objects and their status. In a TKGI environment, after upgrading NSX-T to 3.0, the networking status of the container-related objects is shown as Unknown. The issue is caused by the fact that TKGI does not detect the version change of NSX-T. This issue does not occur if NCP is running as a pod and the liveness probe is active.
Workaround: After the NSX-T upgrade, restart the NCP instances gradually (no more than 10 at the same time) so as not to overload NSX Manager.
Issue 2552564: In an OpenShift 4.3 environment, DNS forwarder might stop working if overlapping address found
In an OpenShift 4.3 environment, cluster installation requires that a DNS server be configured. If you use NSX-T to configure a DNS forwarder and there is IP address overlap with the DNS service, the DNS forwarder will stop working and cluster installation will fail.
Workaround: Configure an external DNS service, delete the cluster that failed to install and recreate the cluster.
Issue 2597423: When importing manager objects to policy, a rollback will cause the tags of some resources to be lost
When importing manager objects to policy, if a rollback is necessary, the tags of the following objects will not be restored:
Spoofguard profiles (part of shared and cluster resources)
BgpneighbourConfig (part of shared resources)
BgpRoutingConfig (part of shared resources)
StaticRoute BfdPeer (part of shared resources)
Workaround: For resources that are part of the shared resources, manually restore the tags. Use the backup and restore feature to restore resources that are part of cluster resources.
Issue 2579968: When changes are made to Kubernetes services of type LoadBalancer at a high frequency, some virtual servers and server pools are not deleted as expected
When changes are made to Kubernetes services of type LoadBalancer at a high frequency, some virtual servers and server pools might remain in the NSX-T environment when they should be deleted.
Workaround: Restart NCP. Alternatively, manually remove stale virtual servers and their associated resources. A virtual server is stale if no Kubernetes service of type LoadBalancer has the virtual server's identifier in the external_id tag.
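The staleness rule can be expressed as a small filter. The following is a minimal sketch under several assumptions: the virtual servers have already been retrieved from NSX as dictionaries with a "tags" list of {"scope", "tag"} pairs, the external_id tag on a virtual server carries the UID of the owning Kubernetes service, and the tag scope is passed in by the caller because the exact scope string may differ in your environment. The function name and sample data are hypothetical.

```python
def stale_virtual_servers(virtual_servers: list, lb_service_uids: set,
                          scope: str = "external_id") -> list:
    """Return the ids of virtual servers whose external_id tag does not match
    the UID of any existing Kubernetes service of type LoadBalancer.

    Virtual servers without an external_id tag are skipped, because they
    may not have been created by NCP.
    """
    stale = []
    for vs in virtual_servers:
        external_ids = {t.get("tag") for t in vs.get("tags", [])
                        if t.get("scope") == scope}
        if external_ids and not (external_ids & lb_service_uids):
            stale.append(vs.get("id"))
    return stale


# Hypothetical inventory: 'vs-b' points at a service that no longer exists.
sample_servers = [
    {"id": "vs-a", "tags": [{"scope": "external_id", "tag": "uid-1"}]},
    {"id": "vs-b", "tags": [{"scope": "external_id", "tag": "uid-2"}]},
    {"id": "vs-c", "tags": []},
]

print(stale_virtual_servers(sample_servers, {"uid-1"}))
```

Review the result before deleting anything, and remove the associated resources of each stale virtual server as well.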
NCP fails to start when "logging to file" is enabled during Kubernetes installation
This issue happens when uid:gid=1000:1000 on the container host does not have permission to the log folder.
Workaround: Do one of the following:
Change the mode of the log folder to 777 on the container hosts.
Grant “rwx” permission of the log folder to uid:gid=1000:1000 on the container hosts.
Disable the “logging to file” feature.
Issue 2653214: Error while searching the segment port for a node after the node's IP address was changed
After changing a node's IP address, if you upgrade NCP or if the NCP operator pod is restarted, checking the NCP operator status with the command "oc describe co nsx-ncp" will show the error message "Error while searching segment port for node ..."
Workaround: None. Adding a static IP address on a node interface which also has DHCP configuration is not supported.
Issue 2672677: In a highly stressed OpenShift 4 environment, a node can become unresponsive
In an OpenShift 4 environment with a high level of pod density per node and a high frequency of pods getting deleted and created, a RHCOS node might go into a "Not Ready" state. Pods running on the affected node, with the exception of daemonset members, will be evicted and recreated on other nodes in the environment.
Workaround: Reboot the impacted node.
Issue 2707174: A Pod that is deleted and recreated with the same namespace and name has no network connectivity
If a Pod is deleted and recreated with the same namespace and name when NCP is not running and nsx-ncp-agents are running, the Pod might get wrong network configurations and not be able to access the network.
Workaround: Delete the Pod and recreate it when NCP is running.
Issue 2745904: The feature "Use IPSet for default running ASG" does not support removing or replacing an existing container IP block
If you enable "Use IPSet for default running ASG" on an NCP tile, NCP will create a dedicated NSGroup for all the container IP blocks configured by "IP Blocks of Container Networks" on the same NCP tile. This NSGroup will be used in the firewall rules created for global running ASGs to allow traffic for all the containers. If you later remove or replace an existing container IP block, it will be removed or replaced in the NSGroup. All the existing containers in the original IP block will no longer be associated with the global running ASGs. Their traffic might no longer work.
Workaround: Only append new IP blocks to "IP Blocks of Container Networks".
Issue 2745907: "monit" commands return incorrect status information for nsx-node-agent
On a diego_cell VM, when monit restarts nsx-node-agent, if it takes more than 30 seconds for nsx-node-agent to fully start, monit will show the status of nsx-node-agent as "Execution failed" and will not update its status to "running" even when nsx-node-agent is fully functional later.
Workaround: None.
Issue 2735244: nsx-node-agent and nsx-kube-proxy crash because of liveness probe failure
nsx-node-agent and nsx-kube-proxy use sudo to run some commands. If there are many entries in /etc/resolv.conf about DNS server and search domains, sudo can take a long time to resolve hostnames. This will cause nsx-node-agent and nsx-kube-proxy to be blocked by the sudo command for a long time, and liveness probe will fail.
Workaround: Perform one of the two following actions:
Add hostname entries to /etc/hosts. For example, if hostname is 'host1', add the entry '127.0.0.1 host1'.
Set a larger value for the nsx-node-agent liveness probe timeout. Run the command 'kubectl edit ds nsx-node-agent -n nsx-system' to update the timeout value for both the nsx-node-agent and nsx-kube-proxy containers.
Issue 2736412: Parameter members_per_small_lbs is ignored if max_allowed_virtual_servers is set
If both max_allowed_virtual_servers and members_per_small_lbs are set, virtual servers may fail to attach to an available load balancer because only max_allowed_virtual_servers is taken into account.
Workaround: Relax the scale constraints instead of enabling auto scaling.
Issue 2740552: When deleting a static pod using api-server, nsx-node-agent does not remove the pod's OVS bridge port, and the network of the static pod which is re-created automatically by Kubernetes is unavailable
Kubernetes does not allow removing a static pod through api-server. Kubernetes creates a mirror pod of the static pod so that the static pod is visible through api-server. When you delete the pod through api-server, only the mirror pod is deleted, and NCP receives and handles the delete request by removing all NSX resources allocated for the pod. However, the static pod still exists, and nsx-node-agent does not get a delete request from CNI to remove the OVS bridge port of the static pod.
Workaround: Remove the static pod by deleting the manifest file instead of removing the static pod by api-server.
Issue 2795482: Running pod stuck in ContainerCreating state after node/hypervisor reboot or any other operation
If the wait_for_security_policy_sync flag is true, a pod can go to ContainerCreating state after being in running state for more than one hour because of a worker node hard reboot, hypervisor reboot, or some other reason. The pod will be in the creating state forever.
Workaround: Delete and recreate the pod.
Issue 2841030: With Kubernetes 1.22, the status of nsx-node-agent is always 'AppArmor'
With Kubernetes 1.22, when the nsx-node-agent pods are "Ready", their status is not updated from "AppArmor" to "Running". This does not impact the functionality of NCP or nsx-node-agent.
Workaround: Restart the nsx-node-agent pods.
Issue 2824129: A node has the status network-unavailable equal to true for more than 3 minutes after a restart
If you use NCP operator to manage NCP's lifecycle, when an nsx-node-agent daemonset recovers from a non-running state, its node will have the status network-unavailable equal to true until it has been running for 3 minutes. This is expected behavior.
Workaround: Wait for at least 3 minutes after nsx-node-agent restarts.
Issue 2868572: Open vSwitch (OVS) must be disabled on host VM before running NCP
To deploy NCP on a host VM, you must first stop OVS-related processes and delete some files on the host using the following commands:
sudo systemctl disable openvswitch-switch.service
sudo systemctl stop openvswitch-switch.service
rm -rf /var/run/openvswitch
If you have already deployed NCP on a host VM, and OVS is not running correctly, perform the following steps to recover:
Run the three commands above.
Delete nsx-node-agent pods on the nodes having the issue to restart the node agent pods with the command "kubectl delete pod $agent-pod -n nsx-system".
Workaround: See above.
Issue 2940772: Migrating NCP resources from Manager to Policy results in failure with NSX-T 3.2.0
Migrating NCP resources from Manager to Policy is supported with NSX-T 3.1.3 and NSX-T 3.2.1, but not NSX-T 3.2.0.
Workaround: None
Issue 2934195: Some types of NSX groups are not supported for distributed firewall rules
An NSX group of type "IP Addresses Only" is not supported for distributed firewall (DFW) rules. An NSX group of type "Generic" with manually added IP addresses as members is also not supported.
Workaround: None
Issue 2936436: NSX Manager UI does not show the NCP version on the container cluster page
When NSX Manager UI displays the container clusters in the inventory tab, the NCP version is not displayed.
Workaround: The NCP version is available by calling the API /policy/api/v1/fabric/container-clusters.
Issue 2961789: After migrating manager objects to policy, some of the health-check pod's related resources cannot be deleted
After migrating manager objects to policy, when you delete the health-check pod, the pod's related segment port and the distributed firewall rule's target group are not deleted.
Workaround: Manually delete those resources.
Issue 2966586: After migrating manager objects to policy, namespace creation fails
If an IP block is created in manager mode, after manager objects are migrated to policy, namespace creation fails because NCP cannot allocate subnets from this IP block.
Workaround: Create new IP blocks in policy mode and configure NCP to use these new IP blocks.
Issue 2972811: In a large-scale environment, the hyperbus connection to some worker nodes is down
In a large-scale environment, pod creation can get stuck for 10-15 minutes due to rpc channel timeout. The following issues may occur:
In a Kubernetes cluster, some pods will have the status ContainerCreating for 10-15 minutes.
In cfgAgent, the tunnel will have the status COMMUNICATION_ERROR for 10-15 minutes.
In the NSX UI, an alarm may be generated indicating that the hyperbus connection is down.
Workaround: None needed. This issue will automatically recover after 10-15 minutes.
Issue 2960121: For services of type LoadBalancer, connectivity to pods on Windows worker nodes fails if not configured correctly
For services of type LoadBalancer, connectivity to pods on Windows worker nodes will fail if NCP is configured to use the default LB segment subnet. The default subnet 169.254.128.0/22 belongs to the IPv4 link-local space and is not forwarded on a Windows node.
Workaround: Configure NCP to use a non-default LB segment subnet. To do this, set the parameter lb_segment_subnet in the nsx_v3 section. Note that this will only have effect on newly created NSX load balancers.
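When choosing a value for lb_segment_subnet, you can confirm that the candidate is outside the link-local space. The following is a minimal sketch that uses the Python standard library ipaddress module. The function name and the replacement subnet 192.168.100.0/24 are hypothetical and only illustrate the check.

```python
import ipaddress

# Default LB segment subnet named in the issue description above.
DEFAULT_LB_SEGMENT_SUBNET = "169.254.128.0/22"


def is_link_local_subnet(cidr: str) -> bool:
    """Return True if the whole subnet lies in the link-local range
    (169.254.0.0/16 for IPv4)."""
    return ipaddress.ip_network(cidr).is_link_local


print(is_link_local_subnet(DEFAULT_LB_SEGMENT_SUBNET))  # True
print(is_link_local_subnet("192.168.100.0/24"))         # False
```

A subnet for which the check returns False avoids the forwarding limitation described for Windows nodes. Whether it is otherwise suitable depends on your addressing plan.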
Issue 3082030: Deployment of application instances on a Diego Cell fails after an NSX cfgAgent service restart
In some cases, after an NSX cfgAgent service restart on the hypervisor, the nsx-node-agent instance running in a Diego Cell might reject hyperbus connections. As a result, deployment of application instances on the Diego Cell will fail. The Diego Cell will still be in HEALTHY state. NSX-node-agent logs on the Diego cell will have a log message such as "nsx_ujo.agent.agent Agent is exiting as connection is unavailable." NSX syslog on the ESXi host will have a message such as "[nsx@6876 comp="nsx-controller" subcomp="cfgAgent" s2comp="nsx-net" tid="XXXXXX" level="warn"] StreamConnection[YYY Connecting to tcp://169.254.1.XX:2345 sid:YYY] Couldn't connect to 'tcp://169.254.1.XX:2345' (error: 111-Connection refused)."
Workaround:
Open an SSH session to the affected Diego Cell and restart the nsx-node-agent service with the command "monit restart nsx-node-agent".
Rebuild the affected Diego Cell.
Issue 3066449: Namespace subnets are not always allocated from the first available IP block when use_ip_blocks_in_order is set to True
When creating multiple namespaces with use_ip_blocks_in_order set to True, the first namespace's subnet is sometimes not allocated from the first available IP block. For example, assume that container_ip_blocks = '172.52.0.0/28,172.53.0.0/28', the subnet prefix length is 29, and subnet 172.52.0.0/29 is already allocated. If you create 2 namespaces ns-1 and ns-2, the subnet allocation could be (1) ns-1: 172.52.0.8/29, ns-2: 172.53.0.0/29, or (2) ns-1: 172.53.0.0/29, ns-2: 172.52.0.8/29.
The use_ip_blocks_in_order parameter only ensures that different IP blocks are used in the order they appear in the container_ip_blocks parameter. When creating multiple namespaces at the same time, any namespace may request a subnet through an API call before another namespace. Therefore, there is no guarantee that a specific namespace will be allocated a subnet from a specific IP block.
Workaround: Create the namespaces separately, that is, create the first namespace, make sure its subnet has been allocated, and then create the next namespace.
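The example in this issue can be reproduced with the Python standard library ipaddress module. The following is a minimal sketch that lists the free subnets in block order. It illustrates the arithmetic only and is not how NCP allocates subnets internally. The function name is hypothetical.

```python
import ipaddress


def free_subnets(container_ip_blocks: str, prefix_len: int, allocated: set) -> list:
    """List unallocated subnets of the given prefix length, visiting the
    IP blocks in the order they appear in container_ip_blocks."""
    free = []
    for block in container_ip_blocks.split(","):
        network = ipaddress.ip_network(block.strip())
        for subnet in network.subnets(new_prefix=prefix_len):
            if str(subnet) not in allocated:
                free.append(str(subnet))
    return free


# Values from the example above: 172.52.0.0/29 is already allocated.
candidates = free_subnets("172.52.0.0/28,172.53.0.0/28", 29, {"172.52.0.0/29"})
print(candidates)
# ['172.52.0.8/29', '172.53.0.0/29', '172.53.0.8/29']
```

The first block has only one free /29 left, so two namespaces created at the same time receive 172.52.0.8/29 and 172.53.0.0/29 between them. Which namespace receives which subnet depends on whose request arrives first, which is why creating the namespaces one at a time gives a predictable result.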