VMware NSX Container Plugin 3.2.0.1   |   11 November, 2021   |   Build 18876345

Check regularly for additions and updates to this document.

What's New

NSX Container Plugin (NCP) 3.2.0.1 has the following new features:
Note: All new features are supported in Policy mode only unless Manager mode support is mentioned.
  • Support for using the same wildcard certificate in Kubernetes Ingress for multiple Kubernetes clusters (multiple NCP instances).
  • Support for installing OpenShift with NCP using the IPI (Installer Provisioned Infrastructure) method, in addition to the UPI (User Provisioned Infrastructure) method.
  • New “baseline_policy_type: allow_namespace_strict” configuration option. With this option set, the default policy is to drop any communication between namespaces and to outside services unless specifically allowed by Kubernetes Network Policy or NSX-T manual Admin Policies.
  • Support for the new endPort field for Kubernetes Network Policies.
  • Ability to specify subnets for a Kubernetes namespace. (If specifying IPv6 subnets, 64-bit subnet prefix is not supported.)
  • StatefulSet persistent IP allocation based on namespace subnet.
  • Ability to create client-IP based session affinity for a Kubernetes service of type ClusterIP. This feature is also available in Manager mode.
  • Support for Tanzu Application Service (TAS) in Policy mode.
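
As a rough illustration of the endPort item in the list above: in the upstream Kubernetes NetworkPolicy API, port together with endPort describes an inclusive range of ports. The sketch below builds such a rule as a plain Python dictionary and checks which ports fall inside it. The policy name, namespace, labels, and port numbers are hypothetical, and the helper function is not part of NCP.

```python
import json

# Hypothetical NetworkPolicy; names, labels, and ports are illustrative only.
policy = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "allow-port-range", "namespace": "demo"},
    "spec": {
        "podSelector": {"matchLabels": {"app": "backend"}},
        "policyTypes": ["Egress"],
        "egress": [
            {"ports": [{"protocol": "TCP", "port": 32000, "endPort": 32768}]}
        ],
    },
}


def port_allowed(rule_port: dict, port: int) -> bool:
    """Return True if `port` falls inside the rule's inclusive port range."""
    start = rule_port["port"]
    end = rule_port.get("endPort", start)  # without endPort, a single port
    return start <= port <= end


rule = policy["spec"]["egress"][0]["ports"][0]
print(port_allowed(rule, 32500))
print(port_allowed(rule, 40000))
print(json.dumps(policy, indent=2))  # kubectl apply -f also accepts JSON manifests
```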

Deprecation Notice

The annotation "ncp/whitelist-source-range" will be deprecated in NCP 3.3. Starting with NCP 3.1.1, you can use the annotation "ncp/allowed-source-range" instead.
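
One possible way to prepare for the deprecation is to rename the annotation key in existing manifests. The sketch below does this on a plain dictionary of annotations. The sample annotation values are placeholders, and the helper is an illustration rather than a tool shipped with NCP.

```python
OLD_KEY = "ncp/whitelist-source-range"
NEW_KEY = "ncp/allowed-source-range"


def migrate_annotations(annotations: dict) -> dict:
    """Return a copy with the deprecated key renamed to its replacement."""
    updated = dict(annotations)
    if OLD_KEY in updated:
        value = updated.pop(OLD_KEY)
        # Keep an already-present new annotation rather than overwriting it.
        updated.setdefault(NEW_KEY, value)
    return updated


# Hypothetical annotation set; the CIDR and the second key are placeholders.
before = {OLD_KEY: "192.168.10.0/24", "example.com/owner": "team-a"}
print(migrate_annotations(before))
```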

Compatibility Requirements

Product                                              |   Version
NCP/NSX-T Tile for Tanzu Application Service (TAS)   |   3.2
NSX-T                                                |   3.1.2, 3.1.3
vSphere                                              |   6.7, 7.0
Kubernetes                                           |   1.19, 1.20, 1.21, 1.22
OpenShift 4                                          |   RHCOS 4.6, 4.7, 4.8
Kubernetes Host VM OS                                |   Ubuntu 18.04, 20.04
                                                     |   CentOS 7.8, 7.9, 8.2
                                                     |   RHEL 8.2, 8.4
                                                     |   See notes below.
Tanzu Application Service                            |   Ops Manager 2.7 + TAS 2.7 (LTS) (End of support date: 30 April 2022)
                                                     |   Ops Manager 2.10 + TAS 2.10 (End of support date: 31 March 2022)
                                                     |   Ops Manager 2.10 + TAS 2.11
                                                     |   Ops Manager 2.10 + TAS 2.12 (End of support date: 31 March 2023)
Tanzu Kubernetes Grid Integrated (TKGI)              |   1.13

Notes:

The installation of the nsx-ovs kernel module on CentOS/RHEL requires a specific kernel version. The supported CentOS/RHEL kernel versions are 1127, 1160, 193, 240, and 305, regardless of the CentOS/RHEL version. The default kernel version is 1127 for RHEL 7.8, 1160 for RHEL 7.9, 193 for RHEL 8.2, 240 for RHEL 8.3, and 305 for RHEL 8.4. If you are running a different kernel version, do one of the following:

  • Change your kernel version to one that is supported. When you change the kernel version and then restart the VM, make sure that the IP address and static routes are persisted on the uplink interface (specified by ovs_uplink_port) so that connectivity to the Kubernetes API server is not lost.
  • Skip the installation of the nsx-ovs kernel module by setting "use_nsx_ovs_kernel_module" to "False" under the "nsx_node_agent" section in the nsx-node-agent config map.
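
A quick way to check a host against the list above is to compare its kernel build number with the supported values. The sketch below assumes that the numbers in this note correspond to the build number after the hyphen in the kernel release string (for example, 3.10.0-1160.el7.x86_64 as printed by uname -r). The function names are hypothetical, and the note does not say whether patched builds such as 1160.2.1 are covered, so treat the result as a hint only.

```python
import re

# Kernel builds listed as supported in the note above.
SUPPORTED_BUILDS = {1127, 1160, 193, 240, 305}


def kernel_build(release: str) -> int:
    """Extract the build number from a release string such as `uname -r` prints."""
    match = re.match(r"^\d+\.\d+\.\d+-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1))


def nsx_ovs_module_supported(release: str) -> bool:
    """Return True if the build number is in the supported list."""
    return kernel_build(release) in SUPPORTED_BUILDS


print(nsx_ovs_module_supported("4.18.0-193.el8.x86_64"))
print(nsx_ovs_module_supported("4.18.0-348.el8.x86_64"))
```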

To run the nsx-ovs kernel module on RHEL 8.2 (kernel version 193), you must disable the "UEFI secure boot" option under "Boot Options" in the VM's settings in vCenter Server.

Starting with NCP 3.1.2, the RHEL image will not be distributed. For all supported integrations, use the Red Hat Universal Base Image (UBI). For more information, see https://www.redhat.com/en/blog/introducing-red-hat-universal-base-image.

Support for upgrading to this release:

  • All previous 3.1.x releases

 

Resolved Issues

  • For a Kubernetes service of type ClusterIP, Client-IP based session affinity is not supported

    NCP does not support Client-IP based session affinity for a Kubernetes service of type ClusterIP.

    Workaround: None

  • Issue 2697547: HostPort not supported on RHEL/CentOS/RHCOS nodes

    You can specify hostPorts on native Kubernetes and TKGI on Ubuntu nodes by setting 'enable_hostport_snat' to True in nsx-node-agent ConfigMap. However, on RHEL/CentOS/RHCOS nodes hostPort is not supported and the parameter 'enable_hostport_snat' is ignored.

    Workaround: None

  • Issue 2713782: NSX API calls fail with the error "SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC"

    Occasionally, at NCP startup, NCP might restart or fail to initialize load balancer services due to the presence of a duplicated load balancing server or a tier-1 logical router for the load balancer. Also, while NCP is running, an NSX endpoint might be reported as DOWN for a brief period of time (less than 1 second). If the load balancer fails to initialize, the NCP log will have the message "Failed to initialize loadbalancer services." 

    This behavior will only occur when NCP is doing client-side load balancing across multiple NSX manager instances. It will not occur when a single API endpoint is configured in ncp.ini.

    Workaround: Increase the value of the nsx_v3.conn_idle_timeout parameter. Note that this might result in a longer wait time for endpoints to be detected as being available after a temporary disconnection when using client-side load balancing.
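
    The workaround above amounts to editing one option in the nsx_v3 section of ncp.ini. The sketch below shows that change on an in-memory copy of the file with Python's configparser. The sample values, the factor of three, and the nsx_api_managers line are assumptions for illustration, not defaults or recommendations.

```python
import configparser
import io

# Hypothetical excerpt of ncp.ini; the values are placeholders, not defaults.
ncp_ini = """\
[nsx_v3]
nsx_api_managers = 192.0.2.10,192.0.2.11,192.0.2.12
conn_idle_timeout = 10
"""

config = configparser.ConfigParser()
config.read_string(ncp_ini)

current = config.getint("nsx_v3", "conn_idle_timeout")
config.set("nsx_v3", "conn_idle_timeout", str(current * 3))  # arbitrary increase

buffer = io.StringIO()
config.write(buffer)  # serialize the updated configuration back to INI text
print(buffer.getvalue())
```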

  • Issue 2795268: Connection between nsx-node-agent and hyperbus flips and Kubernetes pod is stuck at creating state

    In a large-scale environment, nsx-node-agent might fail to connect to Kubernetes apiserver to get pod information. Because of the large amount of information being transferred, keepalive messages cannot be sent to hyperbus, and hyperbus will close the connection.

    Workaround: Restart nsx-node-agent. Make sure Kubernetes apiserver is available and the certificate to connect to apiserver is correct.

Known Issues

  • Issue 2131494: NGINX Kubernetes Ingress still works after changing the Ingress class from nginx to nsx

    When you create an NGINX Kubernetes Ingress, NGINX creates traffic forwarding rules. If you change the Ingress class to any other value, NGINX does not delete the rules and continues to apply them, even if you delete the Kubernetes Ingress after changing the class. This is a limitation of NGINX.

    Workaround: To delete the rules created by NGINX, delete the Kubernetes Ingress when the class value is nginx. Then re-create the Kubernetes Ingress.

  • For a Kubernetes service of type ClusterIP, the hairpin-mode flag is not supported

    NCP does not support the hairpin-mode flag for a Kubernetes service of type ClusterIP.

    Workaround: None

  • Issue 2224218: After a service or app is deleted, it takes 2 minutes to release the SNAT IP back to the IP pool

    If you delete a service or app and recreate it within 2 minutes, it will get a new SNAT IP from the IP pool.

    Workaround: After deleting a service or app, wait 2 minutes before recreating it if you want to reuse the same IP.

  • Issue 2404302: If multiple load balancer application profiles for the same resource type (for example, HTTP) exist on NSX-T, NCP will choose any one of them to attach to the Virtual Servers.

    If multiple HTTP load balancer application profiles exist on NSX-T, NCP will choose any one of them with the appropriate x_forwarded_for configuration to attach to the HTTP and HTTPS Virtual Server. If multiple FastTCP and UDP application profiles exist on NSX-T, NCP will choose any one of them to attach to the TCP and UDP Virtual Servers, respectively. The load balancer application profiles might have been created by different applications with different settings. If NCP chooses to attach one of these load balancer application profiles to the NCP-created Virtual Servers, it might break the workflow of other applications.

    Workaround: None

  • Issue 2518111: NCP fails to delete NSX-T resources that have been updated from NSX-T

    NCP creates NSX-T resources based on the configurations that you specify. If you make any updates to those NSX-T resources through NSX Manager or the NSX-T API, NCP might fail to delete those resources and re-create them when it is necessary to do so.

    Workaround: Do not update NSX-T resources created by NCP through NSX Manager or the NSX-T API.

  • Issue 2416376: NCP fails to process a TAS ASG (App Security Group) that binds to more than 128 Spaces

    Because of a limit in NSX-T distributed firewall, NCP cannot process a TAS ASG that binds to more than 128 Spaces.

    Workaround: Create multiple ASGs and bind each of them to no more than 128 Spaces.
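
    The workaround above can be planned by splitting the list of Spaces into groups of at most 128, one group per ASG. The following is a minimal sketch of that split. The Space names are made up, and creating and binding the ASGs themselves is left out.

```python
MAX_SPACES_PER_ASG = 128  # limit stated in the issue description


def split_spaces(spaces: list, limit: int = MAX_SPACES_PER_ASG) -> list:
    """Split a list of Space names into groups of at most `limit` entries."""
    return [spaces[i:i + limit] for i in range(0, len(spaces), limit)]


# 300 hypothetical Space names.
spaces = [f"space-{n}" for n in range(300)]
groups = split_spaces(spaces)
print([len(group) for group in groups])  # [128, 128, 44]
```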

  • Issue 2537221: After upgrading NSX-T to 3.0, the networking status of container-related objects in the NSX Manager UI is shown as Unknown

    In NSX Manager UI, the tab Inventory > Containers shows container-related objects and their status. In a TKGI environment, after upgrading NSX-T to 3.0, the networking status of the container-related objects is shown as Unknown. The issue is caused by the fact that TKGI does not detect the version change of NSX-T. This issue does not occur if NCP is running as a pod and the liveness probe is active.

    Workaround: After the NSX-T upgrade, restart the NCP instances gradually (no more than 10 at the same time) so as not to overload NSX Manager.

  • Issue 2552564: In an OpenShift 4.3 environment, DNS forwarder might stop working if an overlapping address is found

    In an OpenShift 4.3 environment, cluster installation requires that a DNS server be configured. If you use NSX-T to configure a DNS forwarder and there is IP address overlap with the DNS service, the DNS forwarder will stop working and cluster installation will fail.

    Workaround: Configure an external DNS service, delete the cluster that failed to install and recreate the cluster.

  • Issue 2555336: Pod traffic not working due to duplicate logical ports created in Manager mode

    This issue is more likely to occur when there are many pods in several clusters. When you create a pod, traffic to the pod does not work. NSX-T shows multiple logical ports created for the same container. In the NCP log only the ID of one of the logical ports can be found. 

    Workaround: Delete the pod and recreate it. The stale ports on NSX-T will be removed when NCP restarts.

  • Issue 2597423: When importing manager objects to policy, a rollback will cause the tags of some resources to be lost

    When importing manager objects to policy, if a rollback is necessary, the tags of the following objects will not be restored:

    • Spoofguard profiles (part of shared and cluster resources)
    • BgpneighbourConfig (part of shared resources)
    • BgpRoutingConfig (part of shared resources)
    • StaticRoute BfdPeer (part of shared resources)

    Workaround: For resources that are part of the shared resources, manually restore the tags. Use the backup and restore feature to restore resources that are part of cluster resources.

  • Issue 2579968: When changes are made to Kubernetes services of type LoadBalancer at a high frequency, some virtual servers and server pools are not deleted as expected

    When changes are made to Kubernetes services of type LoadBalancer at a high frequency, some virtual servers and server pools might remain in the NSX-T environment when they should be deleted.

    Workaround: Restart NCP. Alternatively, manually remove stale virtual servers and their associated resources. A virtual server is stale if no Kubernetes service of type LoadBalancer has the virtual server's identifier in the external_id tag.
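
    The staleness rule above can be expressed as a small filter. The sketch below reads the rule as follows: a virtual server carries an external_id tag, and it is stale when that value matches no existing Kubernetes service of type LoadBalancer. The tag scope name, the sample data, and the function name are assumptions; real data would come from the NSX-T and Kubernetes APIs.

```python
EXTERNAL_ID_SCOPE = "external_id"  # assumed tag scope; verify on your NSX-T objects


def find_stale_virtual_servers(virtual_servers, lb_service_ids):
    """Return display names of tagged virtual servers that match no service."""
    stale = []
    for vs in virtual_servers:
        tags = {t["scope"]: t["tag"] for t in vs.get("tags", [])}
        external_id = tags.get(EXTERNAL_ID_SCOPE)
        if external_id is None:
            continue  # no external_id tag, so this check does not apply
        if external_id not in lb_service_ids:
            stale.append(vs["display_name"])
    return stale


# Hypothetical inventory data standing in for NSX-T and Kubernetes API results.
virtual_servers = [
    {"display_name": "vs-a", "tags": [{"scope": "external_id", "tag": "uid-1"}]},
    {"display_name": "vs-b", "tags": [{"scope": "external_id", "tag": "uid-9"}]},
    {"display_name": "vs-manual", "tags": []},
]
lb_service_ids = {"uid-1", "uid-2"}

print(find_stale_virtual_servers(virtual_servers, lb_service_ids))  # ['vs-b']
```

    Review the result by hand before deleting anything, because the sketch only compares identifiers.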

  • NCP fails to start when "logging to file" is enabled during Kubernetes installation

    This issue happens when uid:gid=1000:1000 on the container host does not have permission to access the log folder.

    Workaround: Do one of the following:

    • Change the mode of the log folder to 777 on the container hosts.
    • Grant “rwx” permission of the log folder to uid:gid=1000:1000 on the container hosts.
    • Disable the “logging to file” feature.

  • Issue 2653214: Error while searching the segment port for a node after the node's IP address was changed

    After changing a node's IP address, if you upgrade NCP or if the NCP operator pod is restarted, checking the NCP operator status with the command "oc describe co nsx-ncp" will show the error message "Error while searching segment port for node ..."

    Workaround: None. Adding a static IP address on a node interface which also has DHCP configuration is not supported.

  • Issue 2664457: While using DHCP in OpenShift, connectivity might be temporarily lost when nsx-node-agent starts or restarts

    nsx-ovs creates and activates 5 temporary connection profiles to configure ovs_bridge but their activation might keep failing temporarily in NetworkManager. As a result, no IP (connectivity) is present on the VM on ovs_uplink_port and/or ovs_bridge.

    Workaround: Restart the VM or wait until all the profiles can be successfully activated by NetworkManager.

  • Issue 2672677: In a highly stressed OpenShift 4 environment, a node can become unresponsive

    In an OpenShift 4 environment with a high level of pod density per node and a high frequency of pods getting deleted and created, a RHCOS node might go into a "Not Ready" state. Pods running on the affected node, with the exception of daemonset members, will be evicted and recreated on other nodes in the environment.

    Workaround: Reboot the impacted node.

  • Issue 2707174: A Pod that is deleted and recreated with the same namespace and name has no network connectivity

    If a Pod is deleted and recreated with the same namespace and name when NCP is not running and nsx-ncp-agents are running, the Pod might get wrong network configurations and not be able to access the network.

    Workaround: Delete the Pod and recreate it when NCP is running.

  • Issue 2745904: The feature "Use IPSet for default running ASG" does not support removing or replacing an existing container IP block

    If you enable "Use IPSet for default running ASG" on an NCP tile, NCP will create a dedicated NSGroup for all the container IP blocks configured by "IP Blocks of Container Networks" on the same NCP tile. This NSGroup will be used in the firewall rules created for global running ASGs to allow traffic for all the containers. If you later remove or replace an existing container IP block, it will be removed or replaced in the NSGroup. All the existing containers in the original IP block will no longer be associated with the global running ASGs. Their traffic might no longer work.

    Workaround: Only append new IP blocks to "IP Blocks of Container Networks".

  • Issue 2745907: "monit" commands return incorrect status information for nsx-node-agent

    On a diego_cell VM, when monit restarts nsx-node-agent, if it takes more than 30 seconds for nsx-node-agent to fully start, monit will show the status of nsx-node-agent as "Execution failed" and will not update its status to "running" even when nsx-node-agent is fully functional later.

    Workaround: None.

  • Issue 2735244: nsx-node-agent and nsx-kube-proxy crash because of liveness probe failure

    nsx-node-agent and nsx-kube-proxy use sudo to run some commands. If /etc/resolv.conf contains many DNS server and search domain entries, sudo can take a long time to resolve hostnames. nsx-node-agent and nsx-kube-proxy are then blocked by the sudo command for a long time, and the liveness probe fails.

    Workaround: Perform one of the two following actions:

    • Add hostname entries to /etc/hosts. For example, if hostname is 'host1', add the entry '127.0.0.1   host1'.
    • Set a larger value for the nsx-node-agent liveness probe timeout. Run the command 'kubectl edit ds nsx-node-agent -n nsx-system' to update the timeout value for both the nsx-node-agent and nsx-kube-proxy containers.

  • Issue 2744557: Complex regular expression patterns containing both a capture group () and {0} not supported for Ingress path matching

    For example, if the regular expression (regex) pattern is: /foo/bar/(abc){0,1}, it will not match /foo/bar/.

    Workaround: Do not use capture group () and {0} when creating an Ingress regex rule. Use the regular pattern EQUALS to match /foo/bar/.
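
    For comparison, the sketch below shows how a general-purpose regex engine (Python's re module) treats the pattern from the example. It only illustrates the behavior a user might expect. It does not reproduce how NCP or the NSX load balancer evaluates Ingress paths.

```python
import re

pattern = r"/foo/bar/(abc){0,1}"

# In Python's `re` module, {0,1} lets the capture group match zero times,
# so the first two paths match the whole pattern and the third does not.
print(bool(re.fullmatch(pattern, "/foo/bar/")))        # True
print(bool(re.fullmatch(pattern, "/foo/bar/abc")))     # True
print(bool(re.fullmatch(pattern, "/foo/bar/abcabc")))  # False
```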

  • Issue 2736412: Parameter members_per_small_lbs is ignored if max_allowed_virtual_servers is set

    If both max_allowed_virtual_servers and members_per_small_lbs are set, virtual servers may fail to attach to an available load balancer because only max_allowed_virtual_servers is taken into account.

    Workaround: Relax the scale constraints instead of enabling auto scaling.

  • Issue 2740552: When deleting a static pod using api-server, nsx-node-agent does not remove the pod's OVS bridge port, and the network of the static pod which is re-created automatically by Kubernetes is unavailable

    Kubernetes does not allow removing a static pod through the api-server. Kubernetes creates a mirror pod of the static pod so that the static pod can be found through the api-server. When you delete the pod through the api-server, only the mirror pod is deleted, and NCP receives and handles the delete request to remove all NSX resources allocated for the pod. However, the static pod still exists, and nsx-node-agent does not get the delete request from CNI to remove the OVS bridge port of the static pod.

    Workaround: Remove the static pod by deleting the manifest file instead of removing the static pod by api-server.

  • Issue 2795482: Running pod stuck in ContainerCreating state after node/hypervisor reboot or any other operation

    If the wait_for_security_policy_sync flag is true, a pod can go to ContainerCreating state after being in running state for more than one hour because of a worker node hard reboot, hypervisor reboot, or some other reason. The pod will remain in the creating state indefinitely.

    Workaround: Delete and recreate the pod.

  • Issue 2860091: DNS traffic fails if baseline_policy_type is set to allow_namespace

    In an OpenShift or Kubernetes environment, if baseline_policy_type is set to allow_namespace, it will block pods (hostNetwork: False) in other namespaces from accessing the DNS service.

    Workaround: Add a network policy rule to allow traffic from other pods to the DNS pods.
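
    One possible shape for such a policy is sketched below as a Python dictionary printed as JSON. It assumes that the DNS pods run in the kube-system namespace with the label k8s-app=kube-dns and listen on port 53. OpenShift and other distributions use different namespaces, labels, and ports, so adjust these values before use. The policy name is hypothetical.

```python
import json

# Assumptions: DNS pods run in kube-system with the label k8s-app=kube-dns and
# listen on port 53; other distributions differ.
dns_policy = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {
        "name": "allow-dns-from-all-namespaces",  # hypothetical name
        "namespace": "kube-system",
    },
    "spec": {
        "podSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
        "policyTypes": ["Ingress"],
        "ingress": [
            {
                # An empty namespaceSelector selects pods in every namespace.
                "from": [{"namespaceSelector": {}}],
                "ports": [
                    {"protocol": "UDP", "port": 53},
                    {"protocol": "TCP", "port": 53},
                ],
            }
        ],
    },
}

print(json.dumps(dns_policy, indent=2))  # kubectl apply -f also accepts JSON
```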

  • Issue 2841030: With Kubernetes 1.22, the status of nsx-node-agent is always 'AppArmor'

    With Kubernetes 1.22, when the nsx-node-agent pods are "Ready", their status is not updated from "AppArmor" to "Running". This does not impact the functionality of NCP or nsx-node-agent.

    Workaround: Restart the nsx-node-agent pods.

  • Issue 2824129: A node has the status network-unavailable equal to true for more than 3 minutes after a restart

    If you use NCP operator to manage NCP's lifecycle, when an nsx-node-agent daemonset recovers from a non-running state, its node will have the status network-unavailable equal to true until it has been running for 3 minutes. This is expected behavior.

    Workaround: Wait for at least 3 minutes after nsx-node-agent restarts.

  • Issue 2867361: nsx-node-agent and hyperbus alarms not removed after NCP cleanup

    If nsx-node-agent and hyperbus alarms appear for some reason (such as stopping all NSX node agents), and you stop NCP and run the cleanup script, the alarms will remain after the cleanup.

    Workaround: None

  • Issue 2869247: On Ubuntu 18.04 host OS, nsx-ovs container keeps restarting

    The nsx-ovs container keeps restarting because its liveness probe keeps failing. The file /var/log/nsx-ujo/openvswitch/ovs-vswitchd.log contains entries indicating repeated restarts. For example:

    2021-10-28T17:38:49.364Z|00004|backtrace(monitor)|WARN|Backtrace using libunwind not supported. 
    2021-10-28T17:38:49.364Z|00005|daemon_unix(monitor)|WARN|2 crashes: pid 282 died, killed (Aborted), core dumped, waiting until 10 seconds since last restart 
    2021-10-28T17:38:57.364Z|00006|daemon_unix(monitor)|ERR|2 crashes: pid 282 died, killed (Aborted), core dumped, restarting 

    This issue is caused by an incompatibility between the installed upstream OVS user space packages in the nsx-ovs container and the OVS kernel module on the host.

    Workaround: Upgrade the host OS to Ubuntu 20.04 or perform the steps below to use the OVS kernel modules provided by NSX:

    1. Set use_nsx_ovs_kernel_module to True in nsx-node-agent's config map.
    2. Uncomment the volume mounts in nsx-ncp-bootstrap DaemonSet (search for "Uncomment these mounts if installing NSX-OVS kernel module" and "Uncomment these volumes if installing NSX-OVS kernel module") in the ncp-ubuntu*.yaml file.
    3. Re-apply the ncp-ubuntu*.yaml file and restart the nsx-node-agent pods.

  • Issue 2867871: Access to clusterIP service from pods that the service is referencing fails if the Kubernetes node name of the pods is different from the host name

    NCP currently supports pod self-access to clusterIP service only when the Kubernetes node name is the same as the host name. This is because nsx-kube-proxy adds self-access flow only if the hostname is the same as the node name.

    Workaround: None

  • Issue 2868572: Open vSwitch (OVS) must be disabled on host VM before running NCP

    To deploy NCP on a host VM, you must first stop OVS-related processes and delete some files on the host using the following commands:

    1. sudo systemctl disable openvswitch-switch.service
    2. sudo systemctl stop openvswitch-switch.service
    3. rm -rf /var/run/openvswitch

    If you have already deployed NCP on a host VM, and OVS is not running correctly, perform the following steps to recover:

    1. Perform the above 3 steps.
    2. Delete nsx-node-agent pods on the nodes having the issue to restart the node agent pods with the command "kubectl delete pod $agent-pod -n nsx-system".

    Workaround: See above.

  • Issue 2832480: For a Kubernetes service of type ClusterIP, sessionAffinityConfig.clientIP.timeoutSeconds cannot exceed 65535

    For a Kubernetes service of type ClusterIP, if you set sessionAffinityConfig.clientIP.timeoutSeconds to a value greater than 65535, the actual value will be 65535.
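
    The capping behavior described above can be written as a one-line calculation. The sketch below applies it to a hypothetical ClusterIP service definition. The service name, selector, and ports are placeholders, and the helper only models the described ceiling, not NCP's implementation.

```python
MAX_TIMEOUT_SECONDS = 65535  # ceiling described in the issue above


def effective_timeout(service: dict) -> int:
    """Return the affinity timeout that would take effect for the service."""
    requested = service["spec"]["sessionAffinityConfig"]["clientIP"]["timeoutSeconds"]
    return min(requested, MAX_TIMEOUT_SECONDS)


# Hypothetical ClusterIP service; 86400 is the largest value Kubernetes accepts.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "demo-svc"},
    "spec": {
        "type": "ClusterIP",
        "selector": {"app": "demo"},
        "ports": [{"port": 80, "targetPort": 8080}],
        "sessionAffinity": "ClientIP",
        "sessionAffinityConfig": {"clientIP": {"timeoutSeconds": 86400}},
    },
}

print(effective_timeout(service))  # 65535
```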

    Workaround: None

  • Issue 2882699: In an IPv6 environment, setting baseline_policy_type to allow_namespace_strict causes communication failure

    In an IPv6 environment, with baseline_policy_type set to allow_namespace_strict, pods cannot access Kubernetes nodes.

    Workaround: Add a distributed firewall rule with a higher priority than the baseline rule to allow traffic from pods to Kubernetes nodes.

  • Issue 2939886: Migrating objects from Manager Mode to Policy Mode fails

    Migrating objects from Manager Mode to Policy Mode fails if, in the network policy specification, egress and ingress have the same selector.

    Workaround: None

  • Issue 3033821: After manager-to-policy migration, distributed firewall rules not enforced correctly

    After a manager-to-policy migration, newly created network policy-related distributed firewall (DFW) rules will have higher priority than the migrated DFW rules.

    Workaround: Use the policy API to change the sequence of DFW rules as needed.

  • Issue 3042916: nsx-kube-proxy fails after startup with the error "invalid or unknown port for in_port"

    On rare occasions, nsx-kube-proxy will fail shortly after startup because the OVS uplink "ofport" is empty at that time. The log has the error message "RuntimeError: Fatal error executing xxx: (): invalid or unknown port for in_port."

    Workaround: Restart nsx-kube-proxy.

  • Issue 3239352: In a TAS environment, when a Task cannot be allocated, retry may not work

    In an NCP TAS environment, when a Task cannot be allocated, the Auctioneer rejects the task and the BBS retries placement of the task up to the number of times specified by the setting task.max_retries. When task.max_retries is reached, the BBS updates the Task from the PENDING state to the COMPLETED state, marking it as Failed and including a FailureReason that explains that the cluster has no capacity for the task.

    During retry, the task may be scheduled to a new cell, which notifies NCP with a task_changed event. Because NCP does not handle the task_changed event, the task cannot be assigned a new port in the new cell. As a result, the task cannot run properly.

    Workaround: Disable the retry by setting the task.max_retries value to 0.
