VMware vSphere Container Storage Plug-in 3.x | 25 JUN 2024

Check for additions and updates to these release notes.

vSphere Container Storage Plug-in 3.3.0

What's New

  • Support for Kubernetes version 1.30.

  • Support for ReadWriteOnce (RWO) and ReadWriteMany (RWX) CSI volumes with vSAN Max deployments within the same vCenter. Requires vCenter 8.0 Update 3.

  • Support for ReadWriteMany (RWX) volumes in HCI Mesh with topology-aware environments within the same vCenter. Requires vCenter 8.0 Update 3.

  • Updates to Kubernetes cluster distribution map to record EKS and AKS clusters. See #2880.

  • Allow GetDevMounts to resolve symlinks found in the mount table before comparing them to RealDev. This helps resolve a CSI mount issue on RHEL8/RHEL9 when multipath is enabled. See #2593.

  • Removed x509sha1 support and removed support for overriding the tlsmaxrsasize value. See #2877.

  • Leverage nodeAffinity rather than nodeSelector to target Control Plane nodes in CSI driver deployment. See #2803 and #2644.

Deployment Files

Important:

Starting from version 3.2.0, the internal-feature-states.csi.vsphere.vmware.com configmap contains only alpha or internal features.

Version

File

Version 3.3.1

https://github.com/kubernetes-sigs/vsphere-csi-driver/tree/v3.3.1/manifests/vanilla

Version 3.3.0

https://github.com/kubernetes-sigs/vsphere-csi-driver/tree/v3.3.0/manifests/vanilla

Kubernetes Releases

Version

Kubernetes Releases

Version 3.3.1

  • Minimum: 1.27

  • Maximum: 1.30

Version 3.3.0

  • Minimum: 1.28

  • Maximum: 1.30
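The supported range above can be checked against a cluster before deployment. The following is a minimal sketch, not part of the plug-in: `k8s_minor_in_range` is a hypothetical helper name, and it assumes plain version strings such as `v1.29.4` or `1.30` with major version 1.

```shell
# Hypothetical helper: succeeds when the Kubernetes version in $1 lies within
# the inclusive minor-version range [$2, $3]. Assumes major version 1 and
# plain version strings (no distribution suffix directly after the minor).
k8s_minor_in_range() {
  minor=$(printf '%s' "${1#v}" | cut -d. -f2)
  min=$(printf '%s' "$2" | cut -d. -f2)
  max=$(printf '%s' "$3" | cut -d. -f2)
  [ "$minor" -ge "$min" ] && [ "$minor" -le "$max" ]
}

# Usage sketch with the range listed for version 3.3.0:
#   k8s_minor_in_range v1.29.4 1.28 1.30 && echo "supported"
```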

Supported Sidecar Container Versions

  • csi-provisioner: v4.0.1

  • csi-attacher: v4.5.1

  • csi-resizer: v1.10.1

  • livenessprobe: v2.12.0

  • csi-node-driver-registrar: v2.10.1

  • csi-snapshotter: v7.0.2

Resolved Issues

Version

Resolved Issues

Version 3.3.1

The 3.3.1 patch release includes the following fixes specifically targeted at reducing vCenter sessions created by the vSphere CSI driver:

  • While creating a new govmomi client, the vSphere CSI driver will clear idle sessions associated with the existing client. For details, see PR #2930.

  • Change the existing leader election package to the client-go package to override the OnStoppedLeading callback and clean up sessions. For details, see PR #2948.

  • Remove the listview-tasks feature gate to use a single govmomi client throughout a container. For details, see PR #2851.

Version 3.3.0

  • Changes to fix a volume mount issue on Windows worker nodes when a node reboots and comes back up. See PR #2868 for more details.

  • Fixed an issue where two pods using the same PVC are running on the same node. The issue occurred when the PVC was mounted with the ReadWrite mount permission on the first pod, and with the ReadOnly mount permission on the second pod. Under these circumstances, the mount point on the first pod also became ReadOnly. For more details, see PR #2861.

  • Fixed Thumbprint based authentication for multi vCenter deployment. For more details, see PR #2858 and GitHub issue #2823.

  • Fixed volume provisioning when duplicate region or zone tags exist. For more details, see PR #2814 and GitHub issue #2681.

  • Use lock per VolumeId during volume mount or unmount operations to avoid conflicts during multiple mount or unmount API calls. For details, see PR #2811.

  • Changes to handle expired session for ListView feature. Refer to PR #2788 for more details.

vSphere Container Storage Plug-in 3.2.0

What's New

  • Support for Kubernetes version 1.29.1.

  • vSAN stretched cluster support for RWX volumes.

  • Support for dynamic and static volume provisioning for RWX volumes in topology-aware environments and multi-vCenter environments.

  • Success and failure reporting for static volume provisioning requests as events on PersistentVolumes. See #2725 and #2765.

  • Metrics: Ability to collect the absolute number of requests made by the vSphere CSI driver per API. See #2520.

  • The FSS configmap (internal-feature-states.csi.vsphere.vmware.com) now lists only the alpha or internal features.

Deployment Files

Important:

Starting from version 3.2.0, the internal-feature-states.csi.vsphere.vmware.com configmap contains only alpha or internal features.

Version

File

Version 3.2.0

https://github.com/kubernetes-sigs/vsphere-csi-driver/tree/v3.2.0/manifests/vanilla

Kubernetes Releases

  • Minimum: 1.27

  • Maximum: 1.29

Supported Sidecar Container Versions

  • csi-provisioner: v4.0.0

  • csi-attacher: v4.5.0

  • csi-resizer: v1.10.0

  • livenessprobe: v2.12.0

  • csi-node-driver-registrar: v2.10.0

  • csi-snapshotter: v7.0.1

Resolved Issues

Version

Resolved Issues

Version 3.2.0

  • Extended the validity of the webhook container certificate to 180 days which reduces the need for frequent certificate renewals. See #2739.

  • Allow creation of custom server certificates with key size up to 16384. See #2777.

Known Issues

  • A Persistent Volume Claim (PVC) with ReadWriteMany access gets stuck in the Terminating state after PVC deletion if there is a corresponding Volume Snapshot (VS) associated with the PVC.

    CSI does not support VS for file volumes. It is only supported for block volumes. See Volume Snapshot and Restore Requirements. If you create a VS for file volumes and then perform a PVC deletion, the PVC remains in the Terminating state. This is due to the snapshot.storage.kubernetes.io/pvc-as-source-protection finalizer, which is added by the external-snapshotter.

    Workaround: Follow any one of the following methods:

    • Manually delete the VS for file volumes and then delete the PVC. This will ensure that the PVC does not enter the Terminating state.

    • Remove the snapshot.storage.kubernetes.io/pvc-as-source-protection finalizer from the PVC. This will clean up the volume.
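The second workaround can be sketched as a JSON merge patch that rewrites the PVC finalizer list without the snapshot protection entry. This is an illustration only: `build_finalizer_patch` is a hypothetical helper name, and the sketch assumes that any other finalizers on the PVC should be preserved.

```shell
# Finalizer named in the release note above.
PROTECTION_FINALIZER="snapshot.storage.kubernetes.io/pvc-as-source-protection"

# Hypothetical helper: reads finalizer names (one per line) on stdin and prints
# a JSON merge patch that keeps every finalizer except the one above.
build_finalizer_patch() {
  kept=""
  while IFS= read -r name; do
    [ -z "$name" ] && continue
    [ "$name" = "$PROTECTION_FINALIZER" ] && continue
    kept="${kept:+$kept,}\"$name\""
  done
  printf '{"metadata":{"finalizers":[%s]}}\n' "$kept"
}

# Usage sketch (placeholders in angle brackets):
#   patch=$(kubectl get pvc <pvc-name> -n <namespace> \
#     -o jsonpath='{range .metadata.finalizers[*]}{@}{"\n"}{end}' | build_finalizer_patch)
#   kubectl patch pvc <pvc-name> -n <namespace> --type=merge -p "$patch"
```

A merge patch replaces the whole finalizer array, so the helper rebuilds the list rather than removing a single index.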

  • Migration of in-tree vSphere volumes to CSI does not work with Kubernetes version 1.29.0.

    Migration of in-tree vSphere volumes to CSI does not work with Kubernetes version 1.29.0. See #122340.

    Workaround: Use Kubernetes version 1.29.1 or later to migrate in-tree vSphere volumes to CSI.

  • Deployment pods remain stuck in the Terminating or Pending state

    The Linux file system on the node VM becomes read-only following an unexpected ESXi host power off or network interface card disconnect and subsequent recovery. As a result, new pods cannot start on that node, and the existing pods on that node fail to write new data to the volume.

    Workaround: Reboot the node VM.

vSphere Container Storage Plug-in 3.1.0

What's New

Version

What's New

Version 3.1.2

  • Includes the fix for the vSphere CSI Driver CrashloopBackOff issue in TKGm deployments with thumbprint in vsphere-config-secret.

  • Validate vCenter user name and disallow user name without a domain name.

  • Update Kubernetes libraries to version 1.26.8, which includes fixes for [CVE-2023-3955] and [CVE-2023-3676].

Version 3.1.1

  • Fixed the issue that occurs during new volume creation when a deleted node is added back to the cache.

  • Fixed the issue where the CSI Driver issued continuous attach volume tasks for migrated in-tree volumes even when volumes are attached to a node.

  • Handle uninitialized volumeMigrationService on multi-vCenter deployments.

Version 3.1.0

This minor release fixes issues observed in the 3.0.2 release and includes these changes:

  • Support for Kubernetes version 1.28.

  • HCI mesh support for CNS block volumes.

  • Added resize capability for migrated in-tree vSphere volumes.

  • Enhanced vCenter Server session management to ensure that a minimum number of sessions is kept open from vSphere Container Storage Plug-in to vCenter Server.

  • Reduced the number of vpxd connections created by vSphere Container Storage Plug-in on vCenter Server, enhancing the vCenter Server task monitoring workflow.

Deployment Files

Important:

To ensure proper functionality, do not update the internal-feature-states.csi.vsphere.vmware.com configmap available in the deployment YAML file. VMware does not recommend activating or deactivating features in this configmap.

Version

File

Version 3.1.2

https://github.com/kubernetes-sigs/vsphere-csi-driver/tree/v3.1.2/manifests/vanilla

Version 3.1.1

https://github.com/kubernetes-sigs/vsphere-csi-driver/tree/v3.1.1/manifests/vanilla

Version 3.1.0

https://github.com/kubernetes-sigs/vsphere-csi-driver/tree/v3.1.0/manifests/vanilla

Kubernetes Releases

  • Minimum: 1.26

  • Maximum: 1.28

Supported Sidecar Container Versions

  • csi-provisioner: v3.5.0

  • csi-attacher: v4.3.0

  • csi-resizer: v1.8.0

  • livenessprobe: v2.10.0

  • csi-node-driver-registrar: v2.8.0

  • csi-snapshotter: v6.2.2

Resolved Issues

Version

Resolved Issues

Version 3.1.0

  • Fixed the logic of selecting a datastore in a topology aware setup to ensure local host datastores are not selected for volume provisioning when topology is defined at a cluster or a higher level entity.

  • Prevent deletion of critical migration CRs when volumes are temporarily lost from vCenter Server.

  • Extended the validity of the webhook container certificate to 180 days which reduces the need for frequent certificate renewals.

Known Issues

  • The health status for container file volumes may remain in the Unknown state

    In the vCenter UI, the health status for container file volumes may remain in the Unknown state for up to an hour.

    Workaround: Click Retest in the vCenter UI to view the latest health check results.

  • When a worker node is rebooted or crashes, the pod running on the node goes to the Unknown state

    When a worker node is shut down or rebooted, unmounting and removing the staging target path directory may fail for various reasons, which leaves the staging directory in place on that node. When the pod is rescheduled on the same node and kubelet tries to create the directory again, the operation fails.

    The following error is displayed in the corresponding pod description for both the shutdown and reboot scenarios:

    Warning  FailedMount  9m11s (x2131 over 3d)  kubelet  MountVolume.MountDevice failed for volume "pvc-X-X-X-X-X" : kubernetes.io/csi: attacher.MountDevice failed to create dir "\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\csi.vsphere.vmware.com\\XXXXX\\globalmount":  mkdir \var\lib\kubelet\plugins\kubernetes.io\csi\csi.vsphere.vmware.com\XXXXX\globalmount: Cannot create a file when that file already exists.
    Warning  FailedMount  3m42s (x1495 over 3d)  kubelet  Unable to attach or mount volumes: unmounted volumes=[<abc>], unattached volumes=[<abc> kube-api-access-8dqhl]: timed out waiting for the condition

    If you manually delete the pod and reschedule it on the same node, it does not enter the "Running" state and remains in the "ContainerCreating" state indefinitely with the same error.

    Workaround:

    1. Delete the pod in the unknown state and reschedule it on another node.

    2. Clean up the staging target directory by deleting it manually from the affected Windows worker node with the following command:

      Remove-Item -Force <staging-target-path>

  • Volume operations are not added to ListView due to the session not being authenticated

    The issue can be observed by users after a combination of a credential rotation event and the client session being inactive. In the event of a rotation, CSI does not immediately shift to using the new credentials and instead switches when the current session expires. However, a couple of control flows can continuously use the older session even in case of session expiry, which can lead to this issue.

    Workaround: Restart the CSI driver pod where this issue occurs.

  • In vCenter 7.0 Update 3 and earlier versions, the CNS UI does not display volume placement details for CNS volumes on remote vSAN datastores. Additionally, the View Virtual Objects option does not function as intended.

    In an HCI Mesh deployment model with CNS container volumes, the physical placement of volumes backed by the server cluster is not shown, and navigation from a CNS volume to the vSAN virtual objects view does not work. Physical placement and navigation work for volumes backed by the client cluster. This issue is observed only in vCenter 7.0 Update 3 and earlier versions.

    Workaround:

    Copy the backingObjectId of a volume from the CNS UI and use it to filter by UUID in vSAN virtual objects view UI.

  • After deleting a node VM and creating a new node VM with the same name, volume operations such as PVC creation fail with the error message The object 'vim.VirtualMachine:<vm-moref>' has already been deleted or has not been completely created. This happens when a node is recreated with the same name and the node manager cache in vSphere Container Storage Plug-in is not refreshed.

    After recreating the node VM, restart the vSphere Container Storage Plug-in controller deployment. This ensures that the node manager cache is refreshed in vSphere Container Storage Plug-in.

    Workaround: Run the following command:

    kubectl rollout restart deployment vsphere-csi-controller -n vmware-system-csi
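The restart can be followed by a wait for the rollout to finish, so that later volume operations do not race the new controller pods. A minimal sketch, assuming the deployment name and namespace shown above; `restart_csi_controller` is a hypothetical wrapper and the 300s default timeout is an arbitrary choice.

```shell
# Restart the controller deployment (command from the release note), then wait
# until the rollout completes. The default timeout is an assumption.
restart_csi_controller() {
  timeout="${1:-300s}"
  kubectl rollout restart deployment vsphere-csi-controller -n vmware-system-csi &&
    kubectl rollout status deployment vsphere-csi-controller -n vmware-system-csi --timeout="$timeout"
}

# Usage sketch:
#   restart_csi_controller 120s
```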

  • When vCenter Server goes down, CSI snapshot creation tasks fail with an error message

    When vCenter Server goes down, any snapshot creation tasks that are in progress are interrupted. vSphere Container Storage Plug-in cannot get an update on the status of these tasks. As a result, it will continue to retry the operations at Kubernetes level. This will cause the snapshots to remain in a not ready state.

    You can see the following error message: failed to create snapshot on volume <volume-ID>: Failed to get taskInfo for CreateSnapshots task from vCenter <VC-IP> with err:

    Once vCenter Server becomes accessible, delete and recreate the affected snapshots that are still in not ready state. Make sure to set ReadyToUse field to false.

  • Volume attachment fails after recovering from Kubernetes infrastructure failures

    vSphere HA attempts to migrate worker node virtual machines from one host to another when it recovers from infrastructure failures like host reboots and downtimes. Once the virtual machine is successfully moved, if you attempt to power it on, Unable to write VMX File error message appears.

    This error message occurs because the virtual machine's configuration cannot locate the volume associated with the specified backing disk ID.

    As a result, vSphere Container Storage Plug-in receives a VM NotFound error. When the pod using this volume is rescheduled to another node, CSI reports a successful detach volume response. However, the volume detachment process is yet to be completed at the back end, and the disk remains attached to the previous node's virtual machine. Because of this, the rescheduled pod encounters a VolumeAttachment failure with the error The resource volume is in use.

    Workaround: Manually detach the volume from the vSphere Client using the following steps:

    1. Obtain the PersistentVolume name from the Kubernetes cluster which contains the VolumeAttachment failure due to ResourceInUse and Unable to write VMX file error.

      kubectl get pods -n <namespace> | grep -iv Running
      Output: <pod-name>

      kubectl describe pod <pod-name> -n <namespace> | grep ResourceInUse
      Output: <ResourceInUseError>

      kubectl describe pod <pod-name> -n <namespace> | grep ClaimName
      Output: <claim-name>
      
      kubectl get pv | grep <claim-name>
      Output: <pv-name>
      
      kubectl describe pv <pv-name> | grep VolumeHandle
      Output: <volume-handle>
      
      kubectl logs <vsphere-csi-controller-pod-name> -c vsphere-csi-controller -n vmware-system-csi | grep <volume-handle> | grep "Unable to write VMX file"
      Output: {"level":"error","time":"<time-stamp>","caller":"volume/manager.go:<num>","msg":"failed to detach cns volume: \"<volume-handle>\" from node vm: VirtualMachine:vm-<moref>...Unable to write VMX file....
    2. Obtain the VM name and the VolumePath from the ContainerVolumes section of vSphere Client.

      1. In the vSphere Client Inventory section right click the virtual machine obtained from the step 1, and click Edit Settings.

      2. Under Virtual Hardware, select the hard disk to remove. The Disk File should match the VolumePath obtained from the previous step.

      3. To remove a disk, click the ellipsis icon that appears on the right.

      For more information, see Monitor Container Volumes Across Kubernetes Clusters.
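The lookup chain in step 1 of the workaround above (claim name, then PV, then volume handle) can be condensed with JSONPath queries instead of grep. This is a sketch under assumptions: `volume_handle_for_pvc` is a hypothetical helper, and it assumes a CSI-provisioned PV where the handle is stored in `spec.csi.volumeHandle`.

```shell
# Hypothetical helper: print the volume handle behind a PVC.
# $1 = namespace, $2 = PVC name.
# Assumes a CSI-provisioned PV; fails when the PVC is not bound.
volume_handle_for_pvc() {
  pv=$(kubectl get pvc "$2" -n "$1" -o jsonpath='{.spec.volumeName}') || return 1
  [ -n "$pv" ] || return 1
  kubectl get pv "$pv" -o jsonpath='{.spec.csi.volumeHandle}'
}

# Usage sketch:
#   handle=$(volume_handle_for_pvc <namespace> <claim-name>)
#   kubectl logs <vsphere-csi-controller-pod-name> -c vsphere-csi-controller \
#     -n vmware-system-csi | grep "$handle"
```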

vSphere Container Storage Plug-in 3.0.0

What's New

Version

What's New

Version 3.0.3

  • Telemetry enhancements. Determine cluster distribution server version and type using the Kubernetes API server version.

  • Fixed the issue that occurs during new volume creation when a deleted node is added back to the cache.

  • Fixed the issue where the CSI Driver issued continuous attach volume tasks for migrated in-tree volumes even when volumes are attached to a node.

  • Handle uninitialized volumeMigrationService on multi-vCenter deployments.

Version 3.0.2

This patch release fixes issues observed in the 3.0.0 and 3.0.1 releases and includes these changes:

  • Allow disabling of the useCSINodeID feature. A pod fails to come up when the useCSINodeID feature is disabled. See 2373 for details.

  • Fixed a segmentation problem observed when enabling the alpha feature pv-to-backingdiskobjectid-mapping. See 2370 for details.

  • Allow NTFS fsType to be uppercase. See 2305 for details.

  • Allow enabling topology if it has not been configured initially during the deployment of the vSphere Container Storage Plug-in. See 2412 for details.

  • Fixed the List-volume feature to honor a vCenter Server without any nodes in the vSphere Config Secret. See 2393 for details.

  • Fixed authorization service to permit adding new vCenter Server without any datastores or nodes while working on extending topology setup to another vCenter Server. See 2443 for details.

  • Fixed pushing volume metadata to vCenter Server for migrated in-tree vSphere volume. See 2454 for details.

Version 3.0.1

  • Fixed a bug in the CSI full sync that caused the Delete PVC operation to fail due to the deletion of the CNSVolumeInfo CR. For more information, see 2327 in GitHub.

  • Resolved the panic observed in the syncer container during the full sync process. This behavior occurred when stale CNS volumes were deleted while the corresponding PV was not present in the Kubernetes cluster. For more details, see 2347 in GitHub.

Version 3.0.0

Deployment Files

Important:

To ensure proper functionality, do not update the internal-feature-states.csi.vsphere.vmware.com configmap available in the deployment YAML file. VMware does not recommend activating or deactivating features in this configmap.

Version

File

Version 3.0.3

https://github.com/kubernetes-sigs/vsphere-csi-driver/tree/v3.0.3/manifests/vanilla

Version 3.0.2

https://github.com/kubernetes-sigs/vsphere-csi-driver/tree/v3.0.2/manifests/vanilla

Version 3.0.1

https://github.com/kubernetes-sigs/vsphere-csi-driver/tree/v3.0.1/manifests/vanilla

Version 3.0.0

https://github.com/kubernetes-sigs/vsphere-csi-driver/tree/v3.0.0/manifests/vanilla

Kubernetes Releases

  • Minimum: 1.24

  • Maximum: 1.27

Supported Sidecar Container Versions

  • csi-provisioner: v3.4.0

  • csi-attacher: v4.2.0

  • csi-resizer: v1.7.0

  • livenessprobe: v2.9.0

  • csi-node-driver-registrar: v2.7.0

  • csi-snapshotter: v6.2.1

Resolved Issues

Version

Resolved Issues

Version 3.0.0

  • Volume provisioning using the datastoreURL parameter in StorageClass does not work correctly when this datastoreURL points to a shared datastore mounted across datacenters. See #2187 in GitHub.

  • Mounting a PVC on a Windows node fails with Size Not Supported error. See #2080 in GitHub.

  • CreateVolume request fails with an error after you change the hostname or IP of vCenter Server in the vsphere-config-secret. See #2221 in GitHub.

Known Issues

  • The health status for container file volumes may remain in the Unknown state

    In the vCenter UI, the health status for container file volumes may remain in the Unknown state for up to an hour.

    Workaround: Click Retest in the vCenter UI to view the latest health check results.

  • Attempts to run a PV and a PV created out of a snapshot of the original PV on the same Windows node might fail

    This problem occurs when a pod with a persistent volume (PV) is running on a node and you try to schedule a new pod. If the new pod uses a PV created from a snapshot of the original PV and is scheduled to run on the same node, the pod might remain in a pending state.

    The problem does not occur on Linux nodes.

    Workaround: If you must run two pods, one using a PV and another a PV created from a snapshot of the original PV, schedule the pods on different Windows nodes. You can use the node selector in the pod specification.

  • When Cloud Native Storage Manager is used with vSphere Container Storage Plug-in, automatic generation of cluster IDs is not possible

    Automatic generation of cluster IDs is not compatible with an environment that uses Cloud Native Storage Manager with vSphere Container Storage Plug-in 3.0.0.

    Workaround: Manually specify the cluster ID in the vSphere configuration secret during deployment.

  • Attempts to create a pod with XFS file system might fail

    When you try to create a pod with XFS file system using vSphere Container Storage Plug-in on CentOS 7 and Red Hat Enterprise 7 nodes, the pod remains in the pending state. The following error message appears.

    output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3303985d-0d2e-4c3d-87ab-7a25a29ad0ff/globalmount: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.

    Workaround: None.

  • Migration of a pod using in-tree vSphere volumes occasionally gets stuck in the ContainerCreating state with the error message failed to set keepAfterDeleteVm control flag for VolumeID

    This problem is rare and might occur when the pod is backed by deployments or statefulsets using in-tree vSphere volumes.

    Workaround:

    1. Uncordon the drained node from which you attempted to evict the pod.

    2. Cordon all other nodes.

    3. Delete the problematic pod.

      The pod will be up and running on the node to which the volumes are attached.

    4. Uncordon other nodes.

    5. Run the following command on the node to verify that the volumes are getting detached.

      kubectl drain <node-name> --ignore-daemonsets

    If, for some reason, the in-tree vSphere volume fails to detach from the node VM, you can manually detach the volume.

    1. In the vSphere Client, right-click the node virtual machine, and click Edit Settings.

    2. Under Virtual Hardware, select the hard disk to remove.

    3. Remove the disk by clicking the X icon that appears on the right.

    4. Click OK.

    vSphere 7.0 p07 and vSphere 8.0 Update 1 provide a fix that enables you to set the control flag on the volume even when it is attached to the node VM. To resolve this issue permanently, upgrade to vSphere 7.0 p07 or vSphere 8.0 Update 1.

  • After a vSphere upgrade, vSphere Container Storage Plug-in might not pick up new vSphere features

    After a vSphere upgrade is performed in the background, the vSphere Container Storage Plug-in controller deployment needs to be restarted. This action is required to make vSphere Container Storage Plug-in pick up the new features compatible with the vSphere version.

    Workaround:

    Run the following command: kubectl rollout restart deployment vsphere-csi-controller -n vmware-system-csi

  • Under certain conditions, you might be able to provision more than three snapshots per volume despite default limitations

    By default, vSphere Container Storage Plug-in allows a maximum of three snapshots per volume. This limitation is applicable only when snapshot requests are at different time intervals. If you send multiple and parallel requests to create a snapshot for the volume at the same time, the system allows you to provision more than three snapshots per volume. Although no volume or snapshot operations are impacted, exceeding the maximum number of snapshots per volume is not recommended.

    Workaround: Avoid creating more than three snapshots on a single volume.
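Because the limit is bypassed only by parallel requests, one way to follow the workaround is to issue snapshot requests one at a time and check the current count first. The sketch below is illustrative: the helper names are hypothetical, and it counts Kubernetes VolumeSnapshot objects in one namespace, which is only an approximation of the per-volume count that the plug-in enforces.

```shell
# Default limit stated in the release note above.
MAX_SNAPSHOTS_PER_VOLUME=3

# Hypothetical helper: count VolumeSnapshot objects whose source is the PVC.
# $1 = namespace, $2 = PVC name.
snapshot_count_for_pvc() {
  names=$(kubectl get volumesnapshot -n "$1" \
    -o jsonpath="{.items[?(@.spec.source.persistentVolumeClaimName==\"$2\")].metadata.name}") || return 1
  # Word-split the space-separated names and count them.
  set -- $names
  echo "$#"
}

# Succeeds only while the PVC is below the limit.
can_create_snapshot() {
  count=$(snapshot_count_for_pvc "$1" "$2") || return 1
  [ "$count" -lt "$MAX_SNAPSHOTS_PER_VOLUME" ]
}

# Usage sketch:
#   can_create_snapshot <namespace> <pvc-name> && kubectl apply -f <snapshot.yaml>
```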

  • When a site failure occurs, pods that were running on the worker nodes in that site remain in Terminating state

    When a site failure causes all Kubernetes nodes and ESXi hosts in the cluster on that site to fail, the pods that were running on the worker nodes in that site will be stuck in Terminating state.

    Workaround: Start some of the ESXi hosts in the site as soon as possible, so that vSphere HA can restart the failed Kubernetes nodes. This action ensures that the replacement pods begin to come up.

  • After a recovery from network partition or host failure, some nodes in the cluster do not have INTERNAL-IP or EXTERNAL-IP

    After a recovery from a network partition or host failure, CPI is unable to assign INTERNAL-IP or EXTERNAL-IP to the node when it is added back to the cluster.

    Workaround:

    1. De-register the affected node.

      # kubectl delete node node-name

    2. Re-register the affected node by restarting kubelet service within the affected node.

      # systemctl restart kubelet

    3. Wait for node to register with the cluster.

    4. Taint the affected nodes.

      # kubectl taint node node-name node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule

    5. Wait for CPI to initialize the node. Make sure ProviderID is set and IP address is present for the node.
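The check in step 5 can be scripted with JSONPath queries against the node object. A minimal sketch, assuming that an InternalIP address is the one of interest; `node_initialized` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds once the node has a providerID and an
# InternalIP address. $1 = node name.
node_initialized() {
  provider=$(kubectl get node "$1" -o jsonpath='{.spec.providerID}') || return 1
  ip=$(kubectl get node "$1" \
    -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}') || return 1
  [ -n "$provider" ] && [ -n "$ip" ]
}

# Usage sketch: poll until CPI has initialized the node.
#   until node_initialized <node-name>; do sleep 10; done
```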

  • After recovering from a network partition or host failure, the control plane node becomes a worker node

    During network partition or host failure, the CPI might delete the node from the Kubernetes cluster if a VM is not found. After the recovery, the control plane node might become the worker node. Because of this, the pods tend to get scheduled on it unexpectedly.

    You can fix this issue in two ways:

    Workaround 1

    1. Taint and add labels to the affected nodes.

      # kubectl taint node <node name> node-role.kubernetes.io/control-plane:NoSchedule

    2. Delete the node from the cluster.

      # kubectl delete node <node name>

    3. Restart the kubelet service within the affected node.

      # systemctl restart kubelet

    4. Wait for the node to register with the cluster and add labels to the affected node.

      # kubectl label node <node name> node-role.kubernetes.io/control-plane=

    5. Delete the application pods which are already scheduled on the control plane node to get scheduled on new worker nodes.

      # kubectl delete pod <pod name>

    Workaround 2

    1. Add the environment variable with SKIP_NODE_DELETION=true.

      # kubectl set env daemonset vsphere-cloud-controller-manager -n kube-system SKIP_NODE_DELETION=true

    2. Verify whether the environment variable has been applied correctly.

      # kubectl describe daemonset vsphere-cloud-controller-manager -n kube-system

    3. Terminate the running pods. The next pod that you create will pull the environment variable.

      # kubectl delete pod <pod name>

    4. Wait for the pod to start.

    5. View logs with `kubectl logs [POD_NAME] -n kube-system`, and confirm if everything is healthy.

    Note: If you use the second method, it might result in leftover nodes and might introduce unexpected behaviors.

  • When a Kubernetes worker node shuts down non-gracefully, pods on that node remain in Terminating state

    Pods will not be rescheduled to other healthy worker nodes. As a result, the application might face a downtime or run in degraded mode. This depends on the number of replicas of the application present on the worker node that experiences non-graceful shut down.

    Workaround: Forcefully delete the pods that remain in terminating state.
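The workaround can be sketched as a small loop that finds pods reported as Terminating in a namespace and force deletes them. The helper names are hypothetical, and the sketch relies on the default `kubectl get pods` column layout, where STATUS is the third column. Force deletion skips graceful termination, so confirm first that the node is really down.

```shell
# Hypothetical helper: print the names of pods whose STATUS column reads
# Terminating. $1 = namespace.
terminating_pods() {
  kubectl get pods -n "$1" --no-headers | awk '$3 == "Terminating" { print $1 }'
}

# Force delete each of them. $1 = namespace.
force_delete_terminating() {
  for pod in $(terminating_pods "$1"); do
    kubectl delete pod "$pod" -n "$1" --force --grace-period=0
  done
}

# Usage sketch:
#   terminating_pods <namespace>          # review the list first
#   force_delete_terminating <namespace>
```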

  • After recovery from a network partition or host failure, pods might remain in containerCreating state

    During a network partition or host failure, CPI might delete the node from the Kubernetes cluster if a VM is not found. After recovery, the nodes might not be automatically added back to the cluster. This results in pods remaining in containerCreating state with the error message "Volume not attached according to node status for volume" or "".

    Workaround:

    If the issue occurs on a control plane node, perform the following steps.

    1. Restart kubelet service within the affected node.

      # systemctl restart kubelet

    2. Wait for the node to register with the cluster. Add labels and taints to the affected node.

      # kubectl taint node <node name> node-role.kubernetes.io/control-plane:NoSchedule

      # kubectl label node <node name> node-role.kubernetes.io/control-plane=

    If the issue affects a worker node, perform the following steps.

    1. Restart kubelet service within the affected node.

      # systemctl restart kubelet

    2. Taint the affected node(s).

      # kubectl taint node <node-name> node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule

  • Changes to Datacenter and Port entries in the vsphere-config-secret are not applied until you restart the vSphere Container Storage Plug-in pod

    After you make changes to the Datacenter and Port entries in the vsphere-config-secret, volume life cycle operations fail.

    Workaround: Restart the vsphere-csi-controller deployment pod.

  • Volume life cycle operations might be delayed during vSAN network partitioning

    You can observe some delays in Pod creation and Pod deletion during network partitioning on a vSAN cluster. After vSphere Container Storage Plug-in retries all failed operations, the operations succeed.

    This issue might occur because vCenter Server cannot reach the correct host during network partitioning. The volume fails to be created if the request reaches a host that cannot create the volume. However, during a Kubernetes retry, the volume can be created if it reaches the right host.

    Workaround: None.

  • When you perform various operations on a volume or a node VM, you might observe error messages that appear in vCenter Server

    vCenter Server might display the following error messages:

    • When attaching a volume: com.vmware.vc.InvalidController : "The device '0' is referring to a nonexisting controller '1,001'."

    • When detaching a volume: com.vmware.vc.NotFound : "The object or item referred to could not be found."

    • When resizing a volume: com.vmware.vc.InvalidArgument : "A specified parameter was not correct: spec.deviceChange.device"

    • When updating: com.vmware.vc.Timedout : "Operation timed out."

    • When reconfiguring a VM: com.vmware.vc.InsufficientMemoryResourcesFault : "The available Memory resources in the parent resource pool are insufficient for the operation."

    In addition, you can observe a few less frequent errors for the CSI migration feature specifically in 70u2.

    For update:

    • Cannot find the device '2,0xx', which is referenced in the edit or remove device operation.

    • A general system error occurred: Failed to lock the file: api = DiskLib_Open, _diskPath->CValue() = /vmfs/volumes/vsan:52c77e7d8115ccfa-3ec2df6cffce6713/782c2560-d5e7-0e1d-858a-ecf4bbdbf874/kubernetes-dynamic-pvc-f077b8cd-dbfb-4ba6-a9e8-d7d8c9f4c578.vmdk

    • The operation is not allowed in the current state.

    For reconfigure: Invalid configuration for device '0'.

    Workaround: Most of these errors are resolved after a retry from CSI.

  • A statefulset replica pod remains in terminating state after you delete the statefulset

    Typically, the problem occurs after you perform the following steps:

    1. Create volumes in the Kubernetes cluster using vSphere Cloud Provider (VCP).

    2. Enable the CSIMigration feature flags on kube-controller-manager, kubelet, and install vSphere Container Storage Plug-in.

    3. Enable the csi-migration feature state to migrate the volumes that you previously created using VCP.

    4. Create a statefulset using the migrated volumes and continue to use them in the replica set pods.

    5. When you no longer need the application pods to run in the Kubernetes cluster, perform the delete operation on the statefulset.

    This action might occasionally cause replica set pods to remain in terminating state.

    Workaround: Force delete the replica set pods in terminating state:

    kubectl delete pod replica-pod-name --force --grace-period=0
