vSphere Container Storage Plug-in supports topology-aware volume provisioning with the WaitForFirstConsumer volume binding mode. In this mode, Kubernetes delays provisioning until a pod that uses the volume is scheduled, so the volume is created in a zone where the pod can run. In multi-zone clusters, this lets you deploy and scale stateful workloads across failure domains to provide high availability and fault tolerance.

Prerequisites

Enable topology in the native Kubernetes cluster in your vSphere environment. For more information, see Deploy the vSphere Container Storage Plug-in with Topology.

Procedure

  1. Create a StorageClass with the volumeBindingMode field set to WaitForFirstConsumer.
    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: topology-aware-standard
    provisioner: csi.vsphere.vmware.com
    volumeBindingMode: WaitForFirstConsumer
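
    If provisioning also needs to be limited to particular zones, a Kubernetes StorageClass can carry an allowedTopologies list. The following is a minimal sketch of that optional variant, not part of the original procedure. The name topology-aware-restricted is hypothetical, and the values zone-a and zone-b are assumed to match the zone labels on your nodes.

    ```yaml
    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: topology-aware-restricted   # hypothetical name, for illustration only
    provisioner: csi.vsphere.vmware.com
    volumeBindingMode: WaitForFirstConsumer
    allowedTopologies:
      - matchLabelExpressions:
          - key: topology.csi.vmware.com/k8s-zone
            values:                     # assumed zone labels; adjust to your cluster
              - zone-a
              - zone-b
    ```

    With WaitForFirstConsumer, this restriction is usually unnecessary because the scheduler already selects the zone from the pod's constraints.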
  2. Create an application that uses the StorageClass you created in the previous step.

    Instead of creating a volume immediately, the WaitForFirstConsumer setting instructs the volume provisioner to wait until a pod that uses the associated PVC has gone through scheduling. In contrast with the Immediate volume binding mode, with WaitForFirstConsumer the Kubernetes scheduler decides which failure domain to use for volume provisioning, based on the pod's scheduling policies, such as the node affinity and pod anti-affinity rules in the following example.

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: web
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: nginx
      serviceName: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: topology.csi.vmware.com/k8s-zone
                    operator: In
                    values:
                    - zone-a
                    - zone-b
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - nginx
                topologyKey: topology.csi.vmware.com/k8s-zone
          containers:
            - name: nginx
              image: gcr.io/google_containers/nginx-slim:0.8
              ports:
                - containerPort: 80
                  name: web
              volumeMounts:
                - name: www
                  mountPath: /usr/share/nginx/html
                - name: logs
                  mountPath: /logs
      volumeClaimTemplates:
        - metadata:
            name: www
          spec:
            accessModes: [ "ReadWriteOnce" ]
            storageClassName: topology-aware-standard
            resources:
              requests:
                storage: 2Gi
        - metadata:
            name: logs
          spec:
            accessModes: [ "ReadWriteOnce" ]
            storageClassName: topology-aware-standard
            resources:
              requests:
                storage: 1Gi
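
    The delayed binding described in this step can also be observed with a standalone claim. The following is a minimal sketch, assuming the topology-aware-standard StorageClass from step 1 exists. The names example-pvc and example-pod are hypothetical. The claim is expected to stay in the Pending state until the pod that references it is scheduled, at which point the volume is provisioned in a zone accessible to the selected node.

    ```yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-pvc            # hypothetical name
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: topology-aware-standard
      resources:
        requests:
          storage: 1Gi
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: example-pod            # hypothetical name
    spec:
      containers:
        - name: nginx
          image: gcr.io/google_containers/nginx-slim:0.8
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: example-pvc   # first consumer of the claim; scheduling this pod triggers provisioning
    ```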
  3. Verify that the StatefulSet is ready and that its pods are evenly distributed across zone-a and zone-b.
    $ kubectl get statefulset
    NAME   READY   AGE
    web    2/2     3m51s

    $ kubectl get pods -o wide
    NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE         NOMINATED NODE   READINESS GATES
    web-0                     1/1     Running   0          4m40s   10.244.3.21   k8s-node-2   <none>           <none>
    web-1                     1/1     Running   0          4m12s   10.244.4.25   k8s-node-1   <none>           <none>
     
    $ kubectl get node k8s-node-1 k8s-node-2 --show-labels
    NAME         STATUS   ROLES    AGE  VERSION   LABELS
    k8s-node-1   Ready    <none>   2d   v1.21.1   topology.csi.vmware.com/k8s-region=region-1,topology.csi.vmware.com/k8s-zone=zone-a
    k8s-node-2   Ready    <none>   2d   v1.21.1   topology.csi.vmware.com/k8s-region=region-1,topology.csi.vmware.com/k8s-zone=zone-b
    
    Notice that the node affinity rules of each PV include at least one domain, zone-a or zone-b, depending on whether the selected datastore is local to a zone or shared across zones.
    $ kubectl get pv -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.claimRef.name}{"\t"}{.spec.nodeAffinity}{"\n"}{end}'
     pvc-2253dc52-a9ed-11e9-b26e-005056a04307    www-web-0    map[required:map[nodeSelectorTerms:[map[matchExpressions:[map[key:topology.csi.vmware.com/k8s-region operator:In values:[region-1]] map[operator:In values:[zone-b] key:topology.csi.vmware.com/k8s-zone]]]]]]
     pvc-22575240-a9ed-11e9-b26e-005056a04307    logs-web-0    map[required:map[nodeSelectorTerms:[map[matchExpressions:[map[key:topology.csi.vmware.com/k8s-zone operator:In values:[zone-b]] map[key:topology.csi.vmware.com/k8s-region operator:In values:[region-1]]]]]]]
     pvc-3c963150-a9ed-11e9-b26e-005056a04307    www-web-1    map[required:map[nodeSelectorTerms:[map[matchExpressions:[map[key:topology.csi.vmware.com/k8s-zone operator:In values:[zone-a]] map[operator:In values:[region-1] key:topology.csi.vmware.com/k8s-region]]]]]]
     pvc-3c98978f-a9ed-11e9-b26e-005056a04307    logs-web-1    map[required:map[nodeSelectorTerms:[map[matchExpressions:[map[key:topology.csi.vmware.com/k8s-zone operator:In values:[zone-a]] map[key:topology.csi.vmware.com/k8s-region operator:In values:[region-1]]]]]]]
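
    The map[...] values in the output above are the Go representation of each PV's spec.nodeAffinity field. For readability, the rule reported for the www-web-0 volume corresponds to the following YAML fragment. This is a reconstruction from the output above, not additional command output, and all other PV fields are omitted.

    ```yaml
    # Fragment of the PV bound to www-web-0 (other PV fields omitted)
    spec:
      nodeAffinity:
        required:
          nodeSelectorTerms:
            - matchExpressions:
                - key: topology.csi.vmware.com/k8s-region
                  operator: In
                  values:
                    - region-1
                - key: topology.csi.vmware.com/k8s-zone
                  operator: In
                  values:
                    - zone-b
    ```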