Upgrading a Network Function with an infra_requirements change might cause workload services to go down. This is relevant to, but not limited to, the following cases:
  • Kernel version change
  • Adding or removing SRIOV devices
The reason is that the new node customization is executed on all nodes in parallel. If the node customization requires a node reboot, all nodes might reboot without granular control, which can bring down all pods running in the nodepool at the same time.

A new CNF upgrade granular control feature is added in this release to avoid pod disruption within a nodepool during a CNF upgrade.

To enable granular control for CNF upgrade, execute the following steps:

Procedure

  1. Enable PodDisruptionBudget in the Helm chart

    During the CNF upgrade, TCA drains nodes and applies the new node customization gradually. It is recommended to create a PodDisruptionBudget CR in the Helm chart to ensure that no more than the maxUnavailable number of pods are down at any time.

    An example PodDisruptionBudget is as follows:
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: zk-pdb
    spec:
      maxUnavailable: 1
      selector:
        matchLabels:
          app: zookeeper
    Note: Do not set maxUnavailable to 0 or minAvailable to 100%, as either setting blocks node drain and halts the CNF upgrade. Refer to https://kubernetes.io/docs/concepts/workloads/pods/disruptions/.
  2. Update the nodepool upgrade strategy
    1. Add the label telco.vmware.com/dip-update-strategy: pre-scale-rolling to the nodepool to enable this feature on it.
    2. Update the nodepool upgrade strategy and choose suitable maxSurge and maxUnavailable values.
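    Together, these two sub-steps might look like the following nodepool fragment. The label key and value come from this feature; the strategy field names below are illustrative and the exact nodepool schema depends on your TCA version:
    metadata:
      labels:
        # Enables pre-scale rolling customization on this nodepool
        telco.vmware.com/dip-update-strategy: pre-scale-rolling
    spec:
      strategy:            # illustrative field names; check your TCA nodepool schema
        rollingUpdate:
          maxSurge: 1        # extra nodes temporarily added during upgrade
          maxUnavailable: 1  # nodes that can be drained and customized at once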
  3. Scale out nodepool

    Edit the nodepool and increase replicas by the maxSurge number of nodes. This provides extra room for pods that are drained from a node being customized.
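    For example, assuming the nodepool originally has 3 replicas and maxSurge is 1, the scale-out edit would be (replicas field name illustrative):
    spec:
      replicas: 4  # original 3 nodes + maxSurge (1) temporary node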

  4. Upgrade Network Function
    Trigger the Network Function upgrade. During the NF upgrade, the nodes in the nodepool are customized granularly: each round picks up a set of at most batch_size nodes to customize, until all nodes are customized, where
    batch_size = maxSurge + maxUnavailable
    For example, with maxSurge of 1 and maxUnavailable of 1, at most 2 nodes are customized per round.
  5. Scale in nodepool

    Edit the nodepool and decrease replicas by the maxSurge number of nodes, returning the nodepool to its original size.
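
    For example, assuming the nodepool originally had 3 replicas and maxSurge was 1, the scale-in edit restores the original size (replicas field name illustrative):
    spec:
      replicas: 3  # back to the original nodepool size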