This topic describes events in the lifecycle of a Kubernetes cluster deployed by VMware Tanzu Kubernetes Grid Integrated Edition that can cause temporary service interruptions.
An operator performs a stemcell version update or a Tanzu Kubernetes Grid Integrated Edition version update.
kubectl and the Kubernetes control plane experience a short downtime.
No action is required. If the update deploys successfully, the Kubernetes control plane recovers automatically.
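To confirm that the control plane has come back after an update, an operator could poll the Kubernetes API until it responds. The following Python sketch is illustrative only and is not part of Tanzu Kubernetes Grid Integrated Edition. The function names, the timeout values, and the choice of the `/readyz` endpoint are assumptions. It also assumes `kubectl` is on the PATH and already targets the cluster.

```python
import subprocess
import time
from typing import Callable


def kubectl_readyz() -> bool:
    """Return True if the API server answers its /readyz endpoint.

    Assumption: kubectl is installed and its current context points at the cluster.
    """
    try:
        result = subprocess.run(
            ["kubectl", "get", "--raw", "/readyz"],
            capture_output=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # kubectl is missing or the API server did not answer in time.
        return False
    return result.returncode == 0


def wait_for_control_plane(
    probe: Callable[[], bool] = kubectl_readyz,
    timeout_s: float = 600.0,   # hypothetical default, tune for your environment
    interval_s: float = 10.0,   # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll the probe until it succeeds or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_for_control_plane()` with the defaults blocks until `kubectl` can reach the API server or until ten minutes have passed. The probe, sleep, and clock are injectable so the loop can be exercised without a live cluster.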
A process, such as the scheduler or the Kubernetes API server, crashes on the cluster control plane VM.
No action is required. BOSH brings the process back automatically using monit. If the process resumes cleanly and without manual intervention, the Kubernetes control plane recovers automatically.
A process, such as Docker or kube-proxy, crashes on a cluster worker VM.
No action is required. BOSH brings the process back automatically using monit. If the process resumes cleanly and without manual intervention, the worker recovers automatically, and the scheduler resumes scheduling new pods on this worker.
A process, such as the TKGI API server, crashes on the pivotal-container-service VM.
No action is required. BOSH brings the process back automatically using monit. If the process resumes cleanly, the TKGI control plane recovers automatically and the TKGI CLI resumes working.
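For the process crashes described above, one way to check whether monit has brought every process back is to inspect the output of `monit summary` on the affected VM. The following Python sketch parses that output and reports any process that is not in the `running` state. It assumes the line format `Process '<name>' <status>`. The function name and the sample text are hypothetical and are not taken from a real deployment.

```python
import re
from typing import Dict

# Assumed line format from `monit summary`: Process '<name>'   <status>
_PROCESS_LINE = re.compile(r"^Process '([^']+)'\s+(.+?)\s*$")


def unhealthy_processes(summary: str) -> Dict[str, str]:
    """Map each monit-managed process that is not 'running' to its status."""
    unhealthy: Dict[str, str] = {}
    for line in summary.splitlines():
        match = _PROCESS_LINE.match(line.strip())
        # Skip header lines, blank lines, and non-process entries such as 'System'.
        if match and match.group(2) != "running":
            unhealthy[match.group(1)] = match.group(2)
    return unhealthy


# Illustrative sample only, not captured from a real VM.
SAMPLE = """The Monit daemon 5.2.5 uptime: 2h 3m

Process 'kube-apiserver'            running
Process 'kube-scheduler'            Does not exist
Process 'kube-controller-manager'   not monitored
System 'system_localhost'           running
"""

if __name__ == "__main__":
    for name, status in unhealthy_processes(SAMPLE).items():
        print(f"{name}: {status}")
```

An empty result suggests that monit considers every process healthy. A process that stays in another state may need the manual intervention that the sections above assume is not required.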
A Tanzu Kubernetes Grid Integrated Edition VM fails and goes offline because of either a virtualization problem or a host hardware problem.
If the BOSH Resurrector is enabled, BOSH detects the failure, recreates the VM, and reattaches the same persistent disk and IP address. Downtime depends on which VM goes offline, how quickly the BOSH Resurrector notices, and how long it takes the IaaS to create a replacement VM. The BOSH Resurrector usually notices an offline VM within one to two minutes. For more information about the BOSH Resurrector, see the BOSH documentation.
If the BOSH Resurrector is not enabled, some cloud providers, such as vSphere, have similar resurrection or high availability (HA) features. Depending on the VM, the impact can be similar to a key process on that VM going down, as described in the previous sections, but recovery takes longer because the replacement VM must be created first. For more information, see the sections on process failures in the cluster worker, cluster control plane, and TKGI API VMs.
When the VM comes back online, no further action is required for the developer to continue operations.
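As a rough worked example of the downtime factors above, the following Python sketch adds the Resurrector detection window of one to two minutes to an IaaS VM creation time. The function name and the creation time are hypothetical inputs. The result is only a lower-bound estimate, because it ignores the time the VM's processes need to start after the VM is recreated.

```python
from typing import Tuple


def estimate_downtime_minutes(
    iaas_create_minutes: float,
    detect_minutes: Tuple[float, float] = (1.0, 2.0),  # typical Resurrector detection window
) -> Tuple[float, float]:
    """Return a (low, high) downtime estimate in minutes.

    iaas_create_minutes is an assumption you supply for your own IaaS.
    """
    low, high = detect_minutes
    return (low + iaas_create_minutes, high + iaas_create_minutes)
```

For instance, if the IaaS typically needs five minutes to create a replacement VM, the estimate is six to seven minutes before the VM is back.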
An availability zone (AZ) goes offline entirely or loses connectivity to other AZs (net split).
The control plane and clusters are inaccessible. The extent of the downtime is unknown.
When the AZ comes back online, the control plane recovers in one of the following ways:
If BOSH is in a different AZ, BOSH recreates the VMs with the last known persistent disks and IPs. If the persistent disks are gone, the disks can be restored from your last backup and reattached. VMware recommends manually checking the state of VMs and databases.
If BOSH is in the same AZ, follow the directions for region failure.
An entire region fails, bringing all Tanzu Kubernetes Grid Integrated Edition components offline.
The entire Tanzu Kubernetes Grid Integrated Edition deployment and all services are unavailable. The extent of the downtime is unknown.
The TKGI control plane can be restored using BOSH Backup and Restore (BBR). Each cluster may need to be restored manually from backups.
For more information, see Restore Tanzu Kubernetes Grid Integrated Edition Control Plane in Restoring Tanzu Kubernetes Grid Integrated Edition.