Failover and Remediation Techniques

This section details the considerations for backup and fallback options.

Initial Consideration

There are two objectives that an end user should consider before establishing a backup plan for a particular workload:

Recovery Point Objective (RPO): the amount of data loss, if any, that can be acceptably sustained.
Recovery Time Objective (RTO): how much time, if any downtime is permitted, a given workload can acceptably take to return to working order.

Various methods, described in the following sections, can then be used to achieve the desired results. For example, if a real-time workload requires no data loss (RPO) and zero downtime (RTO), then an active-active application implemented on a cluster with redundant hardware is necessary. This includes storage with Failures to Tolerate (FTT) ≥ 1.
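
As a rough illustration of how the two objectives constrain a plan, the sketch below compares a candidate backup schedule against stated RPO and RTO targets. The function name, its parameters, and the simplification that worst-case data loss equals the interval between backups are assumptions made for this example; they are not part of any VMware tool.

```python
from datetime import timedelta


def meets_objectives(backup_interval: timedelta,
                     estimated_restore_time: timedelta,
                     rpo: timedelta,
                     rto: timedelta) -> bool:
    """Return True if a schedule satisfies both objectives.

    Assumption: the worst-case data loss equals the time between two
    consecutive backups (a failure just before the next backup runs).
    """
    worst_case_data_loss = backup_interval
    return worst_case_data_loss <= rpo and estimated_restore_time <= rto


# Hypothetical workload: nightly backups and a two-hour restore,
# checked against a 4-hour RPO and an 8-hour RTO.
nightly_ok = meets_objectives(
    backup_interval=timedelta(hours=24),
    estimated_restore_time=timedelta(hours=2),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=8),
)
```

With an RPO and RTO of zero, no periodic schedule passes this check, which is why the example above calls for an active-active design instead.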

VMware Recommendations (which also align with NERC CIP):

Document business and technical requirements and use VMware’s Compatibility Guide to help narrow down the list of potential backup solutions.
Conduct a proof of concept to validate that candidate solutions meet the requirements and function as desired.
If possible, maintain two copies of backup data: one on-site and another off-site. The on-site copy facilitates faster restores. The off-site copy is for disaster recovery.
Perform frequent test restores to validate that the backup solution is working properly.
A well-rounded business continuity strategy includes a data protection (backup and restore) solution and a disaster recovery plan.

Data Protection for VMs

Features that are native to VMware vSphere and vSAN can be combined with third-party solutions to build a comprehensive data protection strategy based on specific needs and technical requirements. Solutions are available for on-premises and public or private cloud-based deployments. For example, a third-party product can be used to back up and quickly restore individual VMs, while VMware vSphere Replication and Site Recovery provide recovery from site outages when needed.

vSphere Replication:

VMware vSphere Replication (VR) is a hypervisor-based, asynchronous replication solution for vSphere VMs. It is fully integrated with vCenter Server and vSAN. Replication is configured on a per VM basis, allowing precise control over which workloads are protected.

The vSphere HTML5 client is used to configure replication for one or more VMs. The target location for replication can be within the same vCenter Server environment or in another environment where vSphere Replication is deployed. The same vSphere Replication deployment can replicate some VMs to a local vCenter Server environment and other VMs to a remote vCenter Server or to VMware Site Recovery.

The vSphere Replication user interface provides information such as status, last synchronization duration and size, configured RPO, and which vSphere Replication server is receiving the replicated data. The components that transmit replicated data are built into vSphere. They provide the plug-in interfaces for configuring and managing replication, track the changes to VMDKs, automatically schedule replication to achieve the RPO for each protected VM, and transmit the changed data to one or more vSphere Replication virtual appliances.

Data is transmitted from the source vSphere host to either a vSphere Replication management server or vSphere Replication server and is written to storage at the target location. The replication stream can be encrypted. As data is being replicated, the changes are first written to a file called a redo log, which is separate from the base disk. After all changes for the current replication cycle have been received and written to the redo log, the data in the redo log is consolidated into the base disk. This process helps ensure the consistency of each base disk so VMs can be recovered at any time, even if replication is in progress or network connectivity is lost during transmission.
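
The staging behavior described above can be pictured with a toy model. This is only a sketch of the idea (stage incoming changes, consolidate once the cycle is complete); the class and method names are hypothetical and do not reflect how vSphere Replication is implemented internally.

```python
from typing import Dict


class ReplicaDisk:
    """Toy model of a replica: a base disk plus a staging redo log."""

    def __init__(self, base_blocks: Dict[int, bytes]):
        self.base = dict(base_blocks)
        self.redo_log: Dict[int, bytes] = {}

    def receive(self, block: int, data: bytes) -> None:
        # Incoming changes are staged; the base disk is not touched.
        self.redo_log[block] = data

    def complete_cycle(self) -> None:
        # Only after the whole cycle has arrived is the redo log
        # consolidated into the base disk.
        self.base.update(self.redo_log)
        self.redo_log.clear()

    def recover(self) -> Dict[int, bytes]:
        # Recovery uses the last fully consolidated state, so a
        # half-received cycle never leaks into the recovered VM.
        return dict(self.base)


replica = ReplicaDisk({0: b"a", 1: b"b"})
replica.receive(1, b"B")            # cycle in progress
mid_cycle_view = replica.recover()  # still the previous consistent state
replica.complete_cycle()
after_cycle_view = replica.recover()
```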

When configuring replication for a VM, an administrator has the option to enable the retention of multiple recovery points (point-in-time instances). This can be useful when an issue is discovered several hours, or even a few days, after it occurred. For example, a replicated VM with a 4-hour RPO contracts a virus, but the virus is not discovered until six hours after infection. As a result, the virus has been replicated to the target location. With multiple recovery points, the VM can be recovered and then reverted to a recovery point retained before the virus issue occurred.
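
Choosing which retained recovery point to revert to amounts to picking the newest point taken before the issue began. The following sketch shows that selection; the function name and the timestamps are made up for illustration.

```python
from datetime import datetime
from typing import List, Optional


def latest_clean_point(recovery_points: List[datetime],
                       infected_at: datetime) -> Optional[datetime]:
    """Return the newest recovery point taken before the infection, if any."""
    clean = [p for p in recovery_points if p < infected_at]
    return max(clean) if clean else None


# Hypothetical timeline: replication every 4 hours, three points retained,
# infection at 09:00 and discovery six hours later.
infected_at = datetime(2024, 1, 1, 9, 0)
points = [
    datetime(2024, 1, 1, 4, 0),
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 12, 0),  # taken after the infection
]
target = latest_clean_point(points, infected_at)
```

If only the most recent point had been retained, the function would return None, which corresponds to the case where the virus has already reached the only available replica.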

Backups with vSAN:

The process and mechanisms used to back up VMs on vSAN are nearly identical to those used for other vSphere datastore types. Backup solutions use the same APIs and snapshots to back up and restore VMs that reside on vSAN. Agents installed in the guest operating system work the same regardless of the underlying hardware (virtual or physical) and storage.

Snapshot as a Backup/Remediation Method:

VM backups are commonly done by creating a snapshot of the VM’s virtual disks to obtain a static image of the VM. The VM continues to process I/O with the snapshot in place. Writes to a virtual disk are redirected to a redo log. When the backup of the virtual disk is complete, the snapshot is removed and the changes captured in the redo log are consolidated into the virtual disk. This approach enables a quick and clean backup operation for very short-term needs.
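
A small model can make the write redirection concrete. The sketch below is an assumption-laden simplification (block-level dictionary, hypothetical class and method names), not the on-disk format used by vSphere.

```python
from typing import Dict


class SnapshottedDisk:
    """Toy virtual disk whose writes go to a redo log while a snapshot exists."""

    def __init__(self, blocks: Dict[int, bytes]):
        self.base = dict(blocks)
        self.redo_log: Dict[int, bytes] = {}
        self.snapshot_active = False

    def take_snapshot(self) -> None:
        self.snapshot_active = True

    def write(self, block: int, data: bytes) -> None:
        # With a snapshot in place, the base disk stays frozen.
        if self.snapshot_active:
            self.redo_log[block] = data
        else:
            self.base[block] = data

    def read(self, block: int) -> bytes:
        # The running VM always sees its latest write.
        if block in self.redo_log:
            return self.redo_log[block]
        return self.base[block]

    def backup_image(self) -> Dict[int, bytes]:
        # The backup tool reads the static image captured by the snapshot.
        return dict(self.base)

    def remove_snapshot(self) -> None:
        # Consolidate the redo log into the base disk.
        self.base.update(self.redo_log)
        self.redo_log.clear()
        self.snapshot_active = False


disk = SnapshottedDisk({0: b"old"})
disk.take_snapshot()
disk.write(0, b"new")
vm_view = disk.read(0)
backup_view = disk.backup_image()
disk.remove_snapshot()
```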

An incremental backup mechanism called Changed Block Tracking (CBT) is also provided. CBT tracks disk sectors that change between snapshots. A Change ID is set and incremented each time a snapshot is taken. This provides data protection vendors with the ability to perform full backups (all the data) and incremental backups (changed data since the last backup).
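
The relationship between Change IDs, full backups, and incremental backups can be sketched as follows. This is a conceptual model only: the class, its methods, and the block-number granularity are assumptions for illustration and do not mirror the actual CBT interface.

```python
from typing import Dict, Set


class ChangeTracker:
    """Toy tracker that remembers under which Change ID each block was last written."""

    def __init__(self) -> None:
        self.change_id = 0
        self._last_write: Dict[int, int] = {}  # block -> change ID at write time

    def write(self, block: int) -> None:
        self._last_write[block] = self.change_id

    def snapshot(self) -> int:
        # The Change ID is incremented each time a snapshot is taken.
        self.change_id += 1
        return self.change_id

    def changed_since(self, change_id: int) -> Set[int]:
        # Blocks written after the snapshot that returned `change_id`.
        return {b for b, cid in self._last_write.items() if cid >= change_id}


tracker = ChangeTracker()
tracker.write(1)
tracker.write(2)
first = tracker.snapshot()
full_backup = tracker.changed_since(0)       # everything written so far

tracker.write(2)
tracker.write(5)
second = tracker.snapshot()
incremental = tracker.changed_since(first)   # only blocks changed since `first`
```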

vSAN Sparse Snapshots:

VMware vSphere, using VM snapshots, provides the ability to capture the point-in-time state and data of a VM. This includes the VM’s storage, memory, and other devices such as virtual NICs. Snapshots are useful for capturing the state and data of a VM for backup or archival purposes and for creating test and rollback environments for applications. Snapshots can capture VMs that are powered on, powered off, or even suspended. When the VM is powered on, there is an option to capture its memory state and allow it to be reverted to a powered-on point in time.

Things to keep in mind while using snapshots:

Typically, snapshots are used temporarily for a point-in-time copy of a virtual disk to provide a quick rollback during a change window. Snapshots are also used by backup tools to allow point-in-time backups without interrupting the normal operation of the VM.
Snapshots never grow larger than the size of the original base disk. However, the size of the delta depends on the number of changes made since the snapshot was taken.
Proactively monitor the vSAN datastore capacity and read cache consumption on a regular basis when using snapshots intensively on vSAN.
VMware supports the full maximum chain length of 32 snapshots when vsanSparse snapshots are used.
Even with the improved snapshot capabilities with vSAN, the recommendation is to only have a few snapshots for a short duration.
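
The guidance above can be turned into a simple periodic check. In the sketch below, the limit of 32 comes from the text, while the default thresholds for "a few" snapshots and "a short duration" (3 snapshots, 72 hours) are assumptions chosen for illustration and should be adjusted to local policy. The function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import List

MAX_CHAIN_LENGTH = 32  # supported maximum stated above


def snapshot_warnings(created_times: List[datetime],
                      now: datetime,
                      max_count: int = 3,
                      max_age: timedelta = timedelta(hours=72)) -> List[str]:
    """Return human-readable warnings for one VM's snapshot chain."""
    warnings = []
    count = len(created_times)
    if count > MAX_CHAIN_LENGTH:
        warnings.append(f"{count} snapshots exceed the supported chain length")
    elif count > max_count:
        warnings.append(f"{count} snapshots exceed the recommended {max_count}")
    old = [t for t in created_times if now - t > max_age]
    if old:
        warnings.append(f"{len(old)} snapshot(s) are older than {max_age}")
    return warnings


now = datetime(2024, 1, 10)
healthy = snapshot_warnings([datetime(2024, 1, 9, 12)], now)
stale = snapshot_warnings([datetime(2024, 1, 1)] * 5, now)
```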

Data Protection for Modern Apps

Example Solution for Backup and Restore for Kubernetes:

Velero (formerly Heptio Ark) provides tools to back up and restore Kubernetes cluster resources and persistent volumes. It can be used to:

Make backups of clusters and restore them in case of loss.
Migrate cluster resources to other clusters.
Replicate production clusters for development and testing.

Velero consists of:

A server that runs on the cluster
A command-line client that runs locally

Velero can run in clusters on a cloud provider or on-premises. For detailed information, see Compatible Storage Providers.

Each Velero operation -- on-demand backup, scheduled backup, restore -- is a custom resource, defined with a Kubernetes Custom Resource Definition (CRD) and stored in the key value store (etcd). Velero also includes controllers that process the custom resources to perform backups, restores, and all related operations. Every object in the cluster can be backed up and restored within a cluster, or objects can be filtered by type, namespace, and label.
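
To show what such a custom resource looks like, the sketch below builds a Backup object filtered by namespace and label and prints it as JSON. The helper function, the backup name, the namespace "shop", and the label "app=checkout" are hypothetical; the "velero" namespace and the 720-hour TTL are assumed defaults of a typical installation.

```python
import json
from typing import Dict, List, Optional


def velero_backup_manifest(name: str,
                           namespaces: List[str],
                           match_labels: Optional[Dict[str, str]] = None,
                           ttl: str = "720h0m0s") -> dict:
    """Build a Velero Backup custom resource as a plain dictionary."""
    spec = {"includedNamespaces": list(namespaces), "ttl": ttl}
    if match_labels:
        # Restrict the backup to objects carrying these labels.
        spec["labelSelector"] = {"matchLabels": dict(match_labels)}
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Backup",
        "metadata": {"name": name, "namespace": "velero"},
        "spec": spec,
    }


manifest = velero_backup_manifest("pre-upgrade", ["shop"], {"app": "checkout"})
print(json.dumps(manifest, indent=2))
```

The same backup would normally be requested through the command-line client, for example with velero backup create pre-upgrade --include-namespaces shop --selector app=checkout, which creates the custom resource on the cluster for the Velero controllers to process.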

Velero is ideal for the disaster recovery use case, as well as for creating snapshots of an application state, prior to performing system operations, such as upgrades, on a cluster.

Snapshot VM Management

A snapshot preserves the state and data of a VM at the time the snapshot is taken. An image of the VM in its given state is copied and stored. Snapshots are useful when there is a need to revert to a previous VM state, but it is not desirable or feasible to create multiple VMs.

Multiple snapshots of a VM can be taken to create restoration positions in a linear process. Many positions can be saved this way to accommodate different kinds of work processes. Snapshots operate on individual VMs.

Snapshots are useful as a short-term solution for testing software with unknown or potentially harmful effects. For example, a snapshot can be used as a restoration point during a linear or iterative process, such as installing update packages, or when installing different versions of a program. This method ensures that each installation begins from an identical baseline.

Several operations for creating and managing VM snapshots and snapshot trees are available in the vSphere Client. These operations allow a snapshot to be created, any snapshot in the hierarchy to be restored, and snapshots to be deleted. Snapshot trees save the VM state at specific points in time so that it can be restored later. Each branch in a tree can have up to 32 snapshots.
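
A snapshot tree and its per-branch limit can be modeled with a simple parent/child structure. The class below is an illustrative sketch with hypothetical names; it is not the vSphere API, and it enforces the limit of 32 by counting the snapshots on the path from the root to a given node.

```python
from typing import List, Optional

MAX_BRANCH_LENGTH = 32  # per-branch limit stated above


class SnapshotNode:
    """One snapshot in a tree; each node may have several child snapshots."""

    def __init__(self, name: str, parent: Optional["SnapshotNode"] = None):
        self.name = name
        self.parent = parent
        self.children: List["SnapshotNode"] = []

    def branch_length(self) -> int:
        # Number of snapshots from the root down to this node, inclusive.
        length, node = 1, self.parent
        while node is not None:
            length += 1
            node = node.parent
        return length

    def take_child(self, name: str) -> "SnapshotNode":
        if self.branch_length() >= MAX_BRANCH_LENGTH:
            raise ValueError("branch already holds 32 snapshots")
        child = SnapshotNode(name, parent=self)
        self.children.append(child)
        return child


# Two branches from the same baseline, e.g. two versions of a program.
baseline = SnapshotNode("clean-install")
version_a = baseline.take_child("version-a-installed")
version_b = baseline.take_child("version-b-installed")
```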

A snapshot preserves the following information:

VM settings. The VM directory, which includes the disks added or changed after the snapshot is taken.
Power state. The VM can be powered on, powered off, or suspended.
Disk state. State of all the VM's virtual disks.
(Optional) Memory state. The contents of the VM's memory.