This appendix aggregates all design decisions of the Site Protection and Disaster Recovery for VMware Cloud Foundation validated solution. Use this list as a reference for the end state of the environment, and to track how closely you adhere to the design and the justification for any deviations.

Deployment Specification

Table 1. Design Decisions on Site Recovery Manager Deployment

Decision ID​

Design Decision​

Design Justification​

Design Implication​

SPR-SRM-CFG-001

Deploy Site Recovery Manager as a virtual appliance.

Allows you to orchestrate the recovery of the VMware Cloud Foundation management components in another VMware Cloud Foundation instance.

None.

SPR-SRM-CFG-002

Deploy each Site Recovery Manager instance in the management domain.

Provides a consistent deployment model for all management applications.

None.

Table 2. Design Decisions on Site Recovery Manager Sizing

Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-CFG-003

Deploy the Site Recovery Manager virtual appliance using the Light deployment type.

Provides the highest level of availability by protecting the management components. This size also accommodates the following setup:

  • The number of protected management virtual machines as defined in Management workloads with failover support

  • Three protection groups

  • Three recovery plans

None.

Table 3. Design Decisions on Replication Technology

Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-CFG-004

Use vSphere Replication in Site Recovery Manager as the protection method for virtual machine replication.

  • Allows for flexibility in storage usage and vendor selection between the two disaster recovery VMware Cloud Foundation instances.

  • Minimizes the administrative overhead required to maintain Storage Replication Adapter compatibility between the two VMware Cloud Foundation instances used for disaster recovery.

  • All management components must be in the same cluster.

  • The total number of virtual machines configured for protection using vSphere Replication is reduced compared with the use of storage-based replication.

Table 4. Design Decisions on vSphere Replication Deployment and Sizing

Decision ID

Design Decision

Design Justification

Design Implication

SPR-VR-CFG-001

Deploy each vSphere Replication appliance in the vCenter Server it will be registered with.

vSphere Replication must be deployed in the vCenter Server it is registered with as it discovers the certificate thumbprint during the OVF deployment via the OVF environment.

None.

SPR-VR-CFG-002

Deploy each vSphere Replication appliance using the 4 vCPU size.

Accommodates the replication of the expected number of virtual machines that are a part of the following components:

  • VMware Aria Suite Lifecycle

  • Workspace ONE Access

  • VMware Aria Automation

  • VMware Aria Operations

None.

Network Design

Table 5. Design Decisions on the Network Segments for the Site Recovery Manager and vSphere Replication

Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-NET-001

Place the Site Recovery Manager instances on the management network.

Places the Site Recovery Manager on the same network as the VMware Cloud Foundation components that the appliance must communicate with.

None.

SPR-VR-NET-001

Place the vSphere Replication instances on the management network.

Places the vSphere Replication on the same network as the VMware Cloud Foundation components that the appliance must communicate with.

None.

Table 6. Design Decisions on the IP Addressing for the Site Recovery Manager and vSphere Replication

Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-NET-002

Allocate and assign a static IP address to the Site Recovery Manager instances.

Using assigned IP addresses removes the constraints and risks associated with providing and managing DHCP on your management networks.

The use of static IP addresses requires precise IP address management.

SPR-VR-NET-002

Allocate and assign a static IP address to the vSphere Replication instances.

Using assigned IP addresses removes the constraints and risks associated with providing and managing DHCP on your management networks.

The use of static IP addresses requires precise IP address management.

Table 7. Design Decisions on Name Resolution for Site Recovery Manager and vSphere Replication

Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-NET-003

Configure both forward (A) and reverse (PTR) DNS records for each Site Recovery Manager instance.

Site Recovery Manager is accessible using a fully qualified domain name.

  • DNS infrastructure services must be available in the environment.

  • You must establish the DNS records (A and PTR) for each Site Recovery Manager instance.

  • Firewalls between Site Recovery Manager instances and each DNS server must allow traffic for DNS.

SPR-VR-NET-003

Configure both forward (A) and reverse (PTR) DNS records for each vSphere Replication instance.

vSphere Replication is accessible using a fully qualified domain name.

  • DNS infrastructure services must be available in the environment.

  • You must establish the DNS records (A and PTR) for each vSphere Replication instance.

  • Firewalls between vSphere Replication instances and each DNS server must allow traffic for DNS.
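The forward and reverse record requirement in Table 7 can be checked with a short script. The following is a minimal sketch, not part of the validated solution: the function name `check_dns_records` and the example host names are hypothetical, and the lookups use the Python standard library resolver calls `socket.gethostbyname` and `socket.gethostbyaddr`. The resolvers are passed in as parameters so that the logic can be exercised without a live DNS server.

```python
import socket
from typing import Callable


def check_dns_records(
    fqdn: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], str] = lambda ip: socket.gethostbyaddr(ip)[0],
) -> bool:
    """Return True when the A record and the PTR record for fqdn agree.

    The function name and its parameters are hypothetical, for illustration.
    """
    try:
        ip_address = forward(fqdn)           # forward (A) lookup
        resolved_name = reverse(ip_address)  # reverse (PTR) lookup
    except OSError:
        # socket.gaierror and socket.herror are subclasses of OSError.
        return False
    # DNS names are case-insensitive; ignore a trailing root dot.
    return resolved_name.rstrip(".").lower() == fqdn.rstrip(".").lower()


if __name__ == "__main__":
    # Hypothetical FQDNs; replace with the appliance names in your environment.
    for name in ("srm.example.test", "vr.example.test"):
        print(name, "OK" if check_dns_records(name) else "A/PTR mismatch or missing")
```

Run it from a host that uses the same DNS servers as the appliances, so that the result reflects what Site Recovery Manager and vSphere Replication would resolve.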

Table 8. Design Decisions on Time Synchronization for Site Recovery Manager and vSphere Replication

Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-NET-004

Configure the Site Recovery Manager instances to use NTP servers rather than using VMware Tools to synchronize with the ESXi hosts on which they are running.

  • Ensures that Site Recovery Manager has accurate time synchronization.

  • Assists in the prevention of time mismatch between the management components.

  • NTP services must be available in the environment.

  • Firewalls between Site Recovery Manager and each NTP server must allow traffic for NTP.

SPR-VR-NET-004

Configure the vSphere Replication instances to use NTP servers rather than using VMware Tools to synchronize with the ESXi hosts on which they are running.

  • Ensures that vSphere Replication has accurate time synchronization.

  • Assists in the prevention of time mismatch between the management components.

  • NTP services must be available in the environment.

  • Firewalls between vSphere Replication and each NTP server must allow traffic for NTP.

Life Cycle Management Design

Table 9. Design Decisions on Life Cycle Management of Site Recovery Manager and vSphere Replication

Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-LCM-001

Life cycle management of Site Recovery Manager is provided using the native tools in the appliance.

Site Recovery Manager is not managed by SDDC Manager.

Deployment, patching, updates, and upgrades of Site Recovery Manager are performed without native automation.

SPR-VR-LCM-001

Life cycle management of vSphere Replication is provided using the native tools in the appliance.

vSphere Replication is not managed by SDDC Manager.

Deployment, patching, updates, and upgrades of vSphere Replication are performed without native automation.

Information Security and Access Design

Table 10. Design Decisions on Identity Management for Site Recovery Manager and vSphere Replication

Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-SEC-001

Configure a service account in vCenter Server for application-to-application communication from Site Recovery Manager to vSphere. ​This user account must be a member of the vCenter Single Sign-On administrator group.

Provides the following access control features:​

  • Site Recovery Manager accesses vSphere with the required set of permissions to perform disaster recovery failover orchestration and site pairing.​

  • In the event of a compromised account, the accessibility in the destination application remains restricted.​

  • You can introduce improved accountability in tracking request-response interactions between the components of the SDDC.​

You must maintain the service account's life cycle outside of VMware Cloud Foundation to ensure its availability.​

SPR-VR-SEC-002

Configure a service account in vCenter Server for application-to-application communication from vSphere Replication to vSphere. This user account must be a member of the vCenter Single Sign-On administrator group.​

Provides the following access control features:​

  • vSphere Replication accesses vSphere with the set of permissions required to perform site-to-site replication of virtual machines.

  • In the event of a compromised account, the accessibility in the destination application remains restricted.​

  • You can introduce improved accountability in tracking request-response interactions between the components of the SDDC.​

You must maintain the service account's life cycle outside of VMware Cloud Foundation to ensure its availability.​

SPR-VR-SEC-003

Use global permissions when you create the Site Recovery Manager and vSphere Replication service accounts in vCenter Server.​

Simplifies and standardizes the deployment of the service account across all vCenter Server instances in the same vSphere domain.​

  • Provides a consistent authorization layer.​

  • Reduces the effort of connecting additional Site Recovery Manager instances to the vCenter Server instances.

All vCenter Server instances must be in the same vSphere domain.​

Table 11. Design Decisions on Password Policies for vSphere Replication and Site Recovery Manager

Decision ID

Design Decision

Design Justification

Design Implication

SPR-VR-SRM-SEC-004

Configure the password expiration policy for the vSphere Replication and the Site Recovery Manager appliances.

  • You configure the password expiration policy for the vSphere Replication appliance and the Site Recovery Manager appliance to align with the requirements of your organization which might be based on industry compliance standards.

  • The policy is applicable only to the local users for vSphere Replication and Site Recovery Manager.

You can manage the password expiration policy on the vSphere Replication appliance and the Site Recovery Manager appliance by using the virtual appliance console or a Secure Shell (SSH) client.

SPR-VR-SRM-SEC-005

Configure the password complexity policy for the vSphere Replication and the Site Recovery Manager appliances.

  • You configure the password complexity policy for the vSphere Replication appliance and the Site Recovery Manager appliance to align with the requirements of your organization which might be based on industry compliance standards.

  • The policy is applicable only to the local users for vSphere Replication and Site Recovery Manager.

You can manage the password complexity policy on the vSphere Replication appliance and the Site Recovery Manager appliance by using the virtual appliance console or a Secure Shell (SSH) client.

SPR-VR-SRM-SEC-006

Configure the account lockout policy for the vSphere Replication and the Site Recovery Manager appliances.

  • You configure the account lockout policy for the vSphere Replication appliance and the Site Recovery Manager appliance to align with the requirements of your organization which might be based on industry compliance standards.

  • The policy is applicable only to the local users for vSphere Replication and Site Recovery Manager.

You can manage the account lockout policy on the vSphere Replication appliance and the Site Recovery Manager appliance by using the virtual appliance console or a Secure Shell (SSH) client.

Table 12. Design Decision on Password Management for vSphere Replication and Site Recovery Manager

Decision ID

Design Decision

Design Justification

Design Implication

SPR-VR-SRM-SEC-007

Change the vSphere Replication appliance and Site Recovery Manager appliance root passwords on a recurring or event-initiated schedule by using the virtual appliance console or a Secure Shell (SSH) client.

By default, the passwords for the vSphere Replication and the Site Recovery Manager root accounts never expire.

You must routinely perform the password change for the root accounts by using the virtual appliance console or a Secure Shell (SSH) client.

SPR-VR-SRM-SEC-008

Change the vSphere Replication appliance and Site Recovery Manager appliance admin account passwords on a recurring or event-initiated schedule by using the virtual appliance console or a Secure Shell (SSH) client.

By default, the passwords for the vSphere Replication and the Site Recovery Manager admin accounts never expire.

You must routinely perform the password change for the admin accounts by using the virtual appliance console or a Secure Shell (SSH) client.

Table 13. Design Decisions on Certificates for Site Recovery Manager and vSphere Replication

Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-SEC-009

Replace the default self-signed certificate in each Site Recovery Manager instance with a CA-signed certificate.​

Ensures that all communication to the externally facing Web UI of Site Recovery Manager and cross-product communication are encrypted.​

You must have access to a Public Key Infrastructure (PKI) to acquire certificates.​

SPR-VR-SEC-0010

Replace the default self-signed certificate in each vSphere Replication instance with a CA-signed certificate.​

Ensures that all communication to the externally facing Web UI for vSphere Replication and cross-product communication are encrypted.​

You must have access to a Public Key Infrastructure (PKI) to acquire certificates.​

Recovery Plan Design

Table 14. Design Decisions on the Configuration of Protected Management Components

Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-CFG-005​

Use Site Recovery Manager and vSphere Replication together to automate the recovery of the following management components:

  • VMware Aria Suite Lifecycle appliance

  • Clustered Workspace ONE Access

  • VMware Aria Operations analytics cluster

  • VMware Aria Automation appliance instances

  • Provides an automated run book for the recovery of the management components in the event of a disaster.

  • Ensures that the recovery of management applications can be delivered in a recovery time objective (RTO) of 4 hours or less.

None.

Table 15. Design Decisions on vSphere Replication Configuration

Decision ID

Design Decision

Design Justification

Design Implication

SPR-VR-CFG-003

Do not activate guest OS quiescing in the policies for the management virtual machines in vSphere Replication.

Not all management virtual machines support the use of guest OS quiescing. Using the quiescing operation might result in an outage.

The replicas of the management virtual machines that are stored in the target VMware Cloud Foundation instance are crash-consistent rather than application-consistent.

SPR-VR-CFG-004

Activate network compression on the management virtual machine policies in vSphere Replication.

  • Ensures the vSphere Replication traffic over the network has a reduced footprint.

  • Reduces the amount of buffer memory used on the vSphere Replication VMs.

To perform compression and decompression of data, the vSphere Replication VMs might require more CPU resources on the source site as more virtual machines are protected.

SPR-VR-CFG-005

Configure a recovery point objective (RPO) of 15 minutes on the management virtual machine policies in vSphere Replication.

  • Ensures that the management application that fails over after a disaster recovery event contains all data except the changes made in the 15 minutes before the event.

Any changes that are made up to 15 minutes before a disaster recovery event are lost.

SPR-VR-CFG-006

Configure point-in-time (PIT) instances, keeping 3 copies over a 24-hour period on the management virtual machine policies in vSphere Replication.

Ensures application integrity for the management application that is failing over after a disaster recovery event occurs.

Increasing the number of retained recovery point instances increases the disk usage on the vSAN datastore.
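The replication settings in Table 15 can be summarized as a small data model. This is a minimal sketch for reasoning about the values only: the class `ReplicationPolicy` and its field names are hypothetical and are not a vSphere Replication API. The average spacing between point-in-time instances is a simple division and only approximates how instances are distributed in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPolicy:
    """Hypothetical model of the per-VM replication settings in Table 15."""

    rpo_minutes: int = 15             # SPR-VR-CFG-005
    pit_instances: int = 3            # SPR-VR-CFG-006
    pit_window_hours: int = 24        # SPR-VR-CFG-006
    guest_quiescing: bool = False     # SPR-VR-CFG-003
    network_compression: bool = True  # SPR-VR-CFG-004

    def worst_case_data_loss_minutes(self) -> int:
        # Changes made within the RPO window before a disaster can be lost.
        return self.rpo_minutes

    def average_pit_spacing_hours(self) -> float:
        # Rough spacing if the retained instances are spread evenly.
        return self.pit_window_hours / self.pit_instances

    def consistency(self) -> str:
        # Without guest OS quiescing, replicas are crash-consistent.
        return "application-consistent" if self.guest_quiescing else "crash-consistent"


if __name__ == "__main__":
    policy = ReplicationPolicy()
    print(policy.worst_case_data_loss_minutes())  # 15
    print(policy.average_pit_spacing_hours())     # 8.0
    print(policy.consistency())                   # crash-consistent
```

With the values from this design, the worst-case data loss is 15 minutes and the three retained instances are on average about 8 hours apart.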

Table 16. Design Decisions on the Startup Order Configuration in Site Recovery Manager

Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-RP-001

Use a prioritized startup order for VMware Aria Suite Lifecycle and the clustered Workspace ONE Access nodes.

  • Ensures that the VMware Aria Suite Lifecycle is started in such an order that the life cycle management services are restored after a disaster.

  • Ensures that the clustered Workspace ONE Access is started in such an order that the authentication services are restored after a disaster.

  • Ensures that the VMware Aria Suite Lifecycle and the clustered Workspace ONE Access services are restored within the target of 4 hours.

You must have VMware Tools running on VMware Aria Suite Lifecycle and each of the clustered Workspace ONE Access nodes.

SPR-SRM-RP-002

Use a prioritized startup order for VMware Aria Operations analytics cluster nodes.

Ensures that the individual nodes in the VMware Aria Operations analytics cluster are started in such an order that the operational monitoring services are restored after a disaster.

  • You must have VMware Tools running on the VMware Aria Operations analytics cluster nodes.

  • You must maintain the customized recovery plan if you increase the number of analytics nodes in the VMware Aria Operations cluster.

SPR-SRM-RP-003

Use a prioritized startup order for VMware Aria Operations remote collector nodes.

  • Ensures that the VMware Aria Operations remote collectors are started in such an order that the operational monitoring services are restored after a disaster.

  • You must have VMware Tools running on the VMware Aria Operations remote collector nodes.

  • You must maintain the customized recovery plan if you increase the number of remote collector nodes in the VMware Aria Operations cluster.

SPR-SRM-RP-004

Use a prioritized startup order for VMware Aria Automation nodes.

  • Ensures that the individual nodes within VMware Aria Automation are started in such an order that cloud automation services are restored after a disaster.

  • Ensures that the VMware Aria Automation services are restored within the target of 4 hours.

You must have VMware Tools installed and running on each VMware Aria Automation node.
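A prioritized startup order can be pictured as grouping virtual machines by priority and starting one group at a time. The following is a minimal sketch under stated assumptions: the function `startup_sequence`, the virtual machine names, and the priority values are hypothetical, because this design does not list the exact priority of each node. It relies only on the fact that Site Recovery Manager starts priority group 1 first and priority group 5 last.

```python
from collections import defaultdict
from typing import Dict, List


def startup_sequence(vm_priorities: Dict[str, int]) -> List[List[str]]:
    """Group VMs by priority and return the groups in startup order.

    Priority 1 starts first and priority 5 starts last. The function and
    the VM names used with it are hypothetical, for illustration only.
    """
    groups: Dict[int, List[str]] = defaultdict(list)
    for vm_name, priority in vm_priorities.items():
        if not 1 <= priority <= 5:
            raise ValueError(f"{vm_name}: priority must be between 1 and 5")
        groups[priority].append(vm_name)
    # Sort the groups by priority and the names within a group for stable output.
    return [sorted(groups[priority]) for priority in sorted(groups)]


if __name__ == "__main__":
    # Hypothetical assignment; take the real priorities from your recovery plan.
    plan = {"node-b": 2, "node-a": 2, "primary": 1}
    for step, vms in enumerate(startup_sequence(plan), start=1):
        print(step, vms)
```

Because each later group waits on the earlier one, the implication about running VMware Tools on every node matters: it lets Site Recovery Manager detect that a virtual machine has started.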

Table 17. Design Decisions on Testing Recovery

Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-RP-005

Do not run test recovery of recovery plans.

Because the protected applications use an NSX load balancer, it is not possible to bring the applications online in an isolated test network.

DNS resolution is also unavailable in an isolated test network.

You cannot test disaster recovery without impacting the running production applications.

Failover Design for SDDC Management Components

Table 18. Design Decisions on Name Resolution for the Clustered Workspace ONE Access Instance and the VMware Aria Suite Products for Failover

Decision ID

Design Decision

Design Justification

Design Implication

SPR-DNS-NET-001

In an environment with multiple VMware Cloud Foundation instances, configure the DNS settings for each protected component to use the DNS servers across all VMware Cloud Foundation instances.

Each protected component can resolve names by using the available DNS servers during a planned migration or disaster recovery between VMware Cloud Foundation instances.

As you scale from a single VMware Cloud Foundation instance to multiple VMware Cloud Foundation instances, you must update the DNS settings on each protected component.

Table 19. Design Decisions on Time Synchronization for the Clustered Workspace ONE Access Instance and the VMware Aria Suite Products for Failover

Decision ID

Design Decision

Design Justification

Design Implication

SPR-NTP-NET-001

In an environment with multiple VMware Cloud Foundation instances, configure the NTP settings for each protected component to use the NTP servers across all VMware Cloud Foundation instances.

Each protected component can synchronize time with the available NTP servers during a planned migration or disaster recovery between VMware Cloud Foundation instances.

As you scale from a single VMware Cloud Foundation instance to multiple VMware Cloud Foundation instances, you must update the NTP settings on each protected component.

Solution Interoperability

Table 20. Design Decisions on Management Packs for VMware Aria Operations with Site Protection and Disaster Recovery

Design Decision ID

Design Decision

Design Justification

Design Implication

SPR-VROPS-CFG-001

Install the Site Recovery Manager management pack for VMware Aria Operations.

Establishes the communication between VMware Aria Operations and VMware Site Recovery Manager endpoints.

You must install the management pack manually.

SPR-VROPS-CFG-002

Install the vSphere Replication management pack for VMware Aria Operations.

Establishes the communication between VMware Aria Operations and VMware vSphere Replication endpoints.

You must install the management pack manually.

SPR-VROPS-CFG-003

Configure the following endpoints to use the remote collector group:

  • Site Recovery Manager

  • vSphere Replication

Local-instance components are configured to use the remote collector group. This offloads data collection for local management components from the analytics cluster.

None.

Table 21. Design Decisions on Service Accounts for VMware Aria Operations Management Packs for Site Protection and Disaster Recovery

Design Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-SEC-003

Configure a service account in vCenter Server with global permissions, for application-to-application communication from the Site Recovery Manager adapters in VMware Aria Operations to vSphere and Site Recovery Manager, and assign the Read Only role.

Provides the following access control features:

  • The adapters in VMware Aria Operations access vSphere and Site Recovery Manager with the minimum set of permissions that are required to collect metrics.

  • In the event of a compromised account, the accessibility in the destination application remains restricted.

  • You can introduce improved accountability in tracking request-response interactions between the components of the SDDC.

You must maintain the life cycle and availability of the service account outside of the SDDC stack.

SPR-VR-SEC-003

Configure a service account in vCenter Server with global permissions, for application-to-application communication from the vSphere Replication adapters in VMware Aria Operations to vSphere and vSphere Replication, and assign the VRM replication viewer role.

Provides the following access control features:

  • The adapters in VMware Aria Operations access vSphere and vSphere Replication with the minimum set of permissions that are required to collect metrics.

  • In the event of a compromised account, the accessibility in the destination application remains restricted.

  • You can introduce improved accountability in tracking request-response interactions between the components of the SDDC.

You must maintain the life cycle and availability of the service account outside of the SDDC stack.

Table 22. Design Decisions on Logging Sources for VMware Aria Operations for Logs with Site Protection and Disaster Recovery

Decision ID

Design Decision

Design Justification

Design Implication

SPR-SRM-LOG-001

When using Site Recovery Manager, install and configure the VMware Aria Operations for Logs agent on the Site Recovery Manager appliance.

Simplifies configuration of log sources in the SDDC that are packaged with the VMware Aria Operations for Logs agent.

You must configure the VMware Aria Operations for Logs agent to forward logs to the VMware Aria Operations for Logs VIP.

SPR-SRM-LOG-002

Configure the VMware Aria Operations for Logs agent to transmit logs from the Site Recovery Manager instance to the adjacent VMware Aria Operations for Logs in the VMware Cloud Foundation instance using the VMware Aria Operations for Logs ingestion API, cfapi, on port 9000.

Ensures the transmission of logs from the Site Recovery Manager instance to the adjacent VMware Aria Operations for Logs by using the Ingestion API.

This configuration is unencrypted. To encrypt the transmission of logs from the Site Recovery Manager instance by using TLS, you must edit the agent configuration (/etc/liagent.ini) on the Site Recovery Manager instance so that it sends logs to VMware Aria Operations for Logs by using the ingestion API, cfapi, on port 9543.

SPR-SRM-LOG-003

When using vSphere Replication, install and configure the VMware Aria Operations for Logs agent on the vSphere Replication appliance.

The VMware Aria Operations for Logs agent is required to collect and transfer logs to the VMware Aria Operations for Logs instances.

You must configure the VMware Aria Operations for Logs agent to forward logs to the VMware Aria Operations for Logs VIP.

SPR-SRM-LOG-004

Configure the VMware Aria Operations for Logs agent to transmit logs from the vSphere Replication instance to the adjacent VMware Aria Operations for Logs in the VMware Cloud Foundation instance using the VMware Aria Operations for Logs ingestion API, cfapi, on port 9000.

Ensures the transmission of logs from the vSphere Replication instance to the adjacent VMware Aria Operations for Logs by using the Ingestion API.

This configuration is unencrypted. To encrypt the transmission of logs from the vSphere Replication instance by using TLS, you must edit the agent configuration (/etc/liagent.ini) on the vSphere Replication instance so that it sends logs to VMware Aria Operations for Logs by using the ingestion API, cfapi, on port 9543.

SPR-SRM-LOG-005

Configure a dedicated Photon OS agent group and assign the Site Recovery Manager and vSphere Replication FQDNs.

  • Provides a standardized configuration to all VMware Aria Operations for Logs agents in each of the groups.

  • Defines the VMware Aria Operations for Logs agent configuration for log collection and parsing in the context of the SDDC components, such as specific log directories, files, and formats.

Adds minimal load to VMware Aria Operations for Logs.
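The two agent transport options described in SPR-SRM-LOG-002 and SPR-SRM-LOG-004 differ only in the port and the ssl setting of the agent configuration. The following is a minimal sketch that renders the [server] section for either case. The function name `render_server_section` and the example host name are hypothetical. The keys hostname, proto, port, and ssl are the ones commonly found in liagent.ini, so verify them against the agent documentation for your version before you apply the output.

```python
import configparser
import io


def render_server_section(hostname: str, encrypted: bool) -> str:
    """Render a [server] section for the agent configuration file.

    Port 9000 is the unencrypted cfapi port and port 9543 is the cfapi port
    used with TLS, as stated in the design decisions. The function name is
    hypothetical.
    """
    config = configparser.ConfigParser(interpolation=None)
    config["server"] = {
        "hostname": hostname,
        "proto": "cfapi",
        "port": "9543" if encrypted else "9000",
        "ssl": "yes" if encrypted else "no",
    }
    buffer = io.StringIO()
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical FQDN of the VMware Aria Operations for Logs VIP.
    print(render_server_section("logs.example.test", encrypted=True))
```

The sketch only prints text. Merging the section into the existing file and restarting the agent are left to your own change process.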

Table 23. Design Decisions on Name Resolution for VMware Aria Operations for Logs for Multiple VMware Cloud Foundation Instances

Decision ID

Design Decision

Design Justification

Design Implication

SPR-VRLI-NET-001

For all applications that are capable of failing over between VMware Cloud Foundation instances, such as VMware Aria Automation and VMware Aria Operations, when you configure logging, use the FQDN of the VMware Aria Operations for Logs ILB in the protected instance.

Logging continues during a partial failover to a recovery VMware Cloud Foundation instance.

  • If VMware Aria Automation and VMware Aria Operations are failed over to a recovery VMware Cloud Foundation instance and the VMware Aria Operations for Logs cluster is no longer available in the protected VMware Cloud Foundation instance, you must update the A record on the child DNS server to point to the VMware Aria Operations for Logs cluster in the recovery VMware Cloud Foundation instance.

  • You must set ssl=no for the VMware Aria Operations for Logs agents.