The SASE Orchestrator Disaster Recovery (DR) feature prevents the loss of stored data and resumes SASE Orchestrator services in the event of system or network failure.

SASE Orchestrator DR involves setting up an active/standby SASE Orchestrator pair with data replication and a manually-triggered failover mechanism.
  • The Recovery Time Objective (RTO), therefore, is dependent on explicit action by the operator to trigger promotion of the standby.
  • The Recovery Point Objective (RPO), however, is essentially zero, regardless of the recovery time, because all configuration is instantaneously replicated. Monitoring data that would have been collected during the outage is cached on the Edges and Gateways pending promotion of the standby.
Note: DR is mandatory. For licensing and pricing, contact the VMware sales team for support.

Active/Standby Pair

In a SASE Orchestrator DR deployment, two identical SASE Orchestrator systems are configured as an active / standby pair. The operator can view the state of DR readiness through the web UI on either of the servers. Edges and gateways are aware of both SASE Orchestrators, and while they receive configuration changes only from the active SASE Orchestrator, they periodically send DR heartbeats to both systems to report their view of both servers and to query the DR system status. When the operator triggers a failover, the Edges and Gateways are informed of the change in their next DR heartbeat.

DR States

From the view of an operator, and the Edges and Gateways, a SASE Orchestrator has one of the following four DR states:

DR State Description
Standalone No DR configured.
Active DR configured, acting as the primary SASE Orchestrator server.
Standby DR configured, acting as an inactive replica SASE Orchestrator server.
Zombie DR formerly configured and active but no longer acting as the active or standby.

Run-time Operation

When DR is configured, the standby server runs in a limited mode, blocking all API calls except those related to the DR status and the DR heartbeats. When the operator invokes a failover, the standby is promoted to become fully operational as a Standalone server. The server that was formerly active is automatically transitioned to a Zombie state if it is responsive and visible from the promoted standby. In the Zombie state, management configuration services are blocked and any contact from edges and gateways that have not transitioned to the new active SASE Orchestrator are redirected to the promoted server.

disaster-recovery-replica-and-status

Set Up SASE Orchestrator Replication

Two installed SASE Orchestrator instances are required to initiate replication.

  • The selected standby is put into a STANDBY_CANDIDATE state, enabling it to be configured by the active server.
  • The active server is then given the address and credentials of the standby and it enters the ACTIVE_CONFIGURING state.
When a STANDBY_CONFIG_RQST is made from active to standby, the two servers synchronize through the state transitions.
The two Orchestrators on which Disaster Recovery (DR) need to be established must have same time. Before you initiate SASE Orchestrator replication, ensure you check the following NTP configurations:
  • The Gateway time zone must be set to Etc/UTC. Use the following command to view the NTP time zone.
    vcadmin@vcg1-example:~$ cat /etc/timezone
    Etc/UTC
    vcadmin@vcg1-example:~$ 
    

    If the time zone is incorrect, use the following commands to update the time zone.

    echo "Etc/UTC" | sudo tee /etc/timezone
    sudo dpkg-reconfigure --frontend noninteractive tzdata
    
  • The NTP offset must be less than or equal to 15 milliseconds. Use the following command to view the NTP offset.
    sudo ntpqvcadmin@vcg1-example:~$ sudo ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *ntp1-us1.prod.v 74.120.81.219    3 u  474 1024  377   10.171   -1.183   1.033
     ntp1-eu1-old.pr .INIT.          16 u    - 1024    0    0.000    0.000   0.000
    vcadmin@vcg1-example:~$ 
    

    If the offset is incorrect, use the following commands to update the NTP offset.

    sudo systemctl stop ntp
    sudo ntpdate <server>
    sudo systemctl start ntp
  • By default, a list of NTP Servers are configured in the /etc/ntpd.conf file. The Orchestrators on which DR need to be established must have Internet to access the default NTP Servers and ensure the time is in sync on both the Orchestrators. Customers can also use their local NTP server running in their environment to sync time.

Set Up the Standby Orchestrator

To set up the Standby Orchestrator, perform the following steps:

  1. In the SD-WAN service of the Enterprise Portal, click Orchestrator tab and then from the left pane click Replication button to display the Orchestrator Replication screen.
  2. Activate the Standby Orchestrator by selecting the Standby (Replication Role) radio button.
  3. Click Enable for Standby button.

    The Standby Orchestrator page appears.

  4. Enter the manual configuration parameters and click Update configuration info button.

    After the Standby Orchestrator has been configured for replication, configure the Active Orchestrator according to the instructions below.

Set Up the Active Orchestrator

To set up the Active Orchestrator, select the Replication Role as Active and configure the following:

Option Description
Select Replication Role Select the Active radio button for the replication role.
Standby Orchestrator Address Enter the primary Standby Orchestrator IP Address.
Standby Orchestrator Address (IPv6) Enter the Standby Orchestrator IPv6 Address.
Standby Orchestrator Secondary Address Enter the address of the standby Orchestrator's secondary interface. This address is used for replication if the standby is promoted to active. Users can add Ipv4/Ipv6 or FQDN address here.
Standby Orchestrator UUID Enter the UUID of the standby Orchestrator.
Configuration Mode Select the Auto Configure Standby or Manually Configure Standby radio button based on the requirement.

When configured manually, paste a string value from ACTIVE VCO to STANDBY_WAIT

.
Superuser Username Enter the display name for the Orchestrator Superuser .
Standby Orchestrator Superuser Password Enter the password for the Orchestrator Superuser.
Note: Starting from the 4.5 release, the use of the special character "<" in the password is no longer supported. In cases where users have already used "<" in their passwords in previous releases, they must remove it to save any changes on the page.
  1. Click Enable for Active button to activate replication role.

When configuration is complete, both Orchestrators (Standby and Active) are in sync.

Standby Orchestrator in Sync

Active Orchestrator in Sync

Test Failover

The following testing failover scenarios are forced failovers for example purposes. You can perform these actions in the Available Actions area of the Active and Standby screens.

Promote a Standby Orchestrator

This section describes how to promote a Standby Orchestrator.

To promote a Standby Orchestrator, perform the following steps:

  1. Click the unlock link.
  2. Click the Promote Standby button in the Available Actions area on the Standby Orchestrator screen.

    The following dialog box appears, indicating that when you promote your Standby Orchestrator, administrators can no longer be able to manage the SASE Orchestrator using the previously Active Orchestrator.

  3. Click the Promote Standby button to promote the Standby Orchestrator.
  4. Click Force Promote Standby to promote the Orchestrator.

    A final dialog box appears indicating that the Orchestrator is no longer a Standby and restarts in Standalone mode.

When you promote a Standby Orchestrator, it restarts in Standalone mode.

If the Standby can communicate with the formerly Active Orchestrator, it instructs that Orchestrator to enter a Zombie state. In Zombie state, the Orchestrator communicates with its clients (edges, gateways, UI/API) that it is no longer active, and that they must communicate with the newly promoted Orchestrator. If the promoted Standby cannot communicate with the formerly Active Orchestrator, the operator should, if possible, manually demote the formerly Active Orchestrator.

Return to Standalone Mode

To return the Zombie to standalone mode, click the Return to Standalone Mode button in the Available Actions area on the Active Orchestrator or Standby Orchestrator screens.

Note:

The Orchestrator can be returned to the Standalone mode from the Zombie state after the time specified in the system property "vco.disasterRecovery.zombie.expirySeconds," which is defaulted to 1800 seconds.

Troubleshooting SASE Orchestrator DR

This section describes the failure states of the system. These are also listed in the UI, along with a more detailed description of the failure. Additional information is available in the VMware log.

Recoverable Failures

The following errors are recoverable failures that can occur after SASE Orchestrator DR reaches an in sync state. If the problem causing these failures is corrected, SASE Orchestrator DR automatically returns to normal operation.

  • FAILURE_SYNCING_FILES
  • FAILURE_GET_STANDBY_STATUS
  • FAILURE_MYSQL_ACTIVE_STATUS
  • FAILURE_MYSQL_STANDBY_STATUS

Unrecoverable Failures

The following failures can occur during configuration of the SASE Orchestrator DR. SASE Orchestrator DR does not automatically recover from these failures.

  • FAILURE_ACTIVE_CONFIGURING
  • FAILURE_LAUNCHING_STANDBY
  • FAILURE_STANDBY_CONFIGURING
  • FAILURE_COPYING_DB
  • FAILURE_COPYING_FILES
  • FAILURE_SYNC_CONFIGURING
  • FAILURE_GET_STANDBY_CONFIG
  • FAILURE_STANDBY_CANDIDATE
  • FAILURE_STANDBY_UNCONFIG
  • FAILURE_STANDBY_PROMOTION
  • FAILURE_ACTIVE_DEMOTION