Configure Failure Domains

A failure domain is a logical grouping of NSX Edge nodes within an NSX Edge cluster. Failure domains complement the auto-placement algorithm and guarantee service availability in case of a failure affecting multiple NSX Edge nodes.

With failure domains configured, the active and standby instances of a Tier-1 SR, or the members of a sub-cluster, always run in different failure domains. Without failure domains, a Tier-1 SR could be auto-placed on NSX Edge nodes that are in the same rack. If that rack fails, both the active and standby instances of the Tier-1 SR fail as well.

Without Failure Domains configured:

- In an Edge cluster comprising four Edge nodes (EdgeNode1, EdgeNode2, EdgeNode3, EdgeNode4), any new Tier-1 gateways in A/S mode are automatically placed on any two of those four Edge nodes.

- However, high availability cannot be achieved if one Tier-1 A/S pair is deployed entirely in Rack1 and another A/S pair is deployed entirely in Rack2. If Rack1 fails, the active and standby Tier-1 instances on EdgeNode1 and EdgeNode2 are both lost because they are in the same rack.
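
The risk described above can be made concrete with a short enumeration. This is only an illustrative sketch: it assumes the rack layout implied by the example (EdgeNode1 and EdgeNode2 in Rack1, EdgeNode3 and EdgeNode4 in Rack2) and simply lists which two-node placements would put an A/S pair in a single rack. It does not model how NSX actually selects nodes.

```python
from itertools import combinations

# Assumed layout, taken from the example: two Edge nodes per rack.
RACK_OF = {
    "EdgeNode1": "Rack1",
    "EdgeNode2": "Rack1",
    "EdgeNode3": "Rack2",
    "EdgeNode4": "Rack2",
}


def same_rack_pairs(rack_of):
    """Return every two-node placement whose members share a rack."""
    return [
        (a, b)
        for a, b in combinations(sorted(rack_of), 2)
        if rack_of[a] == rack_of[b]
    ]


all_pairs = list(combinations(sorted(RACK_OF), 2))
risky = same_rack_pairs(RACK_OF)
print(f"{len(risky)} of {len(all_pairs)} possible placements share a rack: {risky}")
```

Under these assumptions, any placement that picks two nodes without regard to rack can select a pair that a single rack failure takes down.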

With Failure Domains configured:

- EdgeNode1 and EdgeNode2 are configured as part of failure domain 1, while EdgeNode3 and EdgeNode4 are in failure domain 2. When a new Tier-1 SR is created and its active instance is hosted on EdgeNode1, the standby Tier-1 SR is instantiated in failure domain 2 (EdgeNode3 or EdgeNode4).

- After configuring Failure Domains on an Edge cluster, any new Tier-1 Active/Standby SRs are correctly placed in different Failure Domains.
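
The placement constraint in the bullets above can be sketched as a small function. This is a simplified illustration, not the NSX auto-placement algorithm: the function name `place_active_standby` and the tie-breaking rule (first candidate in sorted order) are assumptions made for the example.

```python
# Failure domain membership from the example above.
FAILURE_DOMAIN_OF = {
    "EdgeNode1": "failure-domain-1",
    "EdgeNode2": "failure-domain-1",
    "EdgeNode3": "failure-domain-2",
    "EdgeNode4": "failure-domain-2",
}


def place_active_standby(failure_domain_of, active_node):
    """Pick a standby node outside the active node's failure domain.

    Hypothetical helper: the real selection logic is internal to NSX.
    """
    active_fd = failure_domain_of[active_node]
    candidates = sorted(
        node for node, fd in failure_domain_of.items() if fd != active_fd
    )
    if not candidates:
        raise ValueError("no Edge node available in a different failure domain")
    return active_node, candidates[0]


active, standby = place_active_standby(FAILURE_DOMAIN_OF, "EdgeNode1")
print(f"active={active} standby={standby}")
```

The key property is that the standby is never drawn from the active instance's failure domain, so a single-domain failure cannot remove both.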

Procedure

Using the API, create failure domains for the Edge nodes that you will add to the stateful A-A cluster. For example, FailureDomain1 includes FD1-EdgeNode1 and FD1-EdgeNode2, and FailureDomain2 includes FD2-EdgeNode3 and FD2-EdgeNode4. Set the parameter preferred_active_edge_services to true for both failure domains. The preferred_active_edge_services parameter is useful only when a Tier-1 gateway is created in preemptive failover mode.
```
POST /api/v1/failure-domains
{
    "display_name": "FailureDomain1",
    "preferred_active_edge_services": true
}

POST /api/v1/failure-domains
{
    "display_name": "FailureDomain2",
    "preferred_active_edge_services": true
}
```
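
If the request bodies are generated by a script, one body per failure domain keeps each POST a single valid JSON object. The sketch below only builds and prints the bodies. The helper name `failure_domain_payload` is hypothetical, and writing the flag as a JSON boolean is an assumption about the schema. Sending the request and authenticating are left out.

```python
import json


def failure_domain_payload(display_name, preferred_active=True):
    """Build the body for one POST /api/v1/failure-domains call (hypothetical helper)."""
    return {
        "display_name": display_name,
        # Assumption: the API accepts a JSON boolean for this flag.
        "preferred_active_edge_services": preferred_active,
    }


payloads = [
    failure_domain_payload(name) for name in ("FailureDomain1", "FailureDomain2")
]
for payload in payloads:
    print("POST /api/v1/failure-domains")
    print(json.dumps(payload, indent=4))
```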

Using the API, associate each Edge node with the failure domain for the site. First, call the GET /api/v1/transport-nodes/<transport-node-id> API to retrieve the current configuration of the Edge node. Then use the result of the GET API as the input for the PUT /api/v1/transport-nodes/<transport-node-id> API, adding the failure_domain_id property set to the ID of the appropriate failure domain. For example:

GET /api/v1/transport-nodes/<transport-node-id>
Response:
{
    "resource_type": "TransportNode",
    "description": "Updated NSX configured Test Transport Node",
    "id": "77816de2-39c3-436c-b891-54d31f580961",
    ...
}

PUT /api/v1/transport-nodes/<transport-node-id>
{
    "resource_type": "TransportNode",
    "description": "Updated NSX configured Test Transport Node",
    "id": "77816de2-39c3-436c-b891-54d31f580961",
    ...
    "failure_domain_id": "<UUID>"
}
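
The read-modify-write pattern above can be sketched as a pure function that takes the GET response and returns the PUT body. The helper name `with_failure_domain` is hypothetical, the sample dictionary contains only the fields shown in the example (a real response has many more), and `<UUID>` remains a placeholder for the failure domain ID.

```python
import copy


def with_failure_domain(transport_node, failure_domain_id):
    """Return a copy of the GET response with failure_domain_id added.

    All other fields are passed through unchanged so the PUT does not
    drop existing configuration.
    """
    body = copy.deepcopy(transport_node)
    body["failure_domain_id"] = failure_domain_id
    return body


# Abbreviated GET response, limited to the fields shown in the example.
get_response = {
    "resource_type": "TransportNode",
    "description": "Updated NSX configured Test Transport Node",
    "id": "77816de2-39c3-436c-b891-54d31f580961",
}

put_body = with_failure_domain(get_response, "<UUID>")
print(put_body)
```

Copying the response instead of editing it in place keeps the original available for comparison if the PUT is rejected.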

Using the API, configure the Edge cluster to allocate nodes based on failure domain. First, call the GET /api/v1/edge-clusters/<edge-cluster-id> API to retrieve the current configuration of the Edge cluster. Then use the result of the GET API as the input for the PUT /api/v1/edge-clusters/<edge-cluster-id> API, adding the allocation_rules property set appropriately. For example:

GET /api/v1/edge-clusters/<edge-cluster-id>
Response:
{
    "_revision": 0,
    "id": "bf8d4daf-93f6-4c23-af38-63f6d372e14e",
    "resource_type": "EdgeCluster",
    ...
}

PUT /api/v1/edge-clusters/<edge-cluster-id>
{
    "_revision": 0,
    "id": "bf8d4daf-93f6-4c23-af38-63f6d372e14e",
    "resource_type": "EdgeCluster",
    ...
    "allocation_rules": [
        {
            "action": {
                "enabled": true,
                "action_type": "AllocationBasedOnFailureDomain"
            }
        }
    ]
}
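
As with the transport node update, the PUT body can be derived from the GET response. The sketch below is a minimal example with a hypothetical helper name, `with_failure_domain_allocation`. It appends the rule only if it is not already present, so running it twice yields the same body. The sample dictionary is limited to the fields shown above.

```python
import copy

# The allocation rule from the example above.
FAILURE_DOMAIN_RULE = {
    "action": {
        "enabled": True,
        "action_type": "AllocationBasedOnFailureDomain",
    }
}


def with_failure_domain_allocation(edge_cluster):
    """Return a copy of the GET response with the allocation rule added."""
    body = copy.deepcopy(edge_cluster)
    rules = body.setdefault("allocation_rules", [])
    if FAILURE_DOMAIN_RULE not in rules:
        rules.append(copy.deepcopy(FAILURE_DOMAIN_RULE))
    return body


# Abbreviated GET response, limited to the fields shown in the example.
cluster = {
    "_revision": 0,
    "id": "bf8d4daf-93f6-4c23-af38-63f6d372e14e",
    "resource_type": "EdgeCluster",
}

cluster_put_body = with_failure_domain_allocation(cluster)
print(cluster_put_body["allocation_rules"])
```

The `_revision` value is passed through from the GET response unchanged, as in the example request.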

Results

The NSX Edge nodes are now associated with different failure domains. You can use them to create a cluster and configure a Tier-0 gateway in A-A Stateful HA mode.