A failure domain is a logical grouping of NSX Edge nodes within an NSX Edge cluster. Failure domains complement the auto-placement algorithm and guarantee service availability if a failure affects multiple NSX Edge nodes.
When failure domains are configured, the active and standby instances of a Tier-1 SR, or the members of a sub-cluster, always run in different failure domains. Without failure domains, a Tier-1 SR could be auto-placed on two NSX Edge nodes that are in the same rack. If that rack fails, both the active and standby instances of the Tier-1 SR fail with it.
Without Failure Domains configured:
- In an Edge cluster comprising four Edge nodes (EdgeNode1, EdgeNode2, EdgeNode3, EdgeNode4), any new Tier-1 gateway in A/S (active/standby) mode is automatically placed on any two of those four Edge nodes.
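- However, high availability cannot be guaranteed, because both instances of a Tier-1 A/S pair can land in the same rack. For example, if a Tier-1 A/S pair is placed on EdgeNode1 and EdgeNode2 in Rack1 and Rack1 fails, the active and standby instances are lost together. A second A/S pair placed entirely in Rack2 would be exposed to a Rack2 failure in the same way.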
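The exposure described above can be illustrated with a small sketch. It is not the NSX auto-placement algorithm. It simply enumerates every two-node placement in the four-node cluster and flags those whose nodes share a rack. The rack layout (EdgeNode1 and EdgeNode2 in Rack1, EdgeNode3 and EdgeNode4 in Rack2) is an assumption drawn from the example, and the names `RACK_OF` and `same_rack_pairs` are hypothetical.

```python
from itertools import combinations

# Assumed rack layout for illustration: EdgeNode1/2 in Rack1, EdgeNode3/4 in Rack2.
RACK_OF = {
    "EdgeNode1": "Rack1",
    "EdgeNode2": "Rack1",
    "EdgeNode3": "Rack2",
    "EdgeNode4": "Rack2",
}


def same_rack_pairs(rack_of):
    """Return every two-node placement whose nodes sit in the same rack."""
    return [
        pair
        for pair in combinations(sorted(rack_of), 2)
        if rack_of[pair[0]] == rack_of[pair[1]]
    ]


# Every way an A/S pair could be placed when no failure domains constrain it.
all_pairs = list(combinations(sorted(RACK_OF), 2))
risky = same_rack_pairs(RACK_OF)
print(f"{len(risky)} of {len(all_pairs)} possible placements share a rack: {risky}")
```

Under these assumptions, any placement in the `risky` list loses both the active and the standby instance when its rack fails.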
With Failure Domains configured:
- EdgeNode1 and EdgeNode2 are configured as part of failure domain 1, while EdgeNode3 and EdgeNode4 are part of failure domain 2. When a new Tier-1 SR is created and its active instance is hosted on EdgeNode1, the standby instance is instantiated in failure domain 2 (on EdgeNode3 or EdgeNode4).
- After configuring Failure Domains on an Edge cluster, any new Tier-1 Active/Standby SRs are correctly placed in different Failure Domains.
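The placement constraint can be sketched as follows. This is a minimal illustration of the rule "standby goes to a different failure domain than active", not the actual NSX implementation, which may weigh other factors when choosing among eligible nodes. The domain labels and the function names `standby_candidates` and `place_active_standby` are hypothetical, and picking the first eligible node is an arbitrary choice made for the example.

```python
# Failure-domain membership from the example; the label strings are illustrative.
FAILURE_DOMAIN_OF = {
    "EdgeNode1": "failure-domain-1",
    "EdgeNode2": "failure-domain-1",
    "EdgeNode3": "failure-domain-2",
    "EdgeNode4": "failure-domain-2",
}


def standby_candidates(active_node, failure_domain_of):
    """Return the nodes that are outside the active node's failure domain."""
    active_fd = failure_domain_of[active_node]
    return sorted(
        node for node, fd in failure_domain_of.items() if fd != active_fd
    )


def place_active_standby(active_node, failure_domain_of):
    """Pair the active node with a standby node in a different failure domain."""
    candidates = standby_candidates(active_node, failure_domain_of)
    if not candidates:
        raise ValueError("no Edge node available in a different failure domain")
    # Arbitrary tie-break for this sketch: take the first eligible node.
    return active_node, candidates[0]


print(standby_candidates("EdgeNode1", FAILURE_DOMAIN_OF))
print(place_active_standby("EdgeNode1", FAILURE_DOMAIN_OF))
```

With the active instance on EdgeNode1, only EdgeNode3 and EdgeNode4 are eligible for the standby, so a failure confined to failure domain 1 cannot take down both instances.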