The following services are currently configured in High Availability (HA) mode: ZooKeeper, Kafka, Elasticsearch, Postgres, Grafana, and NGINX.

NGINX

Two instances of NGINX will be running at the same time on the control-plane-node-1 and kafkaworker-1 nodes.

If you cannot reach the Operational UI at the control-plane-node-1 IP address, the other instance remains running and reachable at the kafkaworker-1 IP address.
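
If you have kubectl access to the cluster, you can also confirm from the command line which nodes the NGINX replicas are scheduled on. A minimal sketch; the grep filter is illustrative, so adjust it to the actual pod names in your deployment:

```sh
# -o wide adds a NODE column showing where each replica is scheduled.
kubectl get pods --all-namespaces -o wide | grep -i nginx
```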

Kafka Edge

Because Smarts currently does not accept multiple endpoints in its configuration, Kafka Edge is pinned to the domainmanagers-1 node. As a side effect, if the domainmanagers-1 node goes down, communication between Smarts and VMware Telco Cloud Operations is interrupted until the domainmanagers-1 node comes back up.
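
Pinning a workload to a single node is typically done in Kubernetes with a nodeSelector (or a node affinity rule). The sketch below is a minimal, hypothetical illustration of the technique; the Deployment name, labels, and image are placeholders, not the product's actual manifest:

```yaml
# Hypothetical sketch: pin a pod to one node with a nodeSelector.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-edge          # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka-edge
  template:
    metadata:
      labels:
        app: kafka-edge
    spec:
      nodeSelector:
        kubernetes.io/hostname: domainmanagers-1   # schedule only on this node
      containers:
        - name: kafka-edge
          image: kafka-edge:latest                 # placeholder image
```

The trade-off shown here is exactly the one described above: a nodeSelector gives Smarts a single, stable endpoint, but removes the scheduler's freedom to move the pod when that node fails.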

Collectors

Collectors can float among the available domainmanagers nodes, depending on the deployment footprint (1, 2, or 3 domainmanagers nodes may exist, depending on the selected deployment). To find out where a specific collector is running, go to the Operational UI > Administration > Data Collector > Edit collector and check the Node IP field.
Note: If the Netflow collector Pod is running on the DM1 node and the Pod goes down, the collector is provisioned on the DM2 node; the user is not notified when a node goes down. To see where the Netflow collector is running, go to the collector UI > Edit Netflow collector and check the Node IP information.
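
Besides the UI, the scheduled node can be read directly from the pod spec with kubectl; the pod name below is hypothetical:

```sh
# Print only the node a given pod is scheduled on.
kubectl get pod netflow-collector-0 -o jsonpath='{.spec.nodeName}'
```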

Postgres

The Postgres database is deployed in HA mode: primary and replica instances run on two different nodes and can float among the available elasticworkers nodes. If the primary instance goes down or becomes unavailable, the replica instance is promoted to primary.
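
A quick way to watch this behavior, assuming kubectl access (the name filter and pod name are illustrative):

```sh
# Show the Postgres pods, their status, and the node each one runs on.
kubectl get pods -o wide | grep -i postgres

# After a failover, the Events section at the end of the describe output
# shows the reschedule/restart; "postgres-0" is an illustrative pod name.
kubectl describe pod postgres-0
```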

See the following table for the High Availability (HA) enabled services:

| Service/Components | Type of HA | Number of HA Nodes | Number of Failures Tolerated | One Instance Down | VM Down |
|---|---|---|---|---|---|
| Web UI (NGINX) | Kubernetes Replicas | At least 2 and at most N, where N is the number of nodes | N-1, where N is the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
| Reporting (Grafana) | Kubernetes Replicas | At least 2 and at most N, where N is the number of nodes | N-1, where N is the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
| Kafka | Kubernetes StatefulSet | 3+ | Tolerated as long as the number of active nodes >= the default replication factor | Instance will be restarted on another node | Instance will be restarted on another node |
| Zookeeper | Kubernetes StatefulSet | 3+ | (n-1)/2 node failures tolerated to maintain a quorum, where n is the number of nodes in the cluster | Instance will be restarted on another node | Instance will be restarted on another node |
| ElasticSearch | Cluster with auto-replication and leader election; Kubernetes StatefulSet configuration | 3 | 1 node (with replication set to > 1) | Yes: Kubernetes will try to restart the instance to bring it back to the desired number of replicas | Yes: the worker VM will rejoin the cluster upon restart, and the node will rejoin the ES cluster |
| Collector Manager | Kubernetes Replicas | 2 nodes, 2 active instances running | N-1, where N is the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
| NGINX | Kubernetes Replicas | 2 nodes, 2 active instances running | N-1, where N is the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
| Netflow collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance runs at a time | N, where N is the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
| Metric service | Kubernetes Replicas | 2 replicas run on nodes with nodeType equal to application (arangoworker and kafkaworker) | 1 pod failure, or failure of 1 node that has the Metric service running on it | The remaining replicas take over the workload; if a node allows the Metric service but does not have it running, Kubernetes creates a new pod on that node | The remaining replicas take over the workload; if a node allows the Metric service but does not have it running, Kubernetes creates a new pod on that node |
| Data management service | Kubernetes failover | 1 instance; failover handled by Kubernetes | There is a short period of downtime, which is acceptable considering the data management service function. Failures are handled by the Kubernetes failover mechanism: on pod failure, a new pod is created on the same node or a different node; on node failure, a new pod is created on a different node | Pod failure downtime is about 30 seconds; after the downtime, Kubernetes creates a new pod | Node failure downtime is about 10 minutes; after the downtime, Kubernetes creates a new service pod on another node |
| API service | Kubernetes Replicas | 2 replicas run on nodes with nodeType equal to application (arangoworker and kafkaworker) | 1 pod failure, or failure of 1 node that has the API service running on it | The remaining replicas take over the workload; if a node allows the API service but does not have it running, Kubernetes creates a new pod on that node | The remaining replicas take over the workload; if a node allows the API service but does not have it running, Kubernetes creates a new pod on that node |
| Viptela SD-WAN collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance runs at a time | N, where N is the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
| Velocloud SD-WAN collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance runs at a time | N, where N is the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
| Cisco ACI collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance runs at a time | N, where N is the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
| Kafka collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance runs at a time | N, where N is the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
| Vims Clearwater collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance runs at a time | N, where N is the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
| Postgres | Kubernetes StatefulSets; one node runs the Postgres "primary" and a second runs the Postgres "replica" | At least 2 and at most N, where N is the number of nodes | N-1, where N is the number of available replicas | Another replica will be running on another node; if the primary instance is down, Postgres promotes the replica instance to primary | Another replica will be running on another node; if the VM where the primary instance is running goes down, Postgres promotes the replica instance to primary |
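
The recurring "another replica will be running on another node" behavior in the table is typically achieved with pod anti-affinity, which asks the scheduler to spread replicas across nodes. The following is a minimal, hypothetical sketch of that technique; the names and image are illustrative, not the product's actual manifests:

```yaml
# Hypothetical sketch: a 2-replica Deployment spread across nodes, so a
# single node failure leaves the other replica serving traffic.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-ui              # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-ui
  template:
    metadata:
      labels:
        app: web-ui
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: web-ui
              topologyKey: kubernetes.io/hostname   # at most one replica per node
      containers:
        - name: web-ui
          image: nginx:stable                       # placeholder image
```

As a worked example of the ZooKeeper quorum rule in the table, (n-1)/2 means a 3-node ensemble tolerates 1 node failure and a 5-node ensemble tolerates 2; losing more than that loses the quorum.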