The following services are currently configured in High Availability (HA) mode: ZooKeeper, Kafka, Elasticsearch, Postgres, Grafana, and NGINX.
NGINX
Two instances of NGINX run at the same time, one on the control-plane-node-1 node and one on the kafkaworker-1 node.
If you cannot reach the Operational UI from the control-plane-node-1 IP address, the other instance is still running and reachable on the kafkaworker-1 IP address.
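To confirm that both NGINX replicas are up and to see which nodes they are scheduled on, you can query the cluster directly. This is a minimal sketch; the label selector app=nginx is an assumption and may not match the labels used in your deployment.

```
# List NGINX pods together with the nodes they are scheduled on
# (the label selector is assumed and may differ in your deployment)
kubectl get pods -o wide --all-namespaces -l app=nginx
```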
Kafka Edge
Because Smarts currently does not accept multiple endpoints in its configuration, Kafka Edge is pinned to the domainmanagers-1 node. As a side effect, if the domainmanagers-1 node goes down, communication between Smarts and VMware Telco Cloud Operations is interrupted until the domainmanagers-1 node comes back up.
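To verify which node the Kafka Edge pod is running on (for example, after domainmanagers-1 recovers), you can check its placement from the command line. The kafka-edge name filter below is an assumption and may not match the actual pod naming.

```
# Show pods whose name contains "kafka-edge" and the node they run on
# (the name filter is assumed and may differ in your deployment)
kubectl get pods -o wide --all-namespaces | grep -i kafka-edge
```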
Collectors
Collectors run on the available domainmanagers nodes, depending on the deployment footprint (1, 2, or 3 domainmanagers nodes may exist based on the selected deployment). If you need to know where a specific collector is running, go to Operational UI > Administration > Data Collector > Edit collector and check the Node IP field.
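As an alternative to the Operational UI, the node placement of collector pods can also be checked from the command line. This is a sketch only; the collector name filter is an assumption and may not match the actual pod naming.

```
# List collector pods and the domainmanagers nodes they are scheduled on
# (the name filter is assumed and may differ in your deployment)
kubectl get pods -o wide --all-namespaces | grep -i collector
```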
Postgres
The Postgres database is deployed in HA mode: a primary and a replica instance run on two different nodes and can float across the available elasticworkers nodes. If the primary instance goes down or becomes unavailable, the replica instance is promoted to primary.
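A quick way to see where the primary and replica Postgres instances are currently scheduled is to list the Postgres pods with their node placement. The postgres name filter below is an assumption, and which instance is currently acting as primary is decided by the database failover logic rather than by the pod name.

```
# Show the Postgres pods and the elasticworkers nodes they are running on
# (the name filter is assumed and may differ in your deployment)
kubectl get pods -o wide --all-namespaces | grep -i postgres
```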
See the following table for High-Availability (HA) Enabled Services:
Service/Components | Type of HA | Number of HA Nodes | Number of Failures Tolerated | One Instance Down | VM Down |
Web UI (NGINX) | Kubernetes Replicas | At least 2 and at most N, where N equals the number of nodes | N-1, where N equals the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
Reporting (Grafana) | Kubernetes Replicas | At least 2 and at most N, where N equals the number of nodes | N-1, where N equals the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
Kafka | Kubernetes StatefulSet | 3+ | As long as the number of active nodes >= the default replication factor | Instance will be restarted on another node | Instance will be restarted on another node |
Zookeeper | Kubernetes StatefulSet | 3+ | (n-1)/2 node failures tolerated to maintain a quorum, where n is the number of nodes in the cluster | Instance will be restarted on another node | Instance will be restarted on another node |
ElasticSearch | Cluster with auto-replication and leader election (Kubernetes StatefulSet configuration) | 3 | 1 node (with replication set to > 1) | Yes - Kubernetes will try to restart the instance to bring it back to the desired number of replicas | Yes - the worker VM will rejoin the cluster upon restart, and the node will rejoin the ES cluster |
Collector Manager | Kubernetes Replicas | 2 nodes, 2 active instances running | N-1, where N equals the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
NGINX | Kubernetes Replicas | 2 nodes, 2 active instances running | N-1, where N equals the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
Netflow collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
Metric service | Kubernetes Replicas | 2 replicas run on nodes with nodeType equal to application, arangoworker, or kafkaworker | 1 pod failure, or failure of 1 node running the Metric service | The remaining replicas take over the workload; if there are nodes that allow the Metric service and do not have it running, Kubernetes creates a new pod on such a node | The remaining replicas take over the workload; if there are nodes that allow the Metric service and do not have it running, Kubernetes creates a new pod on such a node |
Data management service | Kubernetes failover | 1 instance, with failover handled by Kubernetes | There is a short period of downtime, which is acceptable given the data management service function; the failure is handled by the Kubernetes failover mechanism | Pod failure downtime is about 30 seconds; after the downtime, Kubernetes creates a new pod | Node failure downtime is about 10 minutes; after the downtime, Kubernetes creates a new service pod on another node |
API service | Kubernetes Replicas | 2 replicas run on nodes with nodeType equal to application, arangoworker, or kafkaworker | 1 pod failure, or failure of 1 node running the API service | The remaining replicas take over the workload; if there are nodes that allow the API service and do not have it running, Kubernetes creates a new pod on such a node | The remaining replicas take over the workload; if there are nodes that allow the API service and do not have it running, Kubernetes creates a new pod on such a node |
Viptela SD-WAN collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
Velocloud SD-WAN collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
Cisco ACI collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
Kafka collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
Vims Clearwater collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
Postgres | Kubernetes StatefulSets; one node running the Postgres "primary" and a second node running the Postgres "replica" | At least 2 and at most N, where N equals the number of nodes | N-1, where N equals the number of available replicas | Another replica will be running on another node; if the primary instance goes down, Postgres promotes the replica instance to primary | Another replica will be running on another node; if the VM where the primary instance is running goes down, Postgres promotes the replica instance to primary |
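To get a quick overview of how many replicas each of the services in the table is actually running, and on which nodes, the Deployments, StatefulSets, and pods can be inspected directly. This is a generic sketch and assumes kubectl access to the cluster.

```
# Replica counts for all Deployments and StatefulSets across namespaces
kubectl get deployments,statefulsets --all-namespaces

# Node placement of all pods, useful for confirming that replicas are spread across nodes
kubectl get pods -o wide --all-namespaces
```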