The following services are currently configured in High Availability (HA) mode: ZooKeeper, Kafka, Elasticsearch, Postgres, Grafana, and NGINX.
NGINX
Two instances of NGINX run at the same time, one on the control-plane-node-1 node and one on the kafkaworker-1 node.
If you cannot reach the Operational UI from the control-plane-node-1 IP address, the other instance is still running and reachable on the kafkaworker-1 IP address.
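To confirm that both NGINX replicas are up and to see which nodes they are scheduled on, you can query the cluster directly. This is a minimal sketch; the label selector app=nginx is an assumption and may not match the labels used in your deployment.

```
# List NGINX pods together with the nodes they are scheduled on
# (the label selector is assumed and may differ in your deployment)
kubectl get pods -o wide --all-namespaces -l app=nginx
```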
Kafka Edge
Because Smarts currently does not accept multiple endpoints in its configuration, Kafka Edge is pinned to the domainmanagers-1 node. As a side effect, if the domainmanagers-1 node goes down, communication between Smarts and VMware Telco Cloud Operations is interrupted until the domainmanagers-1 node comes back up.
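To verify which node the Kafka Edge pod is running on (for example, after domainmanagers-1 recovers), you can check its placement from the command line. The kafka-edge name filter below is an assumption and may not match the actual pod naming.

```
# Show pods whose name contains "kafka-edge" and the node they run on
# (the name filter is assumed and may differ in your deployment)
kubectl get pods -o wide --all-namespaces | grep -i kafka-edge
```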
Collectors
Collectors run on the available domainmanagers nodes, depending on the deployment footprint (1, 2, or 3 domainmanagers nodes may exist based on the selected deployment). If you need to know where a specific collector is running, go to Operational UI > Administration > Data Collector > Edit collector and check the Node IP field.
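As an alternative to the Operational UI, the node placement of collector pods can also be checked from the command line. This is a sketch only; the collector name filter is an assumption and may not match the actual pod naming.

```
# List collector pods and the domainmanagers nodes they are scheduled on
# (the name filter is assumed and may differ in your deployment)
kubectl get pods -o wide --all-namespaces | grep -i collector
```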
Postgres
The Postgres database is deployed in HA mode: a primary and a replica instance run on two different nodes and can float across the available elasticworkers nodes. If the primary instance goes down or becomes unavailable, the replica instance is promoted to primary.
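A quick way to see where the primary and replica Postgres instances are currently scheduled is to list the Postgres pods with their node placement. The postgres name filter below is an assumption, and which instance is currently acting as primary is decided by the database failover logic rather than by the pod name.

```
# Show the Postgres pods and the elasticworkers nodes they are running on
# (the name filter is assumed and may differ in your deployment)
kubectl get pods -o wide --all-namespaces | grep -i postgres
```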
See the following table for High-Availability (HA) Enabled Services:
Service/Components | Type of HA | Number of HA Nodes | Number of Failures Tolerated | One Instance Down | VM Down |
Web UI (NGINX) | Kubernetes Replicas | At least 2 and at most N, where N equals the number of nodes | N-1, where N equals the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
Reporting (Grafana) | Kubernetes Replicas | At least 2 and at most N, where N equals the number of nodes | N-1, where N equals the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
Kafka | Kubernetes StatefulSet | 3+ | As long as the number of active nodes >= the default replication factor | Instance will be restarted on another node | Instance will be restarted on another node |
Zookeeper | Kubernetes StatefulSet | 3+ | (n-1)/2 node failures tolerated to maintain a quorum, where n is the number of nodes in the cluster | Instance will be restarted on another node | Instance will be restarted on another node |
ElasticSearch | Cluster with auto-replication and leader election (Kubernetes StatefulSet configuration) | 3 | 1 node (with replication set to > 1) | Yes - Kubernetes will try to restart the instance to bring it back to the desired number of replicas | Yes - the worker VM will rejoin the cluster upon restart, and the node will rejoin the ES cluster |
Collector Manager | Kubernetes Replicas | 2 nodes, 2 active instances running | N-1, where N equals the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
NGINX | Kubernetes Replicas | 2 nodes, 2 active instances running | N-1, where N equals the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
Netflow collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
Metric service | Kubernetes Replicas | 2 replicas run on nodes with nodeType equal to application, arangoworker, or kafkaworker | 1 pod failure, or failure of 1 node running the Metric service | The remaining replicas take over the workload; if there are nodes that allow the Metric service and do not have it running, Kubernetes creates a new pod on such a node | The remaining replicas take over the workload; if there are nodes that allow the Metric service and do not have it running, Kubernetes creates a new pod on such a node |
Data management service | Kubernetes failover | 1 instance, with failover handled by Kubernetes | There is a short period of downtime, which is acceptable given the data management service function; the failure is handled by the Kubernetes failover mechanism | Pod failure downtime is about 30 seconds; after the downtime, Kubernetes creates a new pod | Node failure downtime is about 10 minutes; after the downtime, Kubernetes creates a new service pod on another node |
API service | Kubernetes Replicas | 2 replicas run on nodes with nodeType equal to application, arangoworker, or kafkaworker | 1 pod failure, or failure of 1 node running the API service | The remaining replicas take over the workload; if there are nodes that allow the API service and do not have it running, Kubernetes creates a new pod on such a node | The remaining replicas take over the workload; if there are nodes that allow the API service and do not have it running, Kubernetes creates a new pod on such a node |
Viptela SD-WAN collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
Velocloud SD-WAN collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
Cisco ACI collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
Kafka collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
Vims Clearwater collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
Postgres | Kubernetes StatefulSets; one node running the Postgres "primary" and a second node running the Postgres "replica" | At least 2 and at most N, where N equals the number of nodes | N-1, where N equals the number of available replicas | Another replica will be running on another node; if the primary instance goes down, Postgres promotes the replica instance to primary | Another replica will be running on another node; if the VM where the primary instance is running goes down, Postgres promotes the replica instance to primary |
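To get a quick overview of how many replicas each of the services in the table is actually running, and on which nodes, the Deployments, StatefulSets, and pods can be inspected directly. This is a generic sketch and assumes kubectl access to the cluster.

```
# Replica counts for all Deployments and StatefulSets across namespaces
kubectl get deployments,statefulsets --all-namespaces

# Node placement of all pods, useful for confirming that replicas are spread across nodes
kubectl get pods -o wide --all-namespaces
```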