The following services are currently configured in High Availability (HA) mode: ZooKeeper, Kafka, Elasticsearch, Postgres, Grafana, and NGINX.
NGINX
Two instances of NGINX will be running at the same time, one on the control-plane-node-1 node and one on the kafkaworker-1 node.
If you cannot reach the Operational UI from the control-plane-node-1 IP address, another instance will be running and reachable at the kafkaworker-1 IP address.
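To verify that both NGINX replicas are up and scheduled on different nodes, you can query the cluster directly. The minimal sketch below uses the official Kubernetes Python client; the namespace and label selector are assumptions and may differ in your deployment.

```python
# Minimal sketch: list the NGINX pods and the nodes they run on.
# The namespace ("default") and label selector ("app=nginx") are assumptions --
# adjust them to match the actual deployment.
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when run inside a pod
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(namespace="default", label_selector="app=nginx")
for pod in pods.items:
    # Each replica should report a different node (control-plane-node-1 / kafkaworker-1).
    print(f"{pod.metadata.name}  node={pod.spec.node_name}  "
          f"host_ip={pod.status.host_ip}  phase={pod.status.phase}")
```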
Kafka Edge
Since Smarts currently doesn't accept multiple endpoints in its configuration, Kafka Edge had to be pinned to the domainmanagers-1 node. As a side effect, if the domainmanagers-1 node goes down, communication between Smarts and VMware Telco Cloud Operations will be down until the domainmanagers-1 node comes back up.
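Because Kafka Edge is pinned to a single node, the quickest health check is to confirm that domainmanagers-1 is Ready and see which pods are scheduled on it. The sketch below uses the Kubernetes Python client and assumes cluster access through a kubeconfig; the node name is taken from the text above.

```python
# Minimal sketch: check whether the domainmanagers-1 node is Ready and list the
# pods scheduled on it (the pinned Kafka Edge instance will be among them).
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

node = v1.read_node("domainmanagers-1")
ready = next((c.status for c in node.status.conditions if c.type == "Ready"), "Unknown")
print(f"domainmanagers-1 Ready={ready}")

pods = v1.list_pod_for_all_namespaces(field_selector="spec.nodeName=domainmanagers-1")
for pod in pods.items:
    print(f"{pod.metadata.namespace}/{pod.metadata.name}  phase={pod.status.phase}")
```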
Collectors
Collectors are distributed across the domainmanagers nodes available in the deployment footprint (1, 2, or 3 domainmanagers nodes may exist based on the selected deployment). If you need to know where a specific collector is running, go to Operational UI > Administration > Data Collector > Edit collector and check the Node IP field.
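The same information (pod name, node, and node IP) can also be read programmatically from the Kubernetes API. The sketch below uses the Python client; the label selector is hypothetical and must be replaced with the labels your collector pods actually carry, so the UI path above remains the documented way to look this up.

```python
# Minimal sketch: list collector pods and the node / host IP they are running on.
# The label selector below is hypothetical -- substitute the real labels used by
# the collector pods in your deployment.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_pod_for_all_namespaces(
    label_selector="app.kubernetes.io/component=collector"  # assumption, not a documented label
)
for pod in pods.items:
    print(f"{pod.metadata.name}  node={pod.spec.node_name}  host_ip={pod.status.host_ip}")
```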
Postgres
The Postgres database is deployed in HA mode: primary and replica instances run on two different nodes and can float across the available elasticworkers nodes. If the primary instance goes down or becomes unavailable, the replica instance is promoted to primary.
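To see which elasticworkers nodes currently host the Postgres primary and replica pods, you can list them through the Kubernetes API. In the sketch below the namespace and the "app=postgres" label are assumptions; how the primary/replica role is surfaced (usually as an extra pod label) depends on the Postgres setup and is not assumed here.

```python
# Minimal sketch: list the Postgres pods and the nodes they are scheduled on.
# Namespace and label selector are assumptions; inspect the printed labels to
# see how your deployment marks the primary and replica roles.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(namespace="default", label_selector="app=postgres")
for pod in pods.items:
    print(f"{pod.metadata.name}  node={pod.spec.node_name}  labels={pod.metadata.labels}")
```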
See the following table for the High Availability (HA) enabled services:
| Service/Components | Type of HA | Number of HA Nodes | Number of Failures Tolerated | One Instance Down | VM Down |
| --- | --- | --- | --- | --- | --- |
| Web UI (NGINX) | Kubernetes Replicas | At least 2 and at most N, where N equals the number of nodes | N-1, where N equals the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
| Reporting (Grafana) | Kubernetes Replicas | At least 2 and at most N, where N equals the number of nodes | N-1, where N equals the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
| Kafka | Kubernetes StatefulSet | 3+ | Tolerated as long as the number of active nodes >= the default replication factor | Instance will be restarted on another node | Instance will be restarted on another node |
| Zookeeper | Kubernetes StatefulSet | 3+ | (n-1)/2 node failures tolerated to maintain a quorum, where n is the number of nodes in the cluster | Instance will be restarted on another node | Instance will be restarted on another node |
| ElasticSearch | Cluster with auto replication and leader election (Kubernetes StatefulSet configuration) | 3 | 1 node (with replication set to > 1) | Yes: k8s will try to restart the instance to bring it back to the desired number of replicas | Yes: the worker VM will rejoin the cluster upon restart, and the node will rejoin the ES cluster |
| Collector Manager | Kubernetes Replicas | 2 nodes, 2 active instances running | N-1, where N equals the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
| NGINX | Kubernetes Replicas | 2 nodes, 2 active instances running | N-1, where N equals the number of available replicas | Another replica will be running on another node | Another replica will be running on another node |
| Netflow collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
| Metric service | Kubernetes Replicas | 2 replicas run on nodes with nodeType application (arangoworker and kafkaworker) | 1 pod or 1 node (running the Metric service) failure | Metric processing will be done by the running instance of the Metric service. k8s will create a new pod for the Metric service if an Arango/Kafka worker node is available and does not have the Metric service running on it. | The remaining replicas will take over the workload. If there are nodes that allow the Metric service and don't have it running, k8s will create a new pod on such a node. |
| Data management service | Kubernetes failover | 1 instance; failover handled by k8s | There is a short period of downtime, which is acceptable considering the data management service function; the failure is handled by the k8s failover mechanism. | The pod failure downtime is about 30 seconds; after the downtime, k8s will create a new pod. | The node failure downtime is about 10 minutes; after the downtime, k8s will create a new service pod on another node. |
| API service | Kubernetes Replicas | 2 replicas run on nodes with nodeType application (arangoworker and kafkaworker) | 1 pod or 1 node (running the API service) failure | The remaining replicas will take over the workload. If there are nodes that allow the API service and don't have it running, k8s will create a new pod on such a node. | The remaining replicas will take over the workload. If there are nodes that allow the API service and don't have it running, k8s will create a new pod on such a node. |
| Viptela SD-WAN collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
| Velocloud SD-WAN collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
| Cisco ACI collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
| Kafka collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
| Vims Clearwater collector | Kubernetes Replicas managed by Collector-Manager | 2 nodes, although only 1 instance will be running at a time | N, where N equals the number of available replicas | Another replica will be scheduled on another node | Another replica will be scheduled on another node |
| Postgres | Kubernetes StatefulSets; one node runs the Postgres primary and the second runs the Postgres replica | At least 2 and at most N, where N equals the number of nodes | N-1, where N equals the number of available replicas | Another replica will be running on another node. If the primary instance is down, Postgres will promote the replica instance to primary. | Another replica will be running on another node. If the VM where the primary instance is running goes down, Postgres will promote the replica instance to primary. |
| Event Service | Kubernetes Replicas | 2 replicas run on nodes (kafkaworker/arangoworker) | 1 pod or 1 node (running the Event service) failure | Event processing and REST handling will be done by the running instance of the Event service. k8s will create a new pod for the Event service if an Arango/Kafka worker node is available and does not have the Event service running on it. | Event processing and REST handling will be done by the running instance of the Event service. k8s will create a new pod for the Event service if an Arango/Kafka worker node is available and does not have the Event service running on it. |
| Persistence Service | Kubernetes Replicas | 2 replicas run on nodes (kafkaworker/arangoworker) | 1 pod or 1 node (running the Persistence service) failure | Processing will be done by the running instance of the Persistence service. k8s will create a new pod for the Persistence service if an Arango/Kafka worker node is available and does not have the Persistence service running on it. | Processing will be done by one of the running instances of the Persistence service. k8s will create a new pod for the Persistence service if an Arango/Kafka worker node is available and does not have the Persistence service running on it. |
| DM Adapter | Kubernetes Replicas | 2 replicas run on nodes (kafkaworker/arangoworker) | 1 pod or 1 node (running the DM Adapter service) failure | REST requests will be handled by the running instance of the DM Adapter service. k8s will create a new pod for the DM Adapter service if an Arango/Kafka worker node is available and does not have the DM Adapter service running on it. | REST handling will be done by the running instance of the DM Adapter service. k8s will create a new pod for the DM Adapter service if an Arango/Kafka worker node is available and does not have the DM Adapter service running on it. |
| Catalog Service | Kubernetes Replicas | 2 replicas run on nodes (kafkaworker/arangoworker) | 1 pod or 1 node (running the Catalog service) failure | REST requests will be handled by the running instance of the Catalog service. k8s will create a new pod for the Catalog service if an Arango/Kafka worker node is available and does not have the Catalog service running on it. | REST handling will be done by the running instance of the Catalog service. k8s will create a new pod for the Catalog service if an Arango/Kafka worker node is available and does not have the Catalog service running on it. |
| ESDBProxy | Kubernetes Replicas | 2 replicas run on nodes with nodeType application (kafkaworker/arangoworker) | 1 pod or 1 node (running the ESDBProxy service) failure | The available replica will take over the workload. If there are nodes that allow the ESDBProxy service and don't have it running, k8s will create a new pod on such a node. | The remaining replicas will take over the workload. If there are nodes that allow the ESDBProxy service and don't have it running, k8s will create a new pod on such a node. |
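The Zookeeper row above relies on the standard quorum rule: an ensemble of n nodes keeps a quorum as long as no more than (n-1)/2 of them fail. A short worked example of that arithmetic follows; the 3-node figure comes from the table, and the larger ensemble sizes are shown only for illustration.

```python
# Worked example of the quorum arithmetic from the table: an ensemble of n nodes
# tolerates floor((n - 1) / 2) node failures while still maintaining a quorum.
def tolerated_failures(n: int) -> int:
    """Number of node failures a ZooKeeper ensemble of size n can tolerate."""
    return (n - 1) // 2

for n in (3, 5, 7):
    print(f"{n}-node ensemble tolerates {tolerated_failures(n)} failure(s)")
# 3-node ensemble tolerates 1 failure(s)
# 5-node ensemble tolerates 2 failure(s)
# 7-node ensemble tolerates 3 failure(s)
```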