To ensure optimal Kubernetes cluster performance, follow the best practices described in this section.

Setting Adequate Quotas in OpenStack

For an OpenStack provider, set quotas that are large enough to accommodate a large cluster.

Table 1. Sample Commands

Command
  nova quota-update --key-pairs 500 \
    --instances 500 \
    --cores 4000 \
    --ram 12288000 <tenant_ID>
Description
  Sets quotas for a 500-node cluster, where each node has 8 vCPUs and 24 GB of RAM.

Command
  neutron quota-update --tenant-id <tenant_ID> \
    --pool 300 \
    --port 1000 \
    --loadbalancer 300 \
    --floatingip 150
Description
  Allocates Neutron quota according to your network. The port quota must be greater than the number of instances plus the number of load balancers.

Command
  cinder quota-update --volumes 500 \
    --gigabytes 5000 <tenant_ID>
Description
  Allocates Cinder quota according to the number of persistent volumes that you want to create.
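The core and RAM figures in the sample commands follow directly from the per-node sizing. A quick sanity check, assuming 500 nodes with 8 vCPUs and 24 GB of RAM each (the Nova --ram quota is expressed in MB):

```shell
# Derive the quota values for a 500-node cluster (8 vCPUs, 24 GB RAM per node).
NODES=500
VCPUS_PER_NODE=8
RAM_GB_PER_NODE=24

CORES=$(( NODES * VCPUS_PER_NODE ))            # vCPU quota
RAM_MB=$(( NODES * RAM_GB_PER_NODE * 1024 ))   # Nova --ram is in MB

echo "cores=$CORES ram=$RAM_MB"
# Matches the sample command: --cores 4000 --ram 12288000
```

If you plan a different node size or count, substitute your own values before running the quota-update commands.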

Best Practices for Creating Large Clusters

To create a large cluster, a best practice is to first create a small cluster, then scale it out. For example, to create a stable 500-node cluster, start by creating a 30-node cluster, then scale it out with a maximum of 30 nodes at a time until you reach 500 nodes.
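The incremental approach can be scripted. The following sketch only plans the batch boundaries; the actual vkube invocation is left as a comment because the flag that sets the target node count is not documented here (only the vkube cluster scaleout command and its --skip-refresh option are):

```shell
# Sketch: plan scale-out batches for growing a cluster from 30 to 500 nodes,
# adding at most 30 nodes per step.
TARGET=500
BATCH=30
current=30   # size of the initial small cluster

while [ "$current" -lt "$TARGET" ]; do
  next=$(( current + BATCH ))
  if [ "$next" -gt "$TARGET" ]; then
    next=$TARGET
  fi
  echo "scale out from $current to $next nodes"
  # vkube cluster scaleout ...   # supply your cluster name and target size here
  current=$next
done
echo "final size: $current"
```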

Tips:
  • If your cluster is larger than 200 nodes, you might see RPC timeouts in the OpenStack service logs. If that occurs, increase the RPC timeout setting for the affected services. For example, for the Nova service, increase the value of the rpc_response_timeout configuration option in the nova.conf file.
  • Refreshing the status of created resources can be slow when you scale out a cluster. Add the --skip-refresh option to the vkube cluster scaleout command to decrease the deployment time. With this option, the scale-out operation does not check the state of existing resources such as VMs or load balancers, and assumes that they were successfully created.
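As a concrete example of the timeout change, the option lives in the [DEFAULT] section of nova.conf. The value 180 seconds below is an illustrative assumption, as is the availability of the crudini tool; you can equally edit the file by hand and restart the Nova services:

```shell
# Raise Nova's RPC timeout. Edit /etc/nova/nova.conf so that the
# [DEFAULT] section contains:
#
#   [DEFAULT]
#   rpc_response_timeout = 180
#
# With crudini (if installed), the same change in one command:
sudo crudini --set /etc/nova/nova.conf DEFAULT rpc_response_timeout 180
```

Restart the Nova services after changing the option so that it takes effect.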

Managing High CPU Usage with an OpenStack Provider

If you are using VMware Integrated OpenStack deployed in compact mode as your OpenStack provider, you might notice high CPU usage on the controller or compute service VMs. If so, increase the number of vCPUs to 16 per VM.

Addressing Cluster Scale Out Failures

Attempts to scale out your cluster can fail. Depending on the error message that appears, take the appropriate action before retrying the scale out command.

Error
  vSphere error
Action
  A worker node VM that is powered off or absent can cause a vSphere error. The instance status in OpenStack appears as "not Active".

  1. Delete the abnormal instances from OpenStack:

     openstack server delete <server_id>

  2. Rerun the command to scale out the cluster without the --skip-refresh option.

Error
  configuration error
Action
  If the instance status appears as "Active", the failure indicates a software configuration error.

  Rerun the command to scale out the cluster with the --skip-refresh option to save deployment time.

  If the scale out continues to fail, retry with fewer worker nodes.
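The deletion step for abnormal instances can be done in bulk with the openstack CLI. This sketch assumes the stuck instances report the ERROR state (powered-off VMs may instead show SHUTOFF); always review the list before deleting:

```shell
# List instances that are not Active (here: ERROR state).
# Review this output before running the delete loop below.
openstack server list --status ERROR

# Delete each instance in the ERROR state.
for id in $(openstack server list --status ERROR -f value -c ID); do
  openstack server delete "$id"
done
```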

Alternatives to Load Balancing with NSX-V Backing

When you create services in Kubernetes and specify the type as LoadBalancer, an NSX Edge load balancer is deployed for each service. The load balancer distributes traffic across the Kubernetes worker nodes, up to a maximum of 32 pool members. If your Kubernetes cluster includes more than 32 worker nodes, use the Kubernetes Ingress resource instead.
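As an illustration of the Ingress alternative, the following sketch exposes a service through an Ingress rule instead of type LoadBalancer. The host name, service name ("web"), port, and the extensions/v1beta1 API version are placeholder assumptions; adjust them for your cluster and Kubernetes version, and note that an Ingress controller must be running in the cluster:

```shell
# Illustrative only: route external traffic to a ClusterIP service ("web")
# through an Ingress rule rather than a per-service NSX Edge load balancer.
kubectl apply -f - <<'EOF'
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: web-ingress
spec:
  rules:
  - host: web.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: web
          servicePort: 80
EOF
```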

Persistent Volume Claim Management

If you create many persistent volume claims and associated pods in parallel, dynamic provisioning of persistent volumes might fail. If the OpenStack service logs show that the failures are due to RPC timeouts, increase the RPC timeout setting for the affected services. For example, for the Nova service, increase the value of the rpc_response_timeout configuration option in the nova.conf file.
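Besides raising the timeout, you can reduce RPC pressure by creating claims in small batches instead of all at once. A hypothetical sketch; the claim name prefix, total count, batch size, pause length, and storage size are all placeholder assumptions:

```shell
# Create 100 PVCs in batches of 10, pausing between batches so that
# in-flight volume provisioning can complete before more requests arrive.
TOTAL=100
BATCH=10

for i in $(seq 1 "$TOTAL"); do
  kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim-$i
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
EOF
  if [ $(( i % BATCH )) -eq 0 ]; then
    sleep 30   # let pending volume creates finish before the next batch
  fi
done
```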