This topic describes how and when to scale BOSH jobs in CAPI, and includes details about some key metrics, heuristics, and logs.
To inform some scaling decisions, check current CPU values by running bosh instances --vitals
. The CPU User value corresponds with the system.cpu.user
metric and is scaled by the number of CPUs. For example, on a 4-core api
VM, a cloud_controller_ng
process that is using 100% of a core is listed as using 25% in the system.cpu.user
metric.
The cloud_controller_ng
Ruby process is the primary job in CAPI. It, along with nginx_cc
, powers the Cloud Controller API that all users of Cloud Foundry interact with. In addition to serving external clients, cloud_controller_ng
also provides APIs for internal components within Cloud Foundry, such as Loggregator and Networking subsystems.
The cloud_controller_ng
Ruby process is the primary job in CAPI. It, along with nginx_cc
, powers the Cloud Controller API that all users of VMware Tanzu Application Service for VMs (TAS for VMs) interact with. In addition to serving external clients, cloud_controller_ng
also provides APIs for internal components within TAS for VMs, such as Loggregator and Networking subsystems.
The cloud_controller_ng
web server can either be Thin (Default) or Puma (Experimental). The characteristics of these web servers differ significantly. The Thin web server runs on a single process per VM, therefore scaling should be done horizontally by adding more VMs. Puma allows for multiple processes to run per VM, therefore scaling can be both vertical or horizontal.
When you are determining whether to scale cloud_controller_ng
, look for the following:
Cloud Controller emits the following metrics:
cc.requests.outstanding.gauge
or cc.requests.outstanding
(deprecated)
Puma Workers x Puma Max Threads
.system.cpu.user
system.cpu.user / (1 / Number of CPU Cores)
.cc.vitals.cpu_load_avg
cc.vitals.uptime
is consistently low, indicating frequent restarts (possibly due to memory pressure).cc_db_connection_pool_timeouts_total
indicates that the DB connection pool is too small which caused threads to wait and finally fail a request because no connection could be acquired within the timeout.cc_db_connection_wait_duration_seconds
indicates that threads have to wait for free DB connections.Note The above guidelines for Puma assume that the number of Puma workers is configured to be the same as the number of CPU cores on the VM - see Scaling Puma Web Server below. If you have configured Puma workers to be different than the number of CPU cores, you should adjust threshold calculations accordingly.
The following behaviors may occur:
bosh is --ps --vitals
has elevated CPU usage for the cloud_controller_ng
job in the API instance group.You can find the above heuristic failures in the following log files:
/var/vcap/sys/log/cloud_controller_ng/cloud_controller_ng.log
/var/vcap/sys/log/cloud_controller_ng/nginx-access.log
Before and after you scale Cloud Controller API VMs, verify that the Cloud Controller database is not overloaded. All Cloud Controller processes are backed by the same database, so heavy load on the database impacts API performance regardless of the number of Cloud Controllers deployed. Cloud Controller supports both PostgreSQL and MySQL, so there is no specific scaling guidance for the database.
In TAS for VMs deployments with internal MySQL clusters, a single MySQL database VM with CPU usage over 80% can be considered overloaded. When this happens, the MySQL VMs must be scaled up to prevent the added load of additional Cloud Controllers exacerbating the issue.
Note Since Cloud Controller supports both PostgreSQL and MySQL external databases, there is no absolute guidance about what a healthy database looks like. In general, high database CPU utilization is a good indicator of scaling issues, but always defer to the documentation specific to your database.
When deployed with the Thin web server, Cloud Controller API VMs should primarily be scaled horizontally. Scaling up the number of cores on a single VM is not effective. This is because Thin operates in a single process and Ruby’s Global Interpreter Lock (GIL) limits the cloud_controller_ng
process so that it can only effectively use a single CPU core on a multi-core machine.
When deployed with the Puma web server, Cloud Controller API VMs can be scaled both vertically and horizontally. Puma supports multiple processes per VM and so can scale with the number of CPU cores available.
Puma web server exposes the following configuration options. For more information see Cloud Controller Configuration:
cc.puma.workers
- Number of workers for Puma webserver.cc.puma_max_threads
- Maximum number of threads per Puma webserver worker.cc.puma.max_db_connections_per_process
- Maximum database connections for Puma per process (main + Puma workers), if not set the ccng value is used (default).The following are loose recommendations for setting these parameters:
Number of VMs x Number of Workers x Max DB Connections Per Process
. For example, if you had 2 VMs with 4 workers and 10 max DB connections per process, you may have up to 80 DB connections. Be careful this does not exeed any Database limit.If when switching to Puma you reduce the number of overall Cloud Controller API VMs, you may need to increase the number of Cloud Controller Local workers per VM to handle the increased load on the remaining API VMs. For more information, see cloud_controller_worker_local.
If there are any threads which had to wait for a free DB connection (indicated by cc_db_connection_wait_duration_seconds
) it might be worth to increase the DB connection pool if possible. Threads waiting on free DB connections will cause delays in the web server’s responses. * Thin: Connection pool size can be configured with: ccdb.max_connections
* Puma: Connection pool size per process (main + workers) can be configured with: cc.puma.max_db_connections_per_process
In case threads failed to acquire a free DB connection within the configured timeout, cc_db_connection_pool_timeouts_total
will be > 0. Then either the connection pool size or the number of API VMs should be increased.
See Cloud Controller Multi-Process Mode (Puma) for more information on the differences between using Thin and Puma.
This job, also called “local workers”, is primarily responsible for handling files uploaded to the API VMs during cf push
, such as packages
, droplets
, and resource matching.
When determining whether to scale cloud_controller_worker_local
, look for the following:
Cloud Controller emits the following metrics:
cc.job_queue_length.cc-VM_NAME-VM_INDEX
is continuously growing.cc.job_queue_length.total
is continuously growing.The following behaviors may occur:
cf push
is intermittently failing.cf push
average time is elevated.You can find the heuristic failures in the following log files:
/var/vcap/sys/log/cloud_controller_ng/cloud_controller_ng.log
Because local workers are located with the Cloud Controller API job, they are scaled horizontally along with the API.
Colloquially known as “generic workers” or just “workers”, this job and VM are responsible for handling asynchronous work, batch deletes, and other periodic tasks scheduled by the cloud_controller_clock
.
When determining whether to scale cloud_controller_worker
, look for the following:
Cloud Controller emits the following metrics:
cc.job_queue_length.cc-VM_TYPE-VM_INDEX
(ie. cc.job_queue_length.cc-cc-worker-0
) is continuously growing.cc.job_queue_length.total
is continuously growing.The following behaviors may occur:
cf delete-org ORG_NAME
appears to leave its contained resources around for a long time. Also, the command fails if the -org ORG_NAME
contains shared private domains.You can find the above heuristic failures in the following log files:
/var/vcap/sys/log/cloud_controller_worker/cloud_controller_worker.log
The cc-worker VM can safely scale horizontally in all deployments, but if your worker VMs have CPU/memory headroom, you can also use the cc.jobs.generic.number_of_workers
BOSH property to increase the number of worker processes on each VM.
The cloud_controller_clock
job runs Diego sync process and schedules periodic background jobs. The cc_deployment_updater
job is responsible for handling v3 rolling app deployments. For more information, see Rolling App Deployments.
When determining whether to scale cloud_controller_clock
and cc_deployment_updater
, look for the following:
Cloud Controller emits the following metrics:
cc.Diego_sync.duration
is continuously increasing over time.system.cpu.user
is high on the scheduler VM.The following behaviors may occur:
You can find the above heuristic failures in the following log files:
/var/vcap/sys/log/cloud_controller_clock/cloud_controller_clock.log
/var/vcap/sys/log/cc_deployment_updater/cc_deployment_updater.log
Both of these jobs are singletons, so extra instances are for failover HA rather than scalability. Performance issues are likely due to database overloading or greedy neighbors on the scheduler VM.
This is the internal WebDAV blobstore that comes included with TAS for VMs by default. It stores packages
, staged droplets
, buildpacks
, and cached app resources. Files are typically uploaded to the internal blobstore through the Cloud Controller local workers and downloaded by Diego when app instances are started.
When determining whether to scale blobstore_nginx
, look for the following:
Cloud Controller emits the following metrics:
system.cpu.user
is consistently high on the singleton-blobstore
VM.system.disk.persistent.percent
is high, indicating that the blobstore is running out of room for additional files.The following behaviors may occur:
cf push
is intermittently failing.cf push
average time is elevated.You can find the above heuristic failures in the following log files:
/var/vcap/sys/log/blobstore/internal_access.log
The internal WebDAV blobstore cannot be scaled horizontally, not even for availability purposes, because of its reliance on the singleton-blobstore
VM’s persistent disk for file storage. For this reason, it is not recommended for environments that require high availability. For these environments, you should use an external blobstore. For more information, see Cloud Controller blobstore configuration and the Blob storage section of the High Availability in Cloud Foundry topic.
The internal WebDAV blobstore can be scaled vertically, so scaling up the number of CPUs or adding faster disk storage can improve the performance of the internal WebDAV blobstore if it is under high load.
High numbers of concurrent app container starts on Diego can cause stress on the blobstore. This typically happens during upgrades in environments with a large number of apps and Diego Cells. If vertically scaling the blobstore or improving its disk performance is not an option, limiting the max number of concurrent app container starts can mitigate the issue. For more information, see the starting_container_count_maximum section in the auctioneer job from diego/2.29.0 topic in the BOSH documentation.