You can monitor your vSphere Bitfusion environment by using the vSphere Bitfusion plug-in, the CLI, or monitoring plug-ins. You can also download monitoring data for your cluster, servers, and clients.

How to download vSphere Bitfusion monitoring data

You can download monitoring data for your vSphere Bitfusion cluster, servers, and clients in the vSphere Bitfusion plug-in.

By exporting monitoring data, you can use external tools to review and troubleshoot your vSphere Bitfusion environment. The Download CSV button on each tab in the vSphere Bitfusion plug-in provides a different monitoring data set. You can download monitoring data for the past five minutes, one hour, 24 hours, or 30 days.

Procedure

  1. In the vSphere Client, select Menu (vSphere Client menu icon) > Bitfusion.
  2. Select a time period for the monitoring data.
  3. Download the required monitoring data.
    Option          Action
    Cluster data    To save the cluster GPU allocation data, on the Cluster tab, click Download CSV.
    Servers data    To save the data that is displayed for the selected server and pane, on the Servers tab, click Download CSV.
    Clients data    To save the data that is displayed for the selected client and pane, on the Clients tab, click Download CSV.
  4. (Optional) Select a location for the .csv file on your local machine.
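
As a first step with external tools, you might inspect a downloaded file from a shell. The following is a minimal sketch, not part of the product: the helper names csv_columns and csv_row_count are hypothetical, and the split on commas is naive (it assumes that header fields contain no quoted commas). The column names depend on the tab you downloaded from and are not assumed here.

```shell
#!/usr/bin/env bash

# csv_columns FILE: print each column name of the header row on its own line.
# Naive comma split; strips a trailing carriage return if the file has CRLF endings.
csv_columns() {
    head -n 1 "$1" | tr -d '\r' | tr ',' '\n'
}

# csv_row_count FILE: print the number of lines after the header row.
csv_row_count() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
}
```

For example, csv_columns cluster.csv lists the columns of a file saved as cluster.csv (a hypothetical file name).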

Monitor vSphere Bitfusion in the vSphere Bitfusion plug-in

You can view IP addresses, host names, GPU allocation, memory use, and other data of your vSphere Bitfusion cluster, servers, and clients in the vSphere Bitfusion plug-in.

How to monitor your vSphere Bitfusion cluster

You can use the vSphere Bitfusion plug-in to view the following data for your cluster.
  • The IP address of the primary vSphere Bitfusion server. The vSphere Bitfusion plug-in uses the IP for communication.
  • The allocation history of GPUs, shown in the Cluster GPU Allocation chart. The chart covers a time range from the last 5 minutes to the last 30 days and shows the number of GPUs in the cluster and the number of GPUs allocated from all vSphere Bitfusion servers.
  • All vSphere Bitfusion servers in the vSphere Bitfusion cluster, including servers that have been deactivated or powered off, shown in the Servers table. Each entry displays a host name, IP address, and the number of the allocated GPUs.
  • All vSphere Bitfusion clients that have run applications on the vSphere Bitfusion servers, shown in the Clients table. Each entry lists a host name, ID, and the number of GPUs currently allocated to the client.

How to monitor your vSphere Bitfusion servers

You can use the vSphere Bitfusion plug-in to view the following data for your servers.
  • All vSphere Bitfusion servers in the vSphere Bitfusion cluster, shown in the Servers table. You can select any server to display the server details. The table displays each server’s host name, IP address, current GPU allocation, and the current health state.
  • A heat map with an entry for each GPU on the server, shown in the Allocation chart. Each cell displays by intensity of color how engaged the GPU is during the selected time interval. The level of engagement is a weighted sum of memory allocation and CUDA cell use.
  • Memory and core use charts, one pair for each GPU. The Memory charts also show the memory capacity.
  • The outgoing and incoming traffic for each network interface.

How to monitor vSphere Bitfusion clients

You can use the vSphere Bitfusion plug-in to view the following data for your clients.

  • All vSphere Bitfusion clients in the vSphere Bitfusion cluster, shown in the Clients table. A new entry appears on the list after a new client runs a vSphere Bitfusion command that requires a server connection for the first time. You can select a client to display the client details. The table displays each client’s host name, ID, current GPU allocation, and version.
  • The GPUs that are allocated to a client, shown in the GPU Assignment chart. A client can run multiple applications, each allocating separate GPUs, but they are displayed together. Allocations of partial GPUs add the fractional value to the sum.

Monitor vSphere Bitfusion in the CLI

By using CLI commands, you can check the shadow memory of a vSphere Bitfusion client, the MTU size of your network, and the network interfaces for error statistics and dropped packet counts.

Shadow memory check

The vSphere Bitfusion client uses a part of its memory space as a shadow memory of the allocated remote GPU memory. The precise amount of memory required on the client host varies between applications. The shadow memory check determines whether the host's memory is at least as large as the GPU memory. For more information about memory requirements, see the System Requirements for vSphere Bitfusion topic in the Installing VMware vSphere Bitfusion guide.

You can see the amount of memory on your client from the MemTotal line of the pseudo file /proc/meminfo. To calculate the GPU memory, from a GPU server, you can run the bitfusion smi or nvidia-smi command, and add up the memory sizes of all GPUs.

You can add more memory to the vSphere Bitfusion client to meet the requirement. Alternatively, when you run applications, do not allocate more GPUs than you can shadow in the memory of the vSphere Bitfusion client.
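
The comparison described above can be sketched as a few shell helpers. This is an illustration under stated assumptions, not the check that vSphere Bitfusion itself runs: the function names are hypothetical, and the GPU sizes are expected as one value in MiB per line, which is what nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits prints on the GPU server.

```shell
#!/usr/bin/env bash

# mem_total_mib [FILE]: print MemTotal in MiB from a meminfo-style file.
# /proc/meminfo reports MemTotal in kB, so divide by 1024.
mem_total_mib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# sum_gpu_mib: read one GPU memory size (MiB) per line on stdin, print the sum.
sum_gpu_mib() {
    awk '{ total += $1 } END { printf "%d\n", total }'
}

# shadow_memory_ok CLIENT_MIB GPU_MIB: succeed when the client memory
# is at least as large as the total GPU memory.
shadow_memory_ok() {
    [ "$1" -ge "$2" ]
}
```

On the GPU server, you might pipe the nvidia-smi query into sum_gpu_mib, then on the client run shadow_memory_ok "$(mem_total_mib)" GPU_TOTAL, where GPU_TOTAL is the sum you obtained.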

MTU size check

vSphere Bitfusion performance relies on a healthy, low-latency, and high-speed network. Applications perform better when they send a few large packets instead of many small packets. The maximum transmission unit (MTU) check determines whether you have a large (≥4K) MTU setting for all high-speed (≥10 Gbps) interfaces. Ignore this check for interfaces you do not use with vSphere Bitfusion.

Note: For best performance of applications running under vSphere Bitfusion, set the MTU to 4096 or higher and set vSphere Bitfusion clients to match the MTU size of the deployed vSphere Bitfusion servers. If the MTU is above 1500, enable jumbo frames in the network switches.
To obtain and set the MTU size, see the following examples.
  • To check the MTU size, you can run the ifconfig command.
  • To change the MTU size on network interface enp175s to 4096 bytes, you can run ifconfig enp175s mtu 4096.
For more information on MTUs, see Determine maximum MTU.
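
A similar check can be sketched by reading the mtu and speed files that Linux exposes for each interface under /sys/class/net. The sketch below is an assumption-laden illustration, not the vSphere Bitfusion check itself: the function name check_mtu is hypothetical, the thresholds default to the 4096-byte and 10 Gbps values mentioned above, and interfaces whose speed cannot be read are skipped.

```shell
#!/usr/bin/env bash

# check_mtu [ROOT] [MIN_MTU] [MIN_SPEED_MBPS]
# Report the MTU of every interface at or above the speed threshold.
check_mtu() {
    local root="${1:-/sys/class/net}" min_mtu="${2:-4096}" min_speed="${3:-10000}"
    local dir iface mtu speed
    for dir in "$root"/*/; do
        [ -r "${dir}mtu" ] || continue
        iface="$(basename "$dir")"
        mtu="$(cat "${dir}mtu")"
        # speed is in Mb/s; it can be unreadable (for example, loopback) or -1.
        speed="$(cat "${dir}speed" 2>/dev/null || echo 0)"
        [ "$speed" -ge "$min_speed" ] || continue
        if [ "$mtu" -ge "$min_mtu" ]; then
            echo "$iface: MTU $mtu OK"
        else
            echo "$iface: MTU $mtu is below $min_mtu"
        fi
    done
}
```

Running check_mtu with no arguments inspects the interfaces of the local machine. The root directory is a parameter only so that the function can be exercised against a test directory.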

Network errors check

You can check the network interfaces for error statistics and dropped packet counts. The files are in the following locations.

/sys/class/net/<interface>/statistics/*errors

/sys/class/net/<interface>/statistics/*dropped

If your network is healthy, the error count between the checks does not increase, new error messages do not occur, and no packets are dropped. The files are zeroed out only after a reboot.
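
Because the counters only reset on reboot, a useful check compares two snapshots taken some time apart. The following sketch shows one way to do that. The function names net_counters and counters_grown are hypothetical, and the snapshot format (counter name, space, value) is a choice made for this example.

```shell
#!/usr/bin/env bash

# net_counters IFACE [ROOT]: print "name value" for every *errors and
# *dropped statistics file of the interface.
net_counters() {
    local iface="$1" root="${2:-/sys/class/net}" f
    for f in "$root/$iface"/statistics/*errors "$root/$iface"/statistics/*dropped; do
        [ -r "$f" ] || continue
        printf '%s %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# counters_grown BEFORE AFTER: print the counters whose value increased
# between two snapshot files. Assumes the BEFORE file is not empty.
counters_grown() {
    awk 'NR == FNR { before[$1] = $2; next }
         ($1 in before) && ($2 + 0 > before[$1] + 0) {
             print $1, before[$1], "->", $2
         }' "$1" "$2"
}
```

For example, save net_counters enp175s to one file, wait while your application runs, save a second snapshot, and pass both files to counters_grown. No output means that no error or drop counter increased in between.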

Monitor vSphere Bitfusion by using monitoring plug-ins

Since vSphere Bitfusion 4.0, by using a monitoring application and configuring monitoring plug-ins, you can observe detailed information of the virtual machines (VMs) of your vSphere Bitfusion servers and related services. For example, you can periodically check the load averages and disk space usage of the server VMs.

What is the Monitoring Plug-ins package

Typically, by using a monitoring application on a VM, you can run multiple plug-ins to determine the status of hosts and services in your environment. Since vSphere Bitfusion 4.0, you can monitor the vSphere Bitfusion servers in your cluster by using the Monitoring Plug-ins package, which is pre-installed on all vSphere Bitfusion servers. The package contains more than fifty standard plug-ins for monitoring applications, such as Icinga, Naemon, Nagios, Shinken, and Sensu. Each plug-in is a standalone command-line tool that performs a specific type of check.

Some plug-ins perform local checks of your system metrics, such as load averages, processes, or disk space usage, while others use various network protocols, such as ICMP, SNMP, and HTTP, to perform checks remotely. For more information, see the Monitoring Plugins Project documentation.

How to configure the Monitoring Plug-ins package in vSphere Bitfusion

To use the plug-ins in the Monitoring Plug-ins package and run check commands in your monitoring application, you must first configure the check_by_ssh plug-in of your monitoring application to securely connect to the VMs of your vSphere Bitfusion servers without using a password. Then add the VMs to the list of monitored hosts and add the specific monitoring checks.

  1. Connect to the VM of your vSphere Bitfusion server by using Secure Shell Protocol (SSH) and enable the password of the monitoring account, where IP_BF_VM is the IP of the VM of your vSphere Bitfusion server.
    export bfm_ip=IP_BF_VM
    ssh customer@$bfm_ip
    sudo passwd monitoring
  2. Copy the public key of your monitoring account to the authorized_keys file.

    scp ~/.ssh/id_rsa.pub monitoring@$bfm_ip:~/.ssh/authorized_keys

  3. In your monitoring application, add the vSphere Bitfusion server VM to the list of monitored hosts.
  4. From the VM of your monitoring application, run the ssh monitoring@$bfm_ip command.

    You must be able to log in to the vSphere Bitfusion server VM without using a password. If you are prompted for a password, the public key that is stored in the authorized_keys file is not the same as the public key of the monitoring account.

  5. To verify that the check_by_ssh plug-in and check commands are working, run the following command from the VM of your monitoring application.

    /usr/lib/nagios/plugins/check_by_ssh -H $bfm_ip -l monitoring -C '/usr/libexec/check_disk --units GB --critical 15 -p /'

    The command must report that the disk space is sufficient, because the disk has more than 15 GB of free space. For example, DISK OK - free space: / 36 GB (78% inode=98%);| /=10GB;;34;0;4.

  6. Add monitoring checks by using the check_by_ssh plug-in of your monitoring application.

    You can refer to the documentation of your monitoring application, such as Icinga, Naemon, Nagios, Shinken, or Sensu, for instructions about how to add monitoring checks by using the check_by_ssh plug-in.
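
When you verify a check by hand, as in step 5, it can help to see the exit code alongside the output. Monitoring plug-ins report their result through the exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. The wrapper below is a small sketch for manual testing only. The function names plugin_status and run_check are hypothetical, and treating every other exit code as UNKNOWN is an assumption of this example.

```shell
#!/usr/bin/env bash

# plugin_status CODE: map a plug-in exit code to its status name.
plugin_status() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;   # 3 is UNKNOWN; other codes are treated the same here
    esac
}

# run_check COMMAND [ARGS...]: run a check command, print the status name
# followed by the command output, and return the exit code of the command.
run_check() {
    local output rc
    output="$("$@" 2>&1)" && rc=0 || rc=$?
    printf '%s: %s\n' "$(plugin_status "$rc")" "$output"
    return "$rc"
}
```

For example, you might prefix the command from step 5 with run_check to confirm that a disk with enough free space yields the OK status.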