You can run an application using a GPU's entire memory or only a dedicated partition of that memory. vSphere Bitfusion can allocate a GPU, run an application, and deallocate the GPU with a single CLI command, or you can use individual commands to perform the same tasks.

How to run applications

The vSphere Bitfusion client can run artificial intelligence (AI) and machine learning (ML) applications on remote shared GPUs.

You use the run command to allocate GPUs for a single application. By default, the application runs in the entire memory of the GPUs. All GPUs that are requested by using the run command must be allocated from a single vSphere Bitfusion server, and the server must list the GPUs as separate devices with different PCIe addresses.

The run command performs the following three tasks.
  1. Allocates GPUs from the shared pool.
  2. Starts an application in an environment that can access the GPUs when the application makes CUDA calls.
  3. Deallocates the GPUs when the application closes.
You can start an application in vSphere Bitfusion by using the run command with the mandatory -n num_gpus argument, which sets the number of GPUs. To distinguish vSphere Bitfusion arguments from the application and its own arguments, use a double-hyphen separator or place the application within quotes.
  • bitfusion run -n num_gpus other switches -- applications and arguments
  • bitfusion run -n num_gpus other switches "applications and arguments"

For example, the AI application asimov_i.py requires two arguments: the number of GPUs and a batch size.

  • When the application expects 1 GPU, run bitfusion run -n 1 -- python asimov_i.py --num_gpus=1 --batchsz=64
  • When the application expects 2 GPUs, run bitfusion run -n 2 -- python asimov_i.py --num_gpus=2 --batchsz=64
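
Equivalently, you can quote the application and its arguments instead of using the double-hyphen separator. For example, the first command above can also be written as bitfusion run -n 1 "python asimov_i.py --num_gpus=1 --batchsz=64".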

How to run applications with partial GPUs

You can run your application in a dedicated partition of a GPU's memory, while other applications use the GPU's remaining memory.

The GPU partitioning arguments are optional arguments of the run command. You use them to run your application in a partition of a GPU's memory.

  • The GPU partitioning process is dynamic. When you start a run command with a partitioning argument, vSphere Bitfusion allocates a partition before the application runs and deallocates the partition afterwards.
  • Applications that share GPUs concurrently are isolated from each other by separate client processes, network streams, server processes, and memory partitions.
  • vSphere Bitfusion partitions only the memory of the GPU, not its compute resources. An application is strictly confined to its assigned memory partition, but it can access the complete compute resource if needed. When applications require the same compute units, they compete for compute resources; otherwise, they run concurrently.

You can specify the partition size in MB or as a fraction of the total GPU memory.

Partitioning GPU memory size by fraction (number > 0.0 and <= 1.0, for example, 0.37)

bitfusion run -n num_gpus -p gpu_fraction -- applications and arguments

Partitioning GPU memory size by MB

bitfusion run -n num_gpus -m MBs_per_gpu -- applications and arguments

GPU partitioning examples

Multiple concurrent applications might use a GPU's computational capacity more efficiently than a single application. There are several ways you can partition the memory of your GPUs.

If you are using inference applications with smaller model sizes or small batches of work, such as a small number of images, you can run the applications concurrently on partitioned GPUs.

You can perform empirical testing to understand the memory size an application requires. Some applications expand to use all available memory, but they might not achieve better performance beyond a certain threshold.

The following examples presume that you already know the application's memory requirements at different batch sizes.

  • When you expect that an application with a batch size of 64 requires no more than 66% of GPU memory, run bitfusion run -n 1 -p 0.66 -- python asimov_i.py --num_gpus=1 --batchsz=64
  • When you expect that an application with a batch size of 32 requires no more than 5461 MB of GPU memory, run bitfusion run -n 1 -m 5461 -- python asimov_i.py --num_gpus=1 --batchsz=32
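
As a sketch of the empirical testing described above, the following shell loop reruns a workload in successively smaller partitions and stops at the first failure. It assumes that the application exits with a nonzero status when it runs out of GPU memory, which is an assumption about the application rather than documented vSphere Bitfusion behavior.

  # Hedged sketch: find the smallest workable memory fraction for a workload.
  # Assumes the application exits nonzero on an out-of-memory failure.
  for frac in 1.0 0.75 0.5 0.33; do
    echo "Testing with fraction ${frac} of GPU memory"
    bitfusion run -n 1 -p "${frac}" -- python asimov_i.py --num_gpus=1 --batchsz=64 || break
  done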

When you request multiple GPUs, the same amount of memory is allocated on every GPU. A fraction specification must therefore be based on the GPU with the smallest amount of memory.

In the following example, the -p argument requests 33% of the memory of each of the two requested GPUs. The GPUs must physically reside on the same server. If the GPUs are 16 GB devices, or if the smallest GPU is a 16 GB device, approximately 5461 MB is allocated on each GPU. As long as no other applications are running, asimov_i.py can access the full compute power of the two GPUs.

Run bitfusion run -n 2 -p 0.33 -- python asimov_i.py --num_gpus=2 --batchsz=64

You can run multiple applications from a single client on the same GPU concurrently.

For example, to start two concurrent application instances in the background, run both of the following commands.
  1. bitfusion run -n 1 -p 0.66 -- python asimov_i.py --num_gpus=1 --batchsz=64 &
  2. bitfusion run -n 1 -p 0.33 -- python asimov_i.py --num_gpus=1 --batchsz=32 &
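
Because both instances are started in the background, a wrapper script can wait for them to finish before exiting. The following is a minimal sketch that combines the same two commands with the standard shell wait builtin.

  # Start two concurrent instances on partitions of the same GPU, then wait for both.
  bitfusion run -n 1 -p 0.66 -- python asimov_i.py --num_gpus=1 --batchsz=64 &
  bitfusion run -n 1 -p 0.33 -- python asimov_i.py --num_gpus=1 --batchsz=32 &
  wait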

NVIDIA System Management Interface (nvidia-smi)

You can run the NVIDIA System Management Interface (nvidia-smi) monitoring application, for example, to check your GPU partition size or to verify the resources available on a vSphere Bitfusion server. Typically, the application is installed on the server together with the NVIDIA driver.

Applications that run on vSphere Bitfusion clients do not require the NVIDIA driver, but might require the nvidia-smi application, for example, to query the capabilities of a GPU or to determine GPU memory sizing. To support such operations, starting with vSphere Bitfusion 3.0, the nvidia-smi application is provided on all vSphere Bitfusion clients. vSphere Bitfusion copies the application from the server to the client.

For example, to request and check a 1024 MB partition on a GPU, run bitfusion run -n 1 -m 1024 -- nvidia-smi.

The output of the nvidia-smi application displays the requested partition size of 1024 MiB.

Requested resources:
Server List: 172.16.31.241:56001
Client idle timeout: 0 min
Wed Sep 23 15:21:17 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:13:00.0 Off |                    0 |
| N/A   36C    P8     9W /  70W |      0MiB /  1024MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

How to run applications with reserved GPUs

You can allocate GPUs and run multiple applications on the same GPUs.

While the run command allocates GPUs, runs an application, and deallocates the GPUs as a single step, vSphere Bitfusion provides three individual commands that perform the same tasks.

  • To allocate GPUs, run the request_gpus command.
  • To start applications, run the client command.
  • To deallocate the GPUs, run the release_gpus command.

By using the individual commands, you can use the same GPUs for multiple applications and gain greater control when you integrate vSphere Bitfusion into other tools and workflows, such as the SLURM scheduling software.

The run command encapsulates the request_gpus, client, and release_gpus commands. The arguments of the run command are split between the request_gpus and client commands.
Note: The request_gpus command creates a file and environment variables that can be forwarded to other tools. These tools can run the client command with the same allocation configuration.

To understand the use of the individual commands, see the following example workflow, which uses the AI application asimov_i.py.

  1. To allocate GPUs for starting multiple, sequential applications, run bitfusion request_gpus -n 1 -m 5461.
    Requested resources:
    Server List: 172.16.31.241:56001
    Client idle timeout: 0 min
  2. To start an application by using the client command, run bitfusion client nvidia-smi.
    Wed Sep 23 15:26:02 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.100      Driver Version: 440.64.00    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            Off  | 00000000:13:00.0 Off |                    0 |
    | N/A   36C    P8    10W /  70W |      0MiB /  5461MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    
  3. To start another application sequentially on the same GPUs by running the client command, run bitfusion client -- python asimov_i.py --num_gpus=1 --batchsz=64.
  4. To deallocate the GPUs, run bitfusion release_gpus.
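
Putting the workflow together, a script along the following lines reserves a partition once, runs several applications against it, and releases it afterwards. This is a sketch that uses only the commands shown above; the second batch size is illustrative.

  #!/usr/bin/env bash
  # Reserve a 5461 MB partition on one GPU, run applications on it sequentially, then release it.
  bitfusion request_gpus -n 1 -m 5461
  bitfusion client -- nvidia-smi
  bitfusion client -- python asimov_i.py --num_gpus=1 --batchsz=64
  bitfusion client -- python asimov_i.py --num_gpus=1 --batchsz=32
  bitfusion release_gpus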

How to run applications on specific GPUs and servers

Starting with vSphere Bitfusion 4.0, you can use CLI command arguments to filter the GPUs in your resource pool and start your applications on a specific set of GPUs.

You can use the --filter argument with the run, request_gpus, and list_gpus commands to run the commands against a specific set of GPUs or servers. You can also combine filters to list servers and GPUs that satisfy multiple conditions, as shown in the final example below. For each data type, you must use an appropriate operator, such as <, >, >=, <=, =, or !=.

Table 1. List of Available GPU and Server Filters

  device.index (comma-separated integer list)
    The system index of a GPU. For example, --filter "device.index=1" or --filter "device.index=0,1,2,3". To see the indices of your GPUs, run the nvidia-smi command.
  device.name (string)
    The model name of a GPU device. For example, --filter "device.name=NVIDIA Tesla T4".
  device.memory (integer)
    The physical memory size of a GPU device in MB. For example, enter --filter "device.memory=16384" for a GPU device with a 16 GB memory size.
  device.capability (version)
    The CUDA compute capability of an NVIDIA device. The CUDA compute capability is a mechanism that NVIDIA uses with the CUDA API to specify the features that your GPUs support. The value must be entered in X or X.Y format. For example, --filter "device.capability=8.0". For more information, see the NVIDIA CUDA GPUs documentation.
  server.addr (string)
    The IP address of a vSphere Bitfusion server. For example, --filter "server.addr=10.202.8.209".
  server.hostname (string)
    The hostname of a vSphere Bitfusion server. For example, --filter "server.hostname=bitfusion-server-4.0.0-13".
  server.has-rdma (Boolean)
    Whether the vSphere Bitfusion server uses an RDMA network connection. For example, --filter "server.has-rdma=true".
  server.cuda-version (version)
    The CUDA version that is installed on a vSphere Bitfusion server. The value must be entered in X, X.Y, or X.Y.Z format. For example, --filter "server.cuda-version=11.3".
  server.driver-version (version)
    The NVIDIA driver version that is installed on a vSphere Bitfusion server. The value must be entered in X, X.Y, or X.Y.Z format. For example, --filter "server.driver-version=460.73".
For example, to list your GPU devices with a memory size greater than 16 GB, run the bitfusion list_gpus --filter "device.memory>16384" command.
To run an AI or ML workload only on GPU devices with the Ampere GPU microarchitecture, run the bitfusion run -n 1 --filter "device.capability=8.0" -- workload command. Similarly, to run the workload only on GPU devices with the Volta GPU microarchitecture, run the bitfusion run -n 1 --filter "device.capability=7.0" -- workload command.
Note: GPU devices with the Ampere GPU microarchitecture have a CUDA compute capability of 8.0, and GPU devices with the Volta GPU microarchitecture have a CUDA compute capability of 7.0. For more information, see the NVIDIA CUDA GPUs documentation.
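
The exact syntax for combining filters is not shown above. A plausible form, assuming that conditions are combined by repeating the --filter argument (an assumption, not confirmed behavior), is bitfusion list_gpus --filter "device.memory>=16384" --filter "server.has-rdma=true", which would list only GPUs with at least 16 GB of memory on servers with RDMA connections.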

How to run applications with an alternative list of vSphere Bitfusion servers

You can create an alternative vSphere Bitfusion server list and use the run, request_gpus, and list_gpus commands to start your applications on specific GPUs. The alternative list is a subset of the primary list of GPU servers that vSphere Bitfusion maintains in the ~/.bitfusion/servers.conf file.

vSphere Bitfusion supports IPv4 addresses only.

Procedure

  • Use a servers.conf file or create a server list.
    • To use an alternative servers.conf file of vSphere Bitfusion servers, run bitfusion run --servers value (or -s value), where value is the file path to a servers.conf file.
    • To create an alternative list of vSphere Bitfusion servers, run bitfusion run --server_list value (or -l value), where value is a server list in "ip_address:port;ip_address:port" format, as shown in the example after this procedure.
      Note: Enclose the list within quotes. A semicolon is used as the separator when you list multiple addresses, and without quotes the command-line interpreter can parse the list as multiple commands.
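
For example, to run a workload against two specific servers, where the addresses below are placeholders based on the server shown in earlier output, run bitfusion run -n 1 -l "172.16.31.241:56001;172.16.31.242:56001" -- python asimov_i.py --num_gpus=1 --batchsz=64.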

How to change the waiting time for available GPUs

When you start an application, by default, a vSphere Bitfusion client waits for up to 30 minutes for enough GPUs to be available.

To modify the default time that a vSphere Bitfusion client waits for available GPUs, you can add the --timeout value (or -t value) argument to the run and request_gpus commands.

For example, you can define the following values for the value argument.
  10    10 seconds
  10s   10 seconds
  10m   10 minutes
  10h   10 hours
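
For example, to wait up to 10 minutes for two GPUs to become available, run bitfusion run -n 2 -t 10m -- python asimov_i.py --num_gpus=2 --batchsz=64.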