If you chose not to download and install the NVIDIA driver, CUDA library, and the NVIDIA Fabric Manager during the initial boot of the vSphere Bitfusion server virtual machine, you must install the software manually.

There are three methods for installing the NVIDIA software, depending on your vSphere Bitfusion cluster environment.
  • Installation directly from the Internet.
  • Installation in an air-gapped network environment with a local web server.
  • Installation in an air-gapped network environment without a local web server.

Install the NVIDIA Software for vSphere Bitfusion from the Internet

You can manually install the NVIDIA software for your vSphere Bitfusion deployment. Follow this procedure if you chose not to download and install the NVIDIA driver, CUDA library, and NVIDIA Fabric Manager during the initial boot of the vSphere Bitfusion server virtual machine (VM) and your vSphere Bitfusion server has access to the Internet.

You can skip this procedure if you chose to download and install the NVIDIA software during the initial boot of the vSphere Bitfusion server VM.

Prerequisites

  • The use of the NVIDIA driver implies acceptance of the NVIDIA Software License Agreement. See License For Customer Use of NVIDIA Software.
  • The NVIDIA driver certified for use with vSphere Bitfusion is NVIDIA-Linux-x86_64-460.73.01.run.
  • The CUDA library that is necessary for NCCL operations and certified for use with vSphere Bitfusion is cuda_11.2.2_460.32.03_linux.run.
  • The NVIDIA Fabric Manager package certified for use with vSphere Bitfusion is nvidia-fabricmanager-460-460.73.01-1.x86_64.rpm.
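The file names above encode their versions, and the driver and Fabric Manager entries both carry 460.73.01. As a minimal sketch, assuming you want a quick sanity check on the file names before installing, the following extracts the version from each name and compares the two. The version_from_name helper is hypothetical and is not part of vSphere Bitfusion.

```shell
#!/bin/sh
# Hypothetical helper: print the first x.y.z version string found in a file name.
version_from_name() {
    printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# File names taken from the prerequisites list.
driver_file=NVIDIA-Linux-x86_64-460.73.01.run
fm_file=nvidia-fabricmanager-460-460.73.01-1.x86_64.rpm

driver_version=$(version_from_name "$driver_file")
fm_version=$(version_from_name "$fm_file")

if [ "$driver_version" = "$fm_version" ]; then
    echo "Driver and Fabric Manager versions match: $driver_version"
else
    echo "Version mismatch: driver $driver_version, Fabric Manager $fm_version" >&2
fi
```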

Procedure

  1. Log in to the appliance shell of the vSphere Bitfusion server VM, where bitfusion_server_IP_address is the IP address of your vSphere Bitfusion server.
    ssh customer@bitfusion_server_IP_address
  2. To install the NVIDIA driver, CUDA library, and NVIDIA Fabric Manager, run the sudo install-nvidia-packages --defaults --yes command.
  3. Restart the VM.
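After the restart, you might want to confirm that the driver loaded. The following is a minimal sketch, assuming that the nvidia-smi utility that ships with the NVIDIA driver is on the PATH of the appliance shell. The require_tool helper is hypothetical and only guards against a missing binary.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the named command is on the PATH.
require_tool() {
    command -v "$1" >/dev/null 2>&1
}

if require_tool nvidia-smi; then
    # Lists the GPUs and the driver version that the loaded driver reports.
    nvidia-smi || echo "nvidia-smi ran but reported an error." >&2
else
    echo "nvidia-smi not found; the NVIDIA driver may not be installed yet." >&2
fi
```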

Results

As the vSphere Bitfusion server VM powers on, allow the VM to run for 10 minutes or longer before performing any further configuration tasks or operations. During this time, the vSphere Bitfusion server registers with vCenter Server.

What to do next

Verify That the vSphere Bitfusion Plug-In Registers with vCenter Server

Install the NVIDIA Software in an Air-Gapped Network Environment

You can manually install the NVIDIA software in an environment with an air-gapped network. Follow this procedure if you chose not to download and install the NVIDIA driver, CUDA library, and NVIDIA Fabric Manager during the initial boot of the vSphere Bitfusion server virtual machine (VM) and your vSphere Bitfusion server does not have access to the Internet.

You can skip this procedure if you chose to download and install the NVIDIA software during the initial boot of the vSphere Bitfusion server VM.

Prerequisites

Procedure

  1. On a machine with access to the Internet, create and navigate to the nvidia-packages folder.
    mkdir ~/nvidia-packages
    cd ~/nvidia-packages
  2. Download the NVIDIA driver, CUDA library, and NVIDIA Fabric Manager.
    wget http://us.download.nvidia.com/tesla/460.73.01/NVIDIA-Linux-x86_64-460.73.01.run
    wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
    wget http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/nvidia-fabricmanager-460-460.73.01-1.x86_64.rpm
  3. Move and install the NVIDIA software.
    Follow the procedure to install the NVIDIA driver, CUDA library, and NVIDIA Fabric Manager by using either a local web server or no web server, as appropriate for your vSphere Bitfusion network environment.
    Option Action
    With a local web server
    1. To copy the NVIDIA software to the root directory or a similar directory on the local web server, run the following scp command.
      scp ~/nvidia-packages/* mylogin@mylocalwebserver:/var/www/html/
    2. To log in to the local web server, run the ssh command.
      ssh mylogin@mylocalwebserver
    3. To give read permission to the NVIDIA software files, run the chmod command.
      chmod +r /var/www/html/*
    4. To log in to the vSphere Bitfusion server, run ssh customer@bitfusion_server_ip_address.
    5. To install the NVIDIA software from the local web server, run the install-nvidia-packages command.
      sudo install-nvidia-packages --yes --driver http://mylocalwebserver/NVIDIA-Linux-x86_64-460.73.01.run \
          --cuda http://mylocalwebserver/cuda_11.2.2_460.32.03_linux.run \
          --fm http://mylocalwebserver/nvidia-fabricmanager-460-460.73.01-1.x86_64.rpm
    No web server
    1. To copy the NVIDIA software to the vSphere Bitfusion server, run the scp command.
      scp NVIDIA-Linux-x86_64-460.73.01.run customer@bitfusion_server_ip_address:~/
      scp cuda_11.2.2_460.32.03_linux.run customer@bitfusion_server_ip_address:~/
      scp nvidia-fabricmanager-460-460.73.01-1.x86_64.rpm customer@bitfusion_server_ip_address:~/
    2. To log in to the vSphere Bitfusion server, run ssh customer@bitfusion_server_ip_address.
    3. To install the NVIDIA software from the local files, run the install-nvidia-packages command.
      sudo install-nvidia-packages --yes --driver NVIDIA-Linux-x86_64-460.73.01.run \
          --cuda cuda_11.2.2_460.32.03_linux.run \
          --fm nvidia-fabricmanager-460-460.73.01-1.x86_64.rpm
  4. Restart the VM.
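Before you copy the files out of ~/nvidia-packages in step 3, it can help to confirm that all three downloads from step 2 are present. The following is a minimal sketch to run on the machine with Internet access. The missing_packages helper is hypothetical, and the file names are the ones used in step 2.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each expected file that is absent
# from the given directory; prints nothing when all three are present.
missing_packages() {
    dir=$1
    for f in \
        NVIDIA-Linux-x86_64-460.73.01.run \
        cuda_11.2.2_460.32.03_linux.run \
        nvidia-fabricmanager-460-460.73.01-1.x86_64.rpm
    do
        [ -f "$dir/$f" ] || printf '%s\n' "$f"
    done
}

missing=$(missing_packages "$HOME/nvidia-packages")
if [ -z "$missing" ]; then
    echo "All NVIDIA packages are present."
else
    echo "Missing files:" >&2
    printf '%s\n' "$missing" >&2
fi
```

An empty result from missing_packages means that the folder is ready to copy to the local web server or to the vSphere Bitfusion server.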

Results

As the vSphere Bitfusion server VM powers on, allow the VM to run for 10 minutes or longer before performing any further configuration tasks or operations. During this time, the vSphere Bitfusion server registers with vCenter Server.

What to do next

Verify That the vSphere Bitfusion Plug-In Registers with vCenter Server