NVIDIA GPU Cloud (NGC) is a GPU-accelerated cloud platform optimized for deep learning and scientific computing. To improve the performance of your AI and ML workflows, you can add a PVRDMA adapter to a vSphere Bitfusion client that is running in an NGC container.

AI and ML images for NGC contain Mellanox Open Fabrics Enterprise Distribution (MOFED) packages for RDMA support. The open source rdma-core packages, which PVRDMA supports, are not included in these images, and the two types of packages cannot coexist in the same container. To enable PVRDMA in your NGC container, you must remove the MOFED packages and install the open source rdma-core packages.

In this procedure, you create a Dockerfile, build an image from it, and run the image as a container. The example Dockerfile builds an image that is based on an NVIDIA container image for TensorRT, which includes an Ubuntu Linux distribution, and adds the open source rdma-core packages for that Ubuntu distribution.

Prerequisites

  • Verify that your vSphere environment has a PVRDMA distributed network configured. For more information, see the vSphere Networking documentation.
  • Verify that you have configured a vSphere Bitfusion server to use PVRDMA. See Configure a vSphere Bitfusion Server to Use PVRDMA.
  • Verify that you have Docker installed.
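
The Docker prerequisite can be checked from a shell before you start. The following is a minimal sketch, not part of the vSphere Bitfusion tooling: require_cmd is a hypothetical helper name, and the check only confirms that a docker executable is on the PATH, not that the Docker daemon is running.

```shell
# Hypothetical helper: succeed if the named command is found on PATH,
# otherwise print a message to stderr and fail.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing required command: $1" >&2
    return 1
}

# Example use on the Docker host (commented out so that the sketch has no side effects):
# require_cmd docker && sudo docker --version
```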

Procedure

  1. Obtain the UBUNTU_CODENAME and look for MOFED deb packages in the NVIDIA container image.
    You use the NVIDIA container image as a base image for your Dockerfile.
    1. In Docker, start the NGC container by running the sudo docker run -it --rm -u root --ipc=host --privileged --net=host --cap-add=IPC_LOCK --pid=host NGC_Container command, where NGC_Container is the name or URL of the NVIDIA container image for TensorRT. For example, nvcr.io/nvidia/tensorrt:20.12-py3.
    2. In the NGC container, obtain the UBUNTU_CODENAME by running the cat /etc/os-release command.
      For example, the UBUNTU_CODENAME for Ubuntu 20.04 is focal.
    3. List the contents of the /opt/mellanox/DEBS/ folder by running the ls -l /opt/mellanox/DEBS/* command.
    4. In the displayed file list, look for MOFED deb packages.
      For example, ibverbs-providers_51mlnx1-1.51246_amd64.deb, ibverbs-utils_51mlnx1-1.51246_amd64.deb, libibverbs-dev_51mlnx1-1.51246_amd64.deb, and libibverbs1_51mlnx1-1.51246_amd64.deb are MOFED deb packages.
      Note: Different NGC containers might contain different MOFED packages or no MOFED packages at all.
  2. Create a Dockerfile that uses the NVIDIA container image as its base image.
    For example, start the file with the FROM nvcr.io/nvidia/tensorrt:20.12-py3 command.
    1. Uninstall the MOFED deb packages.
      For example, add the RUN apt-get purge -y ibverbs-providers ibverbs-utils libibverbs-dev libibverbs1 command.
    2. Install the rdma-core packages for Ubuntu.
      For example, add the RUN apt-get update && apt-get install -y --reinstall -t focal rdma-core libibverbs1 ibverbs-providers infiniband-diags ibverbs-utils libcapstone3 perftest command, where focal is the UBUNTU_CODENAME of your Ubuntu 20.04 distribution.
    3. Build the NVIDIA TensorRT sample projects.
      For example, add the following commands.
      WORKDIR /workspace/tensorrt/samples
      RUN make -j4
    4. Install the required Python dependencies for NVIDIA TensorRT.
      For example, add the RUN /opt/tensorrt/python/python_setup.sh command.
    5. Install the MNIST dataset.
      For example, add the following commands.
      WORKDIR /workspace/tensorrt/data/mnist
      RUN python download_pgms.py
    6. Install the vSphere Bitfusion client for Ubuntu.
      For example, add the following commands.
      WORKDIR /workspace
      RUN wget https://packages.vmware.com/bitfusion/ubuntu/20.04/bitfusion-client-ubuntu2004_3.0.1-4_amd64.deb
      RUN apt-get update && DEBIAN_FRONTEND=noninteractive TZ=America/Los_Angeles apt-get install -y ./bitfusion-client-ubuntu2004_3.0.1-4_amd64.deb
    This is an example of a complete Dockerfile.
    FROM nvcr.io/nvidia/tensorrt:20.12-py3
    
    RUN apt-get purge -y ibverbs-providers ibverbs-utils libibverbs-dev libibverbs1
    
    RUN apt-get update && apt-get install -y --reinstall -t focal rdma-core libibverbs1 ibverbs-providers infiniband-diags ibverbs-utils libcapstone3 perftest
    
    WORKDIR /workspace/tensorrt/samples
    RUN make -j4
    
    RUN /opt/tensorrt/python/python_setup.sh
    
    WORKDIR /workspace/tensorrt/data/mnist
    RUN python download_pgms.py
    
    WORKDIR /workspace
    RUN wget https://packages.vmware.com/bitfusion/ubuntu/20.04/bitfusion-client-ubuntu2004_3.0.1-4_amd64.deb
    RUN apt-get update && DEBIAN_FRONTEND=noninteractive TZ=America/Los_Angeles apt-get install -y ./bitfusion-client-ubuntu2004_3.0.1-4_amd64.deb
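  3. By using Docker, build the image from the Dockerfile.
    For example, from the directory that contains the Dockerfile, run the sudo docker build -t Dockerimage_name . command, where Dockerimage_name is the name of the new image and the trailing period sets the current directory as the build context.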
  4. Enable the vSphere Bitfusion client.
  5. (Optional) Run the container and target a vSphere Bitfusion server that has a configured PVRDMA adapter.
    For example, run the following commands, where Dockerimage_name is the name of the new image and BF_Server_IP is the IP address of your vSphere Bitfusion server. Run the first command on the Docker host and the remaining commands inside the container.
    sudo docker run -it --rm -u root --ipc=host --privileged --net=host --cap-add=IPC_LOCK --pid=host Dockerimage_name
    cd /workspace/tensorrt/bin
    bitfusion run -n 1 -l BF_Server_IP -- ./sample_mnist
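
Steps 1 and 2 of the procedure connect two lookups to two Dockerfile lines: the UBUNTU_CODENAME feeds the -t option of apt-get install, and the MOFED deb file names determine which packages to purge. The following is a minimal sketch of those lookups, assuming the usual package_version_architecture.deb naming that the example file names in step 1 follow. The function names ubuntu_codename and deb_package_names are hypothetical and are not part of any NVIDIA or VMware tool.

```shell
# Hypothetical helper: print the UBUNTU_CODENAME value from
# os-release-style text read on standard input.
ubuntu_codename() {
    sed -n 's/^UBUNTU_CODENAME=//p' | tr -d '"'
}

# Hypothetical helper: print the package name for each .deb file given
# as an argument. Assumes the name is everything before the first
# underscore in the file name (package_version_architecture.deb).
deb_package_names() {
    for deb in "$@"; do
        base=${deb##*/}               # drop any leading directory path
        printf '%s\n' "${base%%_*}"   # keep the text before the first underscore
    done
}
```

Inside the NGC container, passing /etc/os-release to ubuntu_codename on standard input prints the codename, for example focal on Ubuntu 20.04. Passing the MOFED deb files that you found in step 1 to deb_package_names prints the names to use in the apt-get purge line of the Dockerfile.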

Results

You have created an NGC container image that contains a vSphere Bitfusion client with PVRDMA enabled for data traffic.
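
As a quick sanity check, you can confirm inside the new container that libibverbs1 no longer comes from MOFED. The following is a rough sketch under one stated assumption: MOFED package versions contain the string mlnx, as the example file names in the procedure do (51mlnx1-1.51246). The function names are hypothetical. This check does not replace the PVRDMA connection test.

```shell
# Hypothetical helper: succeed if a package version string looks like a
# MOFED build. Assumption: MOFED versions contain "mlnx".
is_mofed_version() {
    case "$1" in
        *mlnx*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper, meant to run inside the container: report whether
# the installed libibverbs1 package looks like a MOFED build.
report_libibverbs_origin() {
    version=$(dpkg-query -W -f='${Version}' libibverbs1 2>/dev/null) || {
        echo "libibverbs1 is not installed"
        return 1
    }
    if is_mofed_version "$version"; then
        echo "libibverbs1 $version looks like a MOFED build"
        return 1
    fi
    echo "libibverbs1 $version does not look like a MOFED build"
}
```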

What to do next

Test Your PVRDMA Network Connection in vSphere Bitfusion