NVIDIA GPU Cloud (NGC) is a GPU-accelerated cloud platform optimized for deep learning and scientific computing. To improve the performance of your AI and ML workflows, you can add a PVRDMA adapter to a vSphere Bitfusion client that runs in an NGC container.
AI and ML images for NGC contain Mellanox OpenFabrics Enterprise Distribution (MOFED) packages for RDMA support. The open source rdma-core packages, which are supported by PVRDMA, are not included in these images, and the two types of packages cannot coexist in the same container. To enable PVRDMA in your NGC container, you must remove the MOFED libraries and install the open source rdma-core packages.
By following this procedure, you create a Dockerfile that builds an image, which you can run as a container. The example Dockerfile creates an image based on an NVIDIA container image for TensorRT, which includes an Ubuntu Linux distribution, and adds the open source rdma-core packages for that Ubuntu distribution.
Prerequisites
- Verify that your vSphere environment has a PVRDMA distributed network configured. For more information, see the vSphere Networking documentation.
- Verify that you have configured a vSphere Bitfusion server to use PVRDMA. See How to configure a vSphere Bitfusion server to use PVRDMA.
- Verify that you have Docker installed.
Procedure
- Obtain the UBUNTU_CODENAME and look for MOFED deb packages in the NVIDIA container image.
You use the NVIDIA container image as a base image for your Dockerfile.
- In Docker, start the NGC container by running the
sudo docker run -it --rm -u root --ipc=host --privileged --net=host --cap-add=IPC_LOCK --pid=host NGC_Container
command, where NGC_Container is the name or URL of the NVIDIA container image for TensorRT. For example, nvcr.io/nvidia/tensorrt:20.12-py3.
- In the NGC container, obtain the UBUNTU_CODENAME by running the
cat /etc/os-release
command.
For example, the UBUNTU_CODENAME for Ubuntu 20.04 is focal.
- List the contents of the /opt/mellanox/DEBS/ folder by running the
ls -l /opt/mellanox/DEBS/*
command.
- In the displayed file list, look for MOFED deb packages.
For example, ibverbs-providers_51mlnx1-1.51246_amd64.deb, ibverbs-utils_51mlnx1-1.51246_amd64.deb, libibverbs-dev_51mlnx1-1.51246_amd64.deb, and libibverbs1_51mlnx1-1.51246_amd64.deb are MOFED deb packages.
Note: Different NGC containers might contain different MOFED packages or no MOFED packages at all.
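The two lookups in this step can also be scripted. The following is a minimal sketch and not part of the documented procedure: the function names ubuntu_codename and mofed_package_names are hypothetical, and the sketch assumes that the deb files follow the usual name_version_arch.deb naming, as the examples above do.

```shell
# Print the UBUNTU_CODENAME value from an os-release file.
# $1: path to the file (defaults to /etc/os-release).
ubuntu_codename() {
    sed -n 's/^UBUNTU_CODENAME=//p' "${1:-/etc/os-release}" | tr -d '"'
}

# Print the unique package names of all .deb files below a directory.
# $1: directory to search (defaults to /opt/mellanox/DEBS).
# Prints nothing if the directory is missing or holds no .deb files.
mofed_package_names() {
    find "${1:-/opt/mellanox/DEBS}" -name '*.deb' 2>/dev/null |
        sed -e 's|.*/||' -e 's|_.*||' |
        sort -u
}
```

Inside the NGC container, calling both functions without arguments reads the default paths. On Ubuntu 20.04, ubuntu_codename prints focal. An empty result from mofed_package_names matches the case in the note above, where the image ships no MOFED packages.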
- Create a Dockerfile.
For example, add the
FROM nvcr.io/nvidia/tensorrt:20.12-py3
command.
- Uninstall the MOFED deb packages.
For example, add the
RUN apt-get purge -y ibverbs-providers ibverbs-utils libibverbs-dev libibverbs1
command.
- Install the
rdma-core
packages for Ubuntu.
For example, add the
RUN apt-get update && apt-get install -y --reinstall -t focal rdma-core libibverbs1 ibverbs-providers infiniband-diags ibverbs-utils perftest
command, where focal is the UBUNTU_CODENAME of your Ubuntu 20.04 distribution.
- Build the NVIDIA TensorRT sample projects.
For example, add the following commands.
WORKDIR /workspace/tensorrt/samples
RUN make -j4
- Install the required Python dependencies for NVIDIA TensorRT.
For example, add the
RUN /opt/tensorrt/python/python_setup.sh
command.
- Install the MNIST dataset.
For example, add the following commands.
WORKDIR /workspace/tensorrt/data/mnist
RUN python download_pgms.py
- Install the vSphere Bitfusion client for Ubuntu.
For example, add the following commands.
WORKDIR /workspace
RUN wget https://packages.vmware.com/bitfusion/ubuntu/20.04/bitfusion-client-ubuntu2004_3.0.1-4_amd64.deb
RUN apt-get update && DEBIAN_FRONTEND=noninteractive TZ=America/Los_Angeles apt-get install -y ./bitfusion-client-ubuntu2004_3.0.1-4_amd64.deb
This is an example of a complete Dockerfile.
FROM nvcr.io/nvidia/tensorrt:20.12-py3
RUN apt-get purge -y ibverbs-providers ibverbs-utils libibverbs-dev libibverbs1
RUN apt-get update && apt-get install -y --reinstall -t focal rdma-core libibverbs1 ibverbs-providers infiniband-diags ibverbs-utils libcapstone3 perftest
WORKDIR /workspace/tensorrt/samples
RUN make -j4
RUN /opt/tensorrt/python/python_setup.sh
WORKDIR /workspace/tensorrt/data/mnist
RUN python download_pgms.py
WORKDIR /workspace
RUN wget https://packages.vmware.com/bitfusion/ubuntu/20.04/bitfusion-client-ubuntu2004_3.0.1-4_amd64.deb
RUN apt-get update && DEBIAN_FRONTEND=noninteractive TZ=America/Los_Angeles apt-get install -y ./bitfusion-client-ubuntu2004_3.0.1-4_amd64.deb
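Because the MOFED package list and the Ubuntu codename can differ between NGC images, the two RDMA-related Dockerfile lines can be generated instead of typed by hand. The following is a minimal sketch under stated assumptions: render_rdma_lines is a hypothetical helper, and the rdma-core package list is copied from the install step of this procedure (the complete example above additionally lists libcapstone3).

```shell
# Print the Dockerfile lines that replace MOFED with rdma-core.
# $1: Ubuntu codename, for example focal.
# $2 and later: MOFED package names to purge (may be omitted).
render_rdma_lines() {
    codename=$1
    shift
    # Skip the purge line when the base image has no MOFED packages.
    if [ "$#" -gt 0 ]; then
        printf 'RUN apt-get purge -y %s\n' "$*"
    fi
    printf 'RUN apt-get update && apt-get install -y --reinstall -t %s %s\n' \
        "$codename" \
        'rdma-core libibverbs1 ibverbs-providers infiniband-diags ibverbs-utils perftest'
}
```

For example, render_rdma_lines focal ibverbs-providers ibverbs-utils libibverbs-dev libibverbs1 prints the purge and install lines used in this procedure, which you can paste after the FROM line.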
- By using Docker, build the image from the Dockerfile.
For example, run the
sudo docker build -t Dockerimage_name .
command from the folder that contains the Dockerfile, where
Dockerimage_name is the name of the new image.
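After the build, you might want to confirm that no MOFED packages remain in the image. The following is a minimal sketch, not part of the documented procedure. It assumes that MOFED builds carry the string mlnx in their version, as the deb file names in the first step do, and mofed_leftovers is a hypothetical helper name.

```shell
# Read "package version" lines on stdin and print the names of packages
# whose version string contains "mlnx" (assumed to mark MOFED builds).
mofed_leftovers() {
    awk '$2 ~ /mlnx/ { print $1 }'
}

# Example use against the built image (Dockerimage_name as in the build step):
#   sudo docker run --rm Dockerimage_name \
#       dpkg-query -W -f='${Package} ${Version}\n' | mofed_leftovers
```

Empty output suggests that the purge step removed the MOFED packages. Any names that are printed are candidates to add to the apt-get purge line of the Dockerfile.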
- Activate the vSphere Bitfusion client.
- (Optional) Run a container from the new image and target a vSphere Bitfusion server that has a configured PVRDMA adapter.
For example, run the following commands, where Dockerimage_name is the name of the new image and BF_Server_IP is the IP address of your vSphere Bitfusion server.
sudo docker run -it --rm -u root --ipc=host --privileged --net=host --cap-add=IPC_LOCK --pid=host Dockerimage_name
cd /workspace/tensorrt/bin
bitfusion run -n 1 -l BF_Server_IP -- ./sample_mnist
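Before running a workload, it can help to check that the container sees an RDMA device at all. The following is a minimal sketch, not part of the documented procedure. It assumes that the PVRDMA adapter, like other Linux RDMA devices, is listed under /sys/class/infiniband once its driver is loaded, and rdma_devices is a hypothetical helper name.

```shell
# Print the RDMA devices the kernel exposes, one name per line.
# $1: sysfs directory to read (defaults to /sys/class/infiniband).
# Returns non-zero when the directory is missing or empty.
rdma_devices() {
    dir=${1:-/sys/class/infiniband}
    [ -d "$dir" ] || return 1
    found=$(ls -A "$dir")
    [ -n "$found" ] || return 1
    printf '%s\n' "$found"
}
```

Inside the running container, rdma_devices || echo "no RDMA device visible" gives a quick indication. The ibv_devices command from the ibverbs-utils package installed in the Dockerfile reports similar information.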
Results
You have created an NGC container image that contains a vSphere Bitfusion client with PVRDMA enabled for data traffic.