When you deploy a deep learning VM in vSphere IaaS control plane by using kubectl or directly on a vSphere cluster, you must fill in custom VM properties.

For information about deep learning VM images in VMware Private AI Foundation with NVIDIA, see About Deep Learning VM Images in VMware Private AI Foundation with NVIDIA.

OVF Properties of Deep Learning VMs

When you deploy a deep learning VM, you must fill in custom VM properties to automate the configuration of the Linux operating system, the deployment of the vGPU guest driver, and the deployment and configuration of NGC containers for the DL workloads.

The latest deep learning VM image has the following OVF properties:

Category Parameter Label in the vSphere Client Description
Base OS Properties instance-id Instance ID Required. A unique instance ID for the VM instance.

An instance ID uniquely identifies an instance. When an instance ID changes, cloud-init treats the instance as a new instance and runs the cloud-init process again.

hostname Hostname Required. The host name of the appliance.
seedfrom URL to seed instance data from Optional. A URL from which to pull the value for the user-data parameter and the metadata.
public-keys SSH public key If provided, the instance populates the default user's SSH authorized_keys with this value.
user-data Encoded user-data

A set of scripts or other metadata that is inserted into the VM at provisioning time.

This property is the actual contents of the cloud-init script. This value must be base64 encoded.

password Default user's password Required. The password for the default vmware user account.

vGPU Driver Installation

vgpu-license vGPU license Required. The NVIDIA vGPU client configuration token. The token is saved in the /etc/nvidia/ClientConfigToken/client_configuration_token.tok file.
nvidia-portal-api-key NVIDIA Portal API key

Required in a connected environment. The API key you downloaded from the NVIDIA Licensing Portal. The key is required for vGPU guest driver installation.

vgpu-fallback-version vGPU host driver version Install this version of the vGPU guest driver directly.
vgpu-url URL for air-gapped vGPU downloads

Required in a disconnected environment. The URL to download the vGPU guest driver from. For information on the required configuration of the local Web server, see Preparing VMware Cloud Foundation for Private AI Workload Deployment.

DL Workload Automation registry-uri Registry URI Required in a disconnected environment or if you plan to use a private container registry to avoid downloading images from the Internet. The URI of a private container registry with the deep learning workload container images.

Required if you are referring to a private registry in user-data or image-oneliner.

registry-user Registry username Required if you are using a private container registry that requires basic authentication.
registry-passwd Registry password Required if you are using a private container registry that requires basic authentication.
registry-2-uri Secondary registry URI Required if you are using a second private container registry that is based on Docker and requires basic authentication.

For example, when deploying a deep learning VM with the NVIDIA RAG DL workload pre-installed, a pgvector image is downloaded from Docker Hub. You can use the registry-2- parameters to work around a pull rate limit for docker.io.

registry-2-user Secondary registry username Required if you are using a second private container registry.
registry-2-passwd Secondary registry password Required if you are using a second private container registry.
image-oneliner Encoded one-line command A one-line bash command that is run at VM provisioning. This value must be base64 encoded.

You can use this property to specify the DL workload container you want to deploy, such as PyTorch or TensorFlow. See Deep Learning Workloads in VMware Private AI Foundation with NVIDIA.

Caution: Avoid using both user-data and image-oneliner.
docker-compose-uri Encoded Docker compose file

Required if you need a Docker compose file to start the DL workload container. The contents of the docker-compose.yaml file that will be inserted into the virtual machine at provisioning after the virtual machine is started with GPU enabled. This value must be base64 encoded.

config-json Encoded config.json The contents of a configuration file for adding details for proxy servers. This value must be base64 encoded. See Configure a Deep Learning VM with a Proxy Server.
conda-environment-install Conda Environment Install A comma-separated list of Conda environments to be automatically installed after VM deployment is complete.

Available environments: pytorch2.3_py3.12
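
As a minimal sketch of producing the base64-encoded values that properties such as user-data, image-oneliner, docker-compose-uri, and config-json expect, the following shell snippet encodes a string on a single line and decodes it again as a check. The helper name encode_ovf_value is hypothetical. The sample command is the CUDA Sample one-liner that appears later in this document. The snippet assumes a base64 utility that accepts --decode, as GNU coreutils does.

```shell
#!/bin/bash
# Encode stdin as single-line base64.
# Stripping newlines with tr avoids relying on the GNU-only `base64 -w 0` flag.
# encode_ovf_value is a hypothetical helper name, not part of the DL VM image.
encode_ovf_value() {
  base64 | tr -d '\n'
}

# Example value for the image-oneliner property.
ONELINER='docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8'

# printf '%s' avoids appending a trailing newline to the encoded payload.
ENCODED=$(printf '%s' "$ONELINER" | encode_ovf_value)
echo "$ENCODED"

# Round-trip check: decoding must reproduce the original text.
DECODED=$(printf '%s' "$ENCODED" | base64 --decode)
[ "$DECODED" = "$ONELINER" ] && echo "round trip OK"
```

To encode a file instead of a string, for example a cloud-init script for user-data, the same helper can read it from standard input: encode_ovf_value < my-user-data.yaml, where the file name is a placeholder.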

Deep Learning Workloads in VMware Private AI Foundation with NVIDIA

You can provision a deep learning virtual machine with a supported deep learning (DL) workload in addition to its embedded components. The DL workloads are downloaded from the NVIDIA NGC catalog and are GPU-optimized and validated by NVIDIA and VMware by Broadcom.

For an overview of the deep learning VM images, see About Deep Learning VM Images in VMware Private AI Foundation with NVIDIA.

CUDA Sample

You can use a deep learning VM that runs CUDA samples to explore vector addition, gravitational n-body simulation, or other examples. See the CUDA Samples page.

After the deep learning VM is launched, it runs a CUDA sample workload to test the vGPU guest driver. You can examine the test output in the /var/log/dl.log file.

Table 1. CUDA Sample Container Image
Component Description
Container image
nvcr.io/nvidia/k8s/cuda-sample:ngc_image_tag
For example:
nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8

For information on the CUDA Sample container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs To deploy a CUDA Sample workload, you must set the OVF properties for the deep learning virtual machine in the following way:
  • Use one of the following properties that are specific to the CUDA Sample image.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          set_proxy "http" "https" "socks5"
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
          
          docker run -d $REGISTRY_URI_PATH/nvidia/k8s/cuda-sample:ngc_image_tag
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
        
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            mkdir -p /etc/systemd/system/docker.service.d
            echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf
            systemctl daemon-reload
            systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }

      For example, for vectoradd-cuda11.7.1-ubi8, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICBzZXRfcHJveHkgImh0dHAiICJodHRwcyIgInNvY2tzNSIKICAgIHRyYXAgJ2Vycm9yX2V4aXQgIlVuZXhwZWN0ZWQgZXJyb3Igb2NjdXJzIGF0IGRsIHdvcmtsb2FkIicgRVJSCiAgICBERUZBVUxUX1JFR19VUkk9Im52Y3IuaW8iCiAgICBSRUdJU1RSWV9VUklfUEFUSD0kKGdyZXAgcmVnaXN0cnktdXJpIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKCiAgICBpZiBbWyAteiAiJFJFR0lTVFJZX1VSSV9QQVRIIiBdXTsgdGhlbgogICAgICAjIElmIFJFR0lTVFJZX1VSSV9QQVRIIGlzIG51bGwgb3IgZW1wdHksIHVzZSB0aGUgZGVmYXVsdCB2YWx1ZQogICAgICBSRUdJU1RSWV9VUklfUEFUSD0kREVGQVVMVF9SRUdfVVJJCiAgICAgIGVjaG8gIlJFR0lTVFJZX1VSSV9QQVRIIHdhcyBlbXB0eS4gVXNpbmcgZGVmYXVsdDogJFJFR0lTVFJZX1VSSV9QQVRIIgogICAgZmkKICAgIAogICAgIyBJZiBSRUdJU1RSWV9VUklfUEFUSCBjb250YWlucyAnLycsIGV4dHJhY3QgdGhlIFVSSSBwYXJ0CiAgICBpZiBbWyAkUkVHSVNUUllfVVJJX1BBVEggPT0gKiIvIiogXV07IHRoZW4KICAgICAgUkVHSVNUUllfVVJJPSQoZWNobyAiJFJFR0lTVFJZX1VSSV9QQVRIIiB8IGN1dCAtZCcvJyAtZjEpCiAgICBlbHNlCiAgICAgIFJFR0lTVFJZX1VSST0kUkVHSVNUUllfVVJJX1BBVEgKICAgIGZpCiAgCiAgICBSRUdJU1RSWV9VU0VSTkFNRT0kKGdyZXAgcmVnaXN0cnktdXNlciAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICBSRUdJU1RSWV9QQVNTV09SRD0kKGdyZXAgcmVnaXN0cnktcGFzc3dkIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgIGlmIFtbIC1uICIkUkVHSVNUUllfVVNFUk5BTUUiICYmIC1uICIkUkVHSVNUUllfUEFTU1dPUkQiIF1dOyB0aGVuCiAgICAgIGRvY2tlciBsb2dpbiAtdSAkUkVHSVNUUllfVVNFUk5BTUUgLXAgJFJFR0lTVFJZX1BBU1NXT1JEICRSRUdJU1RSWV9VUkkKICAgIGVsc2UKICAgICAgZWNobyAiV2FybmluZzogdGhlIHJlZ2lzdHJ5J3MgdXNlcm5hbWUgYW5kIHBhc3N3b3JkIGFyZSBpbnZhbGlkLCBTa2lwcGluZyBEb2NrZXIgbG9naW4uIgogICAgZmkKICAgIAogICAgZG9ja2VyIHJ1biAtZCAkUkVHSVNUUllfVVJJX1BBVEgvbnZpZGlhL2s4cy9jdWRhLXNhbXBsZTp2ZWN0b3JhZGQtY3VkYTExLjcuMS11Ymk4CgotIHBhdGg6IC9vcHQvZGx2bS91dGlscy5zaAogIHBlcm1pc3Npb25zOiAnMDc1NScKICBjb250ZW50OiB8CiAgIC
AjIS9iaW4vYmFzaAogICAgZXJyb3JfZXhpdCgpIHsKICAgICAgZWNobyAiRXJyb3I6ICQxIiA+JjIKICAgICAgdm10b29sc2QgLS1jbWQgImluZm8tc2V0IGd1ZXN0aW5mby52bXNlcnZpY2UuYm9vdHN0cmFwLmNvbmRpdGlvbiBmYWxzZSwgRExXb3JrbG9hZEZhaWx1cmUsICQxIgogICAgICBleGl0IDEKICAgIH0KCiAgICBjaGVja19wcm90b2NvbCgpIHsKICAgICAgbG9jYWwgcHJveHlfdXJsPSQxCiAgICAgIHNoaWZ0CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCiAgICAgIGlmIFtbIC1uICIke3Byb3h5X3VybH0iIF1dOyB0aGVuCiAgICAgICAgbG9jYWwgcHJvdG9jb2w9JChlY2hvICIke3Byb3h5X3VybH0iIHwgYXdrIC1GICc6Ly8nICd7aWYgKE5GID4gMSkgcHJpbnQgJDE7IGVsc2UgcHJpbnQgIiJ9JykKICAgICAgICBpZiBbIC16ICIkcHJvdG9jb2wiIF07IHRoZW4KICAgICAgICAgIGVjaG8gIk5vIHNwZWNpZmljIHByb3RvY29sIHByb3ZpZGVkLiBTa2lwcGluZyBwcm90b2NvbCBjaGVjay4iCiAgICAgICAgICByZXR1cm4gMAogICAgICAgIGZpCiAgICAgICAgbG9jYWwgcHJvdG9jb2xfaW5jbHVkZWQ9ZmFsc2UKICAgICAgICBmb3IgdmFyIGluICIke3N1cHBvcnRlZF9wcm90b2NvbHNbQF19IjsgZG8KICAgICAgICAgIGlmIFtbICIke3Byb3RvY29sfSIgPT0gIiR7dmFyfSIgXV07IHRoZW4KICAgICAgICAgICAgcHJvdG9jb2xfaW5jbHVkZWQ9dHJ1ZQogICAgICAgICAgICBicmVhawogICAgICAgICAgZmkKICAgICAgICBkb25lCiAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2xfaW5jbHVkZWR9IiA9PSBmYWxzZSBdXTsgdGhlbgogICAgICAgICAgZXJyb3JfZXhpdCAiVW5zdXBwb3J0ZWQgcHJvdG9jb2w6ICR7cHJvdG9jb2x9LiBTdXBwb3J0ZWQgcHJvdG9jb2xzIGFyZTogJHtzdXBwb3J0ZWRfcHJvdG9jb2xzWypdfSIKICAgICAgICBmaQogICAgICBmaQogICAgfQoKICAgICMgJEA6IGxpc3Qgb2Ygc3VwcG9ydGVkIHByb3RvY29scwogICAgc2V0X3Byb3h5KCkgewogICAgICBsb2NhbCBzdXBwb3J0ZWRfcHJvdG9jb2xzPSgiJEAiKQoKICAgICAgQ09ORklHX0pTT05fQkFTRTY0PSQoZ3JlcCAnY29uZmlnLWpzb24nIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgICAgQ09ORklHX0pTT049JChlY2hvICR7Q09ORklHX0pTT05fQkFTRTY0fSB8IGJhc2U2NCAtLWRlY29kZSkKCiAgICAgIEhUVFBfUFJPWFlfVVJMPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5odHRwX3Byb3h5IC8vIGVtcHR5JykKICAgICAgSFRUUFNfUFJPWFlfVVJMPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5odHRwc19wcm94eSAvLyBlbXB0eScpCiAgICAgIGlmIFtbICQ/IC1uZSAwIHx8ICgteiAiJHtIVFRQX1BST1hZX1VSTH0iICYmIC16ICIke0hUVFBTX1BST1hZX1VSTH0iKSBdXTsgdGhlbgogICAgICAgIGVjaG8gIkluZm
86IFRoZSBjb25maWctanNvbiB3YXMgcGFyc2VkLCBidXQgbm8gcHJveHkgc2V0dGluZ3Mgd2VyZSBmb3VuZC4iCiAgICAgICAgcmV0dXJuIDAKICAgICAgZmkKICAKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUF9QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUFNfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCgogICAgICBpZiAhIGdyZXAgLXEgJ2h0dHBfcHJveHknIC9ldGMvZW52aXJvbm1lbnQ7IHRoZW4KICAgICAgICBlY2hvICJleHBvcnQgaHR0cF9wcm94eT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBodHRwc19wcm94eT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgbm9fcHJveHk9bG9jYWxob3N0LDEyNy4wLjAuMSIgPj4gL2V0Yy9lbnZpcm9ubWVudAogICAgICAgIHNvdXJjZSAvZXRjL2Vudmlyb25tZW50CiAgICAgIGZpCiAgICAgIAogICAgICAjIENvbmZpZ3VyZSBEb2NrZXIgdG8gdXNlIGEgcHJveHkKICAgICAgbWtkaXIgLXAgL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kCiAgICAgIGVjaG8gIltTZXJ2aWNlXQogICAgICBFbnZpcm9ubWVudD1cIkhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIkhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiTk9fUFJPWFk9bG9jYWxob3N0LDEyNy4wLjAuMVwiIiA+IC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZC9wcm94eS5jb25mCiAgICAgIHN5c3RlbWN0bCBkYWVtb24tcmVsb2FkCiAgICAgIHN5c3RlbWN0bCByZXN0YXJ0IGRvY2tlcgoKICAgICAgZWNobyAiSW5mbzogZG9ja2VyIGFuZCBzeXN0ZW0gZW52aXJvbm1lbnQgYXJlIG5vdyBjb25maWd1cmVkIHRvIHVzZSB0aGUgcHJveHkgc2V0dGluZ3MiCiAgICB9

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          set_proxy "http" "https" "socks5"
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
          
          docker run -d $REGISTRY_URI_PATH/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
        
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            mkdir -p /etc/systemd/system/docker.service.d
            echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf
            systemctl daemon-reload
            systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
    • Image one-liner. Encode it in base64 format.
      docker run -d nvcr.io/nvidia/k8s/cuda-sample:ngc_image_tag

      For example, for vectoradd-cuda11.7.1-ubi8, provide the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCBudmNyLmlvL252aWRpYS9rOHMvY3VkYS1zYW1wbGU6dmVjdG9yYWRkLWN1ZGExMS43LjEtdWJpOA==

      which corresponds to the following script in plain-text format:

      docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Provide values for the properties required for a disconnected environment as needed.

See OVF Properties of Deep Learning VMs.

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed and that the license is allocated, run the following command:

    nvidia-smi -q | grep -i license
  • Cloud-init script logs in /var/log/dl.log.
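
The registry handling in the cloud-init script above can be traced in isolation. The following sketch reads the registry-uri property with the same grep and sed pipeline, applies the nvcr.io default, and splits off the host part that is passed to docker login. It runs against a temporary file rather than /opt/dlvm/ovf-env.xml. The Property element layout, the registry name registry.example.com/dl-images, and the function names are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: resolve the registry host and image path the way the sample
# cloud-init script does, using a temporary stand-in for ovf-env.xml.

# Print the oe:value attribute of every line that mentions the given key.
read_ovf_property() {
  local key=$1 ovf_env=$2
  grep "$key" "$ovf_env" | sed -n 's/.*oe:value="\([^"]*\).*/\1/p'
}

# Set REGISTRY_URI_PATH (prefix of the image reference) and
# REGISTRY_URI (host part, used for docker login) from the property file.
resolve_registry() {
  local ovf_env=$1
  REGISTRY_URI_PATH=$(read_ovf_property registry-uri "$ovf_env")
  # Fall back to the NGC registry when the property is empty or missing.
  REGISTRY_URI_PATH=${REGISTRY_URI_PATH:-nvcr.io}
  # cut prints the whole string when it contains no '/', so no branch is needed.
  REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
}

# Hypothetical property file with a made-up private registry and project path.
OVF_ENV=$(mktemp)
cat > "$OVF_ENV" <<'EOF'
<Property oe:key="registry-uri" oe:value="registry.example.com/dl-images"/>
EOF

resolve_registry "$OVF_ENV"
echo "docker login host: $REGISTRY_URI"
echo "image: $REGISTRY_URI_PATH/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8"
rm -f "$OVF_ENV"
```

With the sample file, the login host resolves to registry.example.com and the image reference to registry.example.com/dl-images/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8. When registry-uri is empty, both values fall back to nvcr.io.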

PyTorch

You can use a deep learning VM with the PyTorch library to explore conversational AI, NLP, and other types of AI models. See the PyTorch page.

After the deep learning VM is launched, it starts a JupyterLab instance with PyTorch packages installed and configured.

Table 2. PyTorch Container Image
Component Description
Container image
nvcr.io/nvidia/pytorch:ngc_image_tag
For example:
nvcr.io/nvidia/pytorch:23.10-py3

For information on the PyTorch container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs To deploy a PyTorch workload, you must set the OVF properties for the deep learning virtual machine in the following way:
  • Use one of the following properties that are specific to the PyTorch image.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          docker run -d --gpus all -p 8888:8888 $REGISTRY_URI_PATH/nvidia/pytorch:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            mkdir -p /etc/systemd/system/docker.service.d
            echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf
            systemctl daemon-reload
            systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }

      For example, for pytorch:23.10-py3, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCgogICAgREVGQVVMVF9SRUdfVVJJPSJudmNyLmlvIgogICAgUkVHSVNUUllfVVJJX1BBVEg9JChncmVwIHJlZ2lzdHJ5LXVyaSAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCgogICAgaWYgW1sgLXogIiRSRUdJU1RSWV9VUklfUEFUSCIgXV07IHRoZW4KICAgICAgIyBJZiBSRUdJU1RSWV9VUklfUEFUSCBpcyBudWxsIG9yIGVtcHR5LCB1c2UgdGhlIGRlZmF1bHQgdmFsdWUKICAgICAgUkVHSVNUUllfVVJJX1BBVEg9JERFRkFVTFRfUkVHX1VSSQogICAgICBlY2hvICJSRUdJU1RSWV9VUklfUEFUSCB3YXMgZW1wdHkuIFVzaW5nIGRlZmF1bHQ6ICRSRUdJU1RSWV9VUklfUEFUSCIKICAgIGZpCiAgICAKICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggY29udGFpbnMgJy8nLCBleHRyYWN0IHRoZSBVUkkgcGFydAogICAgaWYgW1sgJFJFR0lTVFJZX1VSSV9QQVRIID09ICoiLyIqIF1dOyB0aGVuCiAgICAgIFJFR0lTVFJZX1VSST0kKGVjaG8gIiRSRUdJU1RSWV9VUklfUEFUSCIgfCBjdXQgLWQnLycgLWYxKQogICAgZWxzZQogICAgICBSRUdJU1RSWV9VUkk9JFJFR0lTVFJZX1VSSV9QQVRICiAgICBmaQogIAogICAgUkVHSVNUUllfVVNFUk5BTUU9JChncmVwIHJlZ2lzdHJ5LXVzZXIgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgUkVHSVNUUllfUEFTU1dPUkQ9JChncmVwIHJlZ2lzdHJ5LXBhc3N3ZCAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICBpZiBbWyAtbiAiJFJFR0lTVFJZX1VTRVJOQU1FIiAmJiAtbiAiJFJFR0lTVFJZX1BBU1NXT1JEIiBdXTsgdGhlbgogICAgICBkb2NrZXIgbG9naW4gLXUgJFJFR0lTVFJZX1VTRVJOQU1FIC1wICRSRUdJU1RSWV9QQVNTV09SRCAkUkVHSVNUUllfVVJJCiAgICBlbHNlCiAgICAgIGVjaG8gIldhcm5pbmc6IHRoZSByZWdpc3RyeSdzIHVzZXJuYW1lIGFuZCBwYXNzd29yZCBhcmUgaW52YWxpZCwgU2tpcHBpbmcgRG9ja2VyIGxvZ2luLiIKICAgIGZpCgogICAgZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC1wIDg4ODg6ODg4OCAkUkVHSVNUUllfVVJJX1BBVEgvbnZpZGlhL3B5dG9yY2g6MjMuMTAtcHkzIC91c3IvbG9jYWwvYmluL2p1cHl0ZXIgbGFiIC0tYWxsb3ctcm9vdCAtLWlwPSogLS1wb3J0PTg4ODggLS1uby1icm93c2
VyIC0tTm90ZWJvb2tBcHAudG9rZW49JycgLS1Ob3RlYm9va0FwcC5hbGxvd19vcmlnaW49JyonIC0tbm90ZWJvb2stZGlyPS93b3Jrc3BhY2UKCi0gcGF0aDogL29wdC9kbHZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1cHBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgcHJvdG9jb2xzCiAgICBzZXRfcHJveHkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV9VUkw9JChlY2hvICIke0
NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLXogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQoKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUF9QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUFNfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCgogICAgICBpZiAhIGdyZXAgLXEgJ2h0dHBfcHJveHknIC9ldGMvZW52aXJvbm1lbnQ7IHRoZW4KICAgICAgICBlY2hvICJleHBvcnQgaHR0cF9wcm94eT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBodHRwc19wcm94eT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgbm9fcHJveHk9bG9jYWxob3N0LDEyNy4wLjAuMSIgPj4gL2V0Yy9lbnZpcm9ubWVudAogICAgICAgIHNvdXJjZSAvZXRjL2Vudmlyb25tZW50CiAgICAgIGZpCiAgICAgIAogICAgICAjIENvbmZpZ3VyZSBEb2NrZXIgdG8gdXNlIGEgcHJveHkKICAgICAgbWtkaXIgLXAgL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kCiAgICAgIGVjaG8gIltTZXJ2aWNlXQogICAgICBFbnZpcm9ubWVudD1cIkhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIkhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiTk9fUFJPWFk9bG9jYWxob3N0LDEyNy4wLjAuMVwiIiA+IC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZC9wcm94eS5jb25mCiAgICAgIHN5c3RlbWN0bCBkYWVtb24tcmVsb2FkCiAgICAgIHN5c3RlbWN0bCByZXN0YXJ0IGRvY2tlcgoKICAgICAgZWNobyAiSW5mbzogZG9ja2VyIGFuZCBzeXN0ZW0gZW52aXJvbm1lbnQgYXJlIG5vdyBjb25maWd1cmVkIHRvIHVzZSB0aGUgcHJveHkgc2V0dGluZ3MiCiAgICB9

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          docker run -d --gpus all -p 8888:8888 $REGISTRY_URI_PATH/nvidia/pytorch:23.10-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            mkdir -p /etc/systemd/system/docker.service.d
            echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf
            systemctl daemon-reload
            systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
    • Image one-liner. Encode it in base64 format.
      docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace

      For example, for pytorch:23.10-py3, provide the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtcCA4ODg4Ojg4ODggbnZjci5pby9udmlkaWEvcHl0b3JjaDoyMy4xMC1weTMgL3Vzci9sb2NhbC9iaW4vanVweXRlciBsYWIgLS1hbGxvdy1yb290IC0taXA9KiAtLXBvcnQ9ODg4OCAtLW5vLWJyb3dzZXIgLS1Ob3RlYm9va0FwcC50b2tlbj0nJyAtLU5vdGVib29rQXBwLmFsbG93X29yaWdpbj0nKicgLS1ub3RlYm9vay1kaXI9L3dvcmtzcGFjZQ==

      which corresponds to the following script in plain-text format:

      docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch:23.10-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Provide values for the properties required for a disconnected environment as needed.

See OVF Properties of Deep Learning VMs.
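
The encoding step for the image one-liner above can be sketched as follows. This is a minimal sketch, assuming a Linux or macOS shell with the base64 utility available; the variable names are illustrative, and the image tag is the 23.10-py3 example from this page. Line wrapping is stripped so that the result can be pasted as a single property value.

```shell
#!/bin/bash
# Sketch: base64-encode a docker one-liner for a deep learning VM property.
# Variable names are illustrative; the tag is the example used on this page.
IMAGE_TAG="23.10-py3"
ONE_LINER="docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch:${IMAGE_TAG} /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace"

# printf avoids the trailing newline that echo would add to the encoded text.
# tr removes any line wrapping that base64 inserts into long output.
ENCODED=$(printf '%s' "${ONE_LINER}" | base64 | tr -d '\n')
echo "${ENCODED}"

# Decode again to confirm that the value round-trips unchanged.
DECODED=$(printf '%s' "${ENCODED}" | base64 --decode)
if [ "${DECODED}" = "${ONE_LINER}" ]; then
  echo "Round trip OK"
fi
```

The same pipeline applies to a full cloud-init script saved in a file: replace the printf call with a redirect from that file.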

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.
  • PyTorch container.

    To verify that the PyTorch container is running, run the sudo docker ps -a and sudo docker logs container_id commands.

  • JupyterLab instance that you can access at http://dl_vm_ip:8888.

    In the terminal of JupyterLab, verify that the following functionality is available in the notebook:

    • To verify that JupyterLab can access the vGPU resource, run nvidia-smi.
    • To verify that the PyTorch-related packages are installed, run pip show.

TensorFlow

You can use a deep learning VM with a TensorFlow library to explore conversational AI, NLP, and other types of AI models. See the TensorFlow page.

After the deep learning VM is launched, it starts a JupyterLab instance with TensorFlow packages installed and configured.

Table 3. TensorFlow Container Image
Component Description
Container image
nvcr.io/nvidia/tensorflow:ngc_image_tag

For example:

nvcr.io/nvidia/tensorflow:23.10-tf2-py3

For information on the TensorFlow container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs
To deploy a TensorFlow workload, you must set the OVF properties for the deep learning virtual machine in the following way:
  • Use one of the following properties that are specific for the TensorFlow image.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
          
          docker run -d --gpus all -p 8888:8888 $REGISTRY_URI_PATH/nvidia/tensorflow:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            mkdir -p /etc/systemd/system/docker.service.d
            echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf
            systemctl daemon-reload
            systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }

      For example, for tensorflow:23.10-tf2-py3, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCiAgICAKICAgIERFRkFVTFRfUkVHX1VSST0ibnZjci5pbyIKICAgIFJFR0lTVFJZX1VSSV9QQVRIPSQoZ3JlcCByZWdpc3RyeS11cmkgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQoKICAgIGlmIFtbIC16ICIkUkVHSVNUUllfVVJJX1BBVEgiIF1dOyB0aGVuCiAgICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggaXMgbnVsbCBvciBlbXB0eSwgdXNlIHRoZSBkZWZhdWx0IHZhbHVlCiAgICAgIFJFR0lTVFJZX1VSSV9QQVRIPSRERUZBVUxUX1JFR19VUkkKICAgICAgZWNobyAiUkVHSVNUUllfVVJJX1BBVEggd2FzIGVtcHR5LiBVc2luZyBkZWZhdWx0OiAkUkVHSVNUUllfVVJJX1BBVEgiCiAgICBmaQogICAgCiAgICAjIElmIFJFR0lTVFJZX1VSSV9QQVRIIGNvbnRhaW5zICcvJywgZXh0cmFjdCB0aGUgVVJJIHBhcnQKICAgIGlmIFtbICRSRUdJU1RSWV9VUklfUEFUSCA9PSAqIi8iKiBdXTsgdGhlbgogICAgICBSRUdJU1RSWV9VUkk9JChlY2hvICIkUkVHSVNUUllfVVJJX1BBVEgiIHwgY3V0IC1kJy8nIC1mMSkKICAgIGVsc2UKICAgICAgUkVHSVNUUllfVVJJPSRSRUdJU1RSWV9VUklfUEFUSAogICAgZmkKICAKICAgIFJFR0lTVFJZX1VTRVJOQU1FPSQoZ3JlcCByZWdpc3RyeS11c2VyIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgIFJFR0lTVFJZX1BBU1NXT1JEPSQoZ3JlcCByZWdpc3RyeS1wYXNzd2QgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgaWYgW1sgLW4gIiRSRUdJU1RSWV9VU0VSTkFNRSIgJiYgLW4gIiRSRUdJU1RSWV9QQVNTV09SRCIgXV07IHRoZW4KICAgICAgZG9ja2VyIGxvZ2luIC11ICRSRUdJU1RSWV9VU0VSTkFNRSAtcCAkUkVHSVNUUllfUEFTU1dPUkQgJFJFR0lTVFJZX1VSSQogICAgZWxzZQogICAgICBlY2hvICJXYXJuaW5nOiB0aGUgcmVnaXN0cnkncyB1c2VybmFtZSBhbmQgcGFzc3dvcmQgYXJlIGludmFsaWQsIFNraXBwaW5nIERvY2tlciBsb2dpbi4iCiAgICBmaQogICAgCiAgICBkb2NrZXIgcnVuIC1kIC0tZ3B1cyBhbGwgLXAgODg4ODo4ODg4ICRSRUdJU1RSWV9VUklfUEFUSC9udmlkaWEvdGVuc29yZmxvdzoyMy4xMC10ZjItcHkzIC91c3IvbG9jYWwvYmluL2p1cHl0ZXIgbGFiIC0tYWxsb3ctcm9vdCAtLWlwPSogLS1wb3J0PT
g4ODggLS1uby1icm93c2VyIC0tTm90ZWJvb2tBcHAudG9rZW49JycgLS1Ob3RlYm9va0FwcC5hbGxvd19vcmlnaW49JyonIC0tbm90ZWJvb2stZGlyPS93b3Jrc3BhY2UKCi0gcGF0aDogL29wdC9kbHZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1cHBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgcHJvdG9jb2xzCiAgICBzZXRfcHJveHkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV
9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLXogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQoKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUF9QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUFNfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCgogICAgICBpZiAhIGdyZXAgLXEgJ2h0dHBfcHJveHknIC9ldGMvZW52aXJvbm1lbnQ7IHRoZW4KICAgICAgICBlY2hvICJleHBvcnQgaHR0cF9wcm94eT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBodHRwc19wcm94eT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgbm9fcHJveHk9bG9jYWxob3N0LDEyNy4wLjAuMSIgPj4gL2V0Yy9lbnZpcm9ubWVudAogICAgICAgIHNvdXJjZSAvZXRjL2Vudmlyb25tZW50CiAgICAgIGZpCiAgICAgIAogICAgICAjIENvbmZpZ3VyZSBEb2NrZXIgdG8gdXNlIGEgcHJveHkKICAgICAgbWtkaXIgLXAgL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kCiAgICAgIGVjaG8gIltTZXJ2aWNlXQogICAgICBFbnZpcm9ubWVudD1cIkhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIkhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiTk9fUFJPWFk9bG9jYWxob3N0LDEyNy4wLjAuMVwiIiA+IC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZC9wcm94eS5jb25mCiAgICAgIHN5c3RlbWN0bCBkYWVtb24tcmVsb2FkCiAgICAgIHN5c3RlbWN0bCByZXN0YXJ0IGRvY2tlcgoKICAgICAgZWNobyAiSW5mbzogZG9ja2VyIGFuZCBzeXN0ZW0gZW52aXJvbm1lbnQgYXJlIG5vdyBjb25maWd1cmVkIHRvIHVzZSB0aGUgcHJveHkgc2V0dGluZ3MiCiAgICB9

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
          
          docker run -d --gpus all -p 8888:8888 $REGISTRY_URI_PATH/nvidia/tensorflow:23.10-tf2-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            mkdir -p /etc/systemd/system/docker.service.d
            echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf
            systemctl daemon-reload
            systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
    • Image one-liner. Encode it in base64 format.
      docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace

      For example, for tensorflow:23.10-tf2-py3, provide the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtcCA4ODg4Ojg4ODggbnZjci5pby9udmlkaWEvdGVuc29yZmxvdzoyMy4xMC10ZjItcHkzIC91c3IvbG9jYWwvYmluL2p1cHl0ZXIgbGFiIC0tYWxsb3ctcm9vdCAtLWlwPSogLS1wb3J0PTg4ODggLS1uby1icm93c2VyIC0tTm90ZWJvb2tBcHAudG9rZW49JycgLS1Ob3RlYm9va0FwcC5hbGxvd19vcmlnaW49JyonIC0tbm90ZWJvb2stZGlyPS93b3Jrc3BhY2U=

      which corresponds to the following script in plain-text format:

      docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow:23.10-tf2-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Provide values for the properties required for a disconnected environment as needed.

See OVF Properties of Deep Learning VMs.
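
The set_proxy function in the cloud-init script above reads the config-json OVF property as a base64-encoded JSON document and takes the proxy URLs from its http_proxy and https_proxy keys. The following is a minimal sketch of preparing such a value and of the scheme check that check_protocol performs. The proxy address proxy.example.com:3128 and the helper name proxy_scheme are hypothetical placeholders and are not part of the deep learning VM image.

```shell
#!/bin/bash
# Sketch: build a config-json value with proxy settings and inspect it.
# proxy.example.com:3128 and proxy_scheme are hypothetical placeholders.
CONFIG_JSON='{"http_proxy": "http://proxy.example.com:3128", "https_proxy": "http://proxy.example.com:3128"}'

# Encode without line wrapping so the value fits a single property field.
CONFIG_JSON_BASE64=$(printf '%s' "${CONFIG_JSON}" | base64 | tr -d '\n')
echo "${CONFIG_JSON_BASE64}"

# Print the scheme of a proxy URL, or nothing if the URL has no scheme.
# Uses the same awk field separator as check_protocol in the script above.
proxy_scheme() {
  printf '%s\n' "$1" | awk -F '://' '{ if (NF > 1) print $1 }'
}

proxy_scheme "http://proxy.example.com:3128"   # prints http
proxy_scheme "proxy.example.com:3128"          # prints nothing, check is skipped
proxy_scheme "ftp://proxy.example.com:3128"    # prints ftp, not in the supported list
```

Because the script calls set_proxy with "http", "https", and "socks5", a URL with any other scheme causes the workload script to exit with an error, while a URL without a scheme skips the check.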

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.
  • TensorFlow container.

    To verify that the TensorFlow container is running, run the sudo docker ps -a and sudo docker logs container_id commands.

  • JupyterLab instance that you can access at http://dl_vm_ip:8888.

    In the terminal of JupyterLab, verify that the following functionality is available in the notebook:

    • To verify that JupyterLab can access the vGPU resource, run nvidia-smi.
    • To verify that the TensorFlow-related packages are installed, run pip show.

DCGM Exporter

You can use a deep learning VM with a Data Center GPU Manager (DCGM) exporter to monitor the health of the GPUs used by a DL workload and to collect their metrics, using NVIDIA DCGM, Prometheus, and Grafana.

See the DCGM Exporter page.

In a deep learning VM, you run the DCGM Exporter container together with a DL workload that performs AI operations. After the deep learning VM is started, DCGM Exporter is ready to collect vGPU metrics and export the data to another application for further monitoring and visualization. You can run the monitored DL workload as a part of the cloud-init process or from the command line after the virtual machine is started.
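
As an illustration of consuming the exported data, the following minimal sketch filters one metric out of Prometheus text-format output. It assumes that the exporter serves its metrics on the /metrics path of the published port 9400 and that the DCGM_FI_DEV_GPU_UTIL field is enabled. The helper name gpu_metric is hypothetical, and the sample text is made up for the example, not captured from a real VM.

```shell
#!/bin/bash
# Sketch: extract the sample values of one metric from Prometheus text output.
# gpu_metric is a hypothetical helper; the sample data below is made up.
gpu_metric() {
  local name="$1"
  # Skip "# HELP" and "# TYPE" comment lines, keep samples of the named metric,
  # and print the last field of each matching line (the sample value).
  awk -v metric="${name}" '$0 !~ /^#/ && index($0, metric "{") == 1 { print $NF }'
}

SAMPLE='# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="dl-vm"} 37
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="dl-vm"} 1024'

printf '%s\n' "${SAMPLE}" | gpu_metric DCGM_FI_DEV_GPU_UTIL   # prints 37

# On a running deep learning VM, the same filter could read from the exporter:
#   curl -s http://dl_vm_ip:9400/metrics | gpu_metric DCGM_FI_DEV_GPU_UTIL
```

In a full monitoring setup, Prometheus scrapes the same endpoint on a schedule, so a filter like this is mainly useful as a quick manual check.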

Table 4. DCGM Exporter Container Image
Component Description
Container image
nvcr.io/nvidia/k8s/dcgm-exporter:ngc_image_tag

For example:

nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04

For information on the DCGM Exporter container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs
To deploy a DCGM Exporter workload, you must set the OVF properties for the deep learning virtual machine in the following way:
  • Use one of the following properties that are specific for the DCGM Exporter image.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 $REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter:ngc_image_tag
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            mkdir -p /etc/systemd/system/docker.service.d
            echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf
            systemctl daemon-reload
            systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }

      For example, for a deep learning VM with a pre-installed dcgm-exporter:3.2.5-3.1.8-ubuntu22.04 DCGM Exporter instance, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCiAgICAKICAgIERFRkFVTFRfUkVHX1VSST0ibnZjci5pbyIKICAgIFJFR0lTVFJZX1VSSV9QQVRIPSQoZ3JlcCByZWdpc3RyeS11cmkgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQoKICAgIGlmIFtbIC16ICIkUkVHSVNUUllfVVJJX1BBVEgiIF1dOyB0aGVuCiAgICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggaXMgbnVsbCBvciBlbXB0eSwgdXNlIHRoZSBkZWZhdWx0IHZhbHVlCiAgICAgIFJFR0lTVFJZX1VSSV9QQVRIPSRERUZBVUxUX1JFR19VUkkKICAgICAgZWNobyAiUkVHSVNUUllfVVJJX1BBVEggd2FzIGVtcHR5LiBVc2luZyBkZWZhdWx0OiAkUkVHSVNUUllfVVJJX1BBVEgiCiAgICBmaQogICAgCiAgICAjIElmIFJFR0lTVFJZX1VSSV9QQVRIIGNvbnRhaW5zICcvJywgZXh0cmFjdCB0aGUgVVJJIHBhcnQKICAgIGlmIFtbICRSRUdJU1RSWV9VUklfUEFUSCA9PSAqIi8iKiBdXTsgdGhlbgogICAgICBSRUdJU1RSWV9VUkk9JChlY2hvICIkUkVHSVNUUllfVVJJX1BBVEgiIHwgY3V0IC1kJy8nIC1mMSkKICAgIGVsc2UKICAgICAgUkVHSVNUUllfVVJJPSRSRUdJU1RSWV9VUklfUEFUSAogICAgZmkKICAKICAgIFJFR0lTVFJZX1VTRVJOQU1FPSQoZ3JlcCByZWdpc3RyeS11c2VyIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgIFJFR0lTVFJZX1BBU1NXT1JEPSQoZ3JlcCByZWdpc3RyeS1wYXNzd2QgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgaWYgW1sgLW4gIiRSRUdJU1RSWV9VU0VSTkFNRSIgJiYgLW4gIiRSRUdJU1RSWV9QQVNTV09SRCIgXV07IHRoZW4KICAgICAgZG9ja2VyIGxvZ2luIC11ICRSRUdJU1RSWV9VU0VSTkFNRSAtcCAkUkVHSVNUUllfUEFTU1dPUkQgJFJFR0lTVFJZX1VSSQogICAgZWxzZQogICAgICBlY2hvICJXYXJuaW5nOiB0aGUgcmVnaXN0cnkncyB1c2VybmFtZSBhbmQgcGFzc3dvcmQgYXJlIGludmFsaWQsIFNraXBwaW5nIERvY2tlciBsb2dpbi4iCiAgICBmaQoKICAgIGRvY2tlciBydW4gLWQgLS1ncHVzIGFsbCAtLWNhcC1hZGQgU1lTX0FETUlOIC0tcm0gLXAgOTQwMDo5NDAwICRSRUdJU1RSWV9VUklfUEFUSC9udmlkaWEvazhzL2RjZ20tZXhwb3J0ZXI6My4yLjUtMy4xLjgtdWJ1bnR1MjIuMDQKCi0gcGF0aDogL29wdC9kbH
ZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1cHBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgcHJvdG9jb2xzCiAgICBzZXRfcHJveHkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLX
ogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQoKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUF9QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUFNfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCgogICAgICBpZiAhIGdyZXAgLXEgJ2h0dHBfcHJveHknIC9ldGMvZW52aXJvbm1lbnQ7IHRoZW4KICAgICAgICBlY2hvICJleHBvcnQgaHR0cF9wcm94eT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBodHRwc19wcm94eT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgbm9fcHJveHk9bG9jYWxob3N0LDEyNy4wLjAuMSIgPj4gL2V0Yy9lbnZpcm9ubWVudAogICAgICAgIHNvdXJjZSAvZXRjL2Vudmlyb25tZW50CiAgICAgIGZpCiAgICAgIAogICAgICAjIENvbmZpZ3VyZSBEb2NrZXIgdG8gdXNlIGEgcHJveHkKICAgICAgbWtkaXIgLXAgL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kCiAgICAgIGVjaG8gIltTZXJ2aWNlXQogICAgICBFbnZpcm9ubWVudD1cIkhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIkhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiTk9fUFJPWFk9bG9jYWxob3N0LDEyNy4wLjAuMVwiIiA+IC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZC9wcm94eS5jb25mCiAgICAgIHN5c3RlbWN0bCBkYWVtb24tcmVsb2FkCiAgICAgIHN5c3RlbWN0bCByZXN0YXJ0IGRvY2tlcgoKICAgICAgZWNobyAiSW5mbzogZG9ja2VyIGFuZCBzeXN0ZW0gZW52aXJvbm1lbnQgYXJlIG5vdyBjb25maWd1cmVkIHRvIHVzZSB0aGUgcHJveHkgc2V0dGluZ3MiCiAgICB9
      which corresponds to the following script in plain-text format:
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 $REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            mkdir -p /etc/systemd/system/docker.service.d
            echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf
            systemctl daemon-reload
            systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      Note: You can also add the instructions for running the DL workload whose GPU performance you want to measure with DCGM Exporter to the cloud-init script.
    • Image one-liner. Encode it in base64 format.
      docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:ngc_image_tag-ubuntu22.04

      For example, for dcgm-exporter:3.2.5-3.1.8-ubuntu22.04, provide the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tY2FwLWFkZCBTWVNfQURNSU4gLS1ybSAtcCA5NDAwOjk0MDAgbnZjci5pby9udmlkaWEvazhzL2RjZ20tZXhwb3J0ZXI6My4yLjUtMy4xLjgtdWJ1bnR1MjIuMDQ=

      which corresponds to the following script in plain-text format:

      docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Provide values for the properties required for a disconnected environment as needed.

See OVF Properties of Deep Learning VMs.
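
As a rough illustration of the encoding step, the following sketch encodes the DCGM Exporter one-liner and decodes it again to confirm that the value survives the round trip. It assumes a Linux shell with the coreutils base64 tool. The variable names are illustrative and are not part of the deep learning VM image.

```shell
#!/bin/bash
# Only the docker command comes from the text above; the variable names are hypothetical.
ONE_LINER='docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04'

# printf avoids the trailing newline that echo would add to the encoded value.
# tr removes the line wraps that base64 inserts into long output by default.
ENCODED=$(printf '%s' "$ONE_LINER" | base64 | tr -d '\n')

# Decode again to check that nothing was lost.
DECODED=$(printf '%s' "$ENCODED" | base64 --decode)

echo "$ENCODED"
```

You can paste the printed value into the OVF property that carries the one-liner. Comparing DECODED with ONE_LINER before you deploy is a cheap way to catch copy-and-paste damage.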

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.
  • DCGM Exporter that you can access at http://dl_vm_ip:9400.

Next, in the deep learning VM, you run a DL workload, and visualize the data on another virtual machine by using Prometheus at http://visualization_vm_ip:9090 and Grafana at http://visualization_vm_ip:3000.

Run a DL Workload on the Deep Learning VM

Run the DL workload you want to collect vGPU metrics for and export the data to another application for further monitoring and visualization.

  1. Log in to the deep learning VM as vmware over SSH.
  2. Add the vmware user account to the docker group by running the following command. Log out and log back in for the new group membership to take effect.
    sudo usermod -aG docker ${USER}
  3. Run the container for the DL workload, pulling it from the NVIDIA NGC catalog or from a local container registry.

    For example, run the following command to start the tensorflow:23.10-tf2-py3 image from NVIDIA NGC:

    docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow:23.10-tf2-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
  4. Start using the DL workload for AI development.

Install Prometheus and Grafana

You can visualize and monitor the vGPU metrics from the DCGM Exporter virtual machine on a virtual machine running Prometheus and Grafana.

  1. Create a visualization VM with Docker Community Engine installed.
  2. Connect to the VM over SSH and create a YAML file for Prometheus.
    $ cat > prometheus.yml << EOF
    global:
      scrape_interval: 15s
      external_labels:
        monitor: 'codelab-monitor'
    scrape_configs:
      - job_name: 'dcgm'
        scrape_interval: 5s
        metrics_path: /metrics
        static_configs:
          - targets: ['dl_vm_with_dcgm_exporter_ip:9400']
    EOF
    
  3. Create a data path.
    $ mkdir grafana_data prometheus_data && chmod 777 grafana_data prometheus_data
    
  4. Create a Docker compose file to install Prometheus and Grafana.
    $ cat > compose.yaml << EOF
    services:
      prometheus:
        image: prom/prometheus:v2.47.2
        container_name: "prometheus0"
        restart: always
        ports:
          - "9090:9090"
        volumes:
          - "./prometheus.yml:/etc/prometheus/prometheus.yml"
          - "./prometheus_data:/prometheus"
      grafana:
        image: grafana/grafana:10.2.0-ubuntu
        container_name: "grafana0"
        ports:
          - "3000:3000"
        restart: always
        volumes:
          - "./grafana_data:/var/lib/grafana"
    EOF
    
  5. Start the Prometheus and Grafana containers.
    $ sudo docker compose up -d        
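
The Prometheus configuration from step 2 can also be generated from a variable, so that the DCGM Exporter address is filled in once. The following is a minimal sketch. write_prometheus_config is a hypothetical helper name, and 192.0.2.10 is a documentation-only placeholder address that you replace with the IP address of your deep learning VM.

```shell
#!/bin/bash
# Sketch: write prometheus.yml for a given DCGM Exporter address.
# write_prometheus_config is a hypothetical helper, not part of Prometheus.
write_prometheus_config() {
  local dl_vm_ip=$1
  local out_file=$2
  # The unquoted EOF lets the shell expand ${dl_vm_ip} inside the file body.
  cat > "$out_file" << EOF
global:
  scrape_interval: 15s
  external_labels:
    monitor: 'codelab-monitor'
scrape_configs:
  - job_name: 'dcgm'
    scrape_interval: 5s
    metrics_path: /metrics
    static_configs:
      - targets: ['${dl_vm_ip}:9400']
EOF
}

# Write into a scratch directory so that an existing file is not overwritten.
CONFIG_DIR=$(mktemp -d)
write_prometheus_config "192.0.2.10" "$CONFIG_DIR/prometheus.yml"
grep 'targets' "$CONFIG_DIR/prometheus.yml"
```

The target entry must be a quoted host:port string inside the brackets. A missing quote there makes the YAML invalid, and Prometheus cannot load the file.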
    

View vGPU Metrics in Prometheus

You can access Prometheus at http://visualization-vm-ip:9090. You can view the following vGPU information in the Prometheus UI:

Information UI Section
Raw vGPU metrics from the deep learning VM Status > Targets

To view the raw vGPU metrics from the deep learning VM, click the endpoint entry.

Graph expressions
  1. On the main navigation bar, click the Graph tab.
  2. Enter an expression and click Execute.

For more information on using Prometheus, see the Prometheus documentation.
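
Expressions that you enter on the Graph tab can also be sent to the Prometheus HTTP API at /api/v1/query. The following sketch only prints the curl command for a given expression and does not run it. prom_query_cmd is a hypothetical helper name. DCGM_FI_DEV_GPU_UTIL is used as an example metric, on the assumption that your DCGM Exporter configuration exports it.

```shell
#!/bin/bash
# Sketch: print (do not run) a curl command for a Prometheus instant query.
# prom_query_cmd is a hypothetical helper name.
prom_query_cmd() {
  local host=$1
  local expr=$2
  # -G sends the data as URL query parameters; --data-urlencode escapes the expression.
  printf "curl -sG 'http://%s:9090/api/v1/query' --data-urlencode 'query=%s'\n" "$host" "$expr"
}

# Example expressions: the raw metric and its 5-minute average.
prom_query_cmd "visualization-vm-ip" 'DCGM_FI_DEV_GPU_UTIL'
prom_query_cmd "visualization-vm-ip" 'avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])'
```

Replace visualization-vm-ip with the address of the visualization VM before you run a printed command.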

Visualize Metrics in Grafana

Set Prometheus as a data source for Grafana and visualize the vGPU metrics from the deep learning VM in a dashboard.

  1. Access Grafana at http://visualization-vm-ip:3000 by using the default user name admin and password admin.
  2. Add Prometheus as the first data source, connecting to visualization-vm-ip on port 9090.
  3. Create a dashboard with the vGPU metrics.

For more information on configuring a dashboard using a Prometheus data source, see the Grafana documentation.
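
As an alternative to adding the data source in the Grafana UI, Grafana can read data sources from provisioning files at startup. The following sketch writes such a file. It assumes that you mount the resulting directory into the grafana container at /etc/grafana/provisioning/datasources, which the compose file in this procedure does not do. write_grafana_datasource and the directory names are hypothetical.

```shell
#!/bin/bash
# Sketch: write a Grafana data source provisioning file for Prometheus.
# write_grafana_datasource and the output directory are assumptions.
write_grafana_datasource() {
  local prometheus_url=$1
  local out_dir=$2
  mkdir -p "$out_dir"
  cat > "$out_dir/prometheus.yaml" << EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${prometheus_url}
    isDefault: true
EOF
}

# Write into a scratch directory; replace the placeholder host with your VM address.
PROVISIONING_DIR=$(mktemp -d)/datasources
write_grafana_datasource "http://visualization-vm-ip:9090" "$PROVISIONING_DIR"
cat "$PROVISIONING_DIR/prometheus.yaml"
```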

Triton Inference Server

You can use a deep learning VM with a Triton Inference Server to load a model repository and receive inference requests.

See the Triton Inference Server page.

Table 5. Triton Inference Server Container Image
Component Description
Container image
nvcr.io/nvidia/tritonserver:ngc_image_tag

For example:

nvcr.io/nvidia/tritonserver:23.10-py3

For information on the Triton Inference Server container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs To deploy a Triton Inference Server workload, you must set the OVF properties for the deep learning virtual machine in the following way:
  • Use one of the following properties that are specific for the Triton Inference Server image.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          docker run -d --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/vmware/model_repository:/models $REGISTRY_URI_PATH/nvidia/tritonserver:ngc_image_tag tritonserver --model-repository=/models --model-control-mode=poll
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            mkdir -p /etc/systemd/system/docker.service.d
            echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf
            systemctl daemon-reload
            systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }

      For example, for tritonserver:23.10-py3, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCgogICAgREVGQVVMVF9SRUdfVVJJPSJudmNyLmlvIgogICAgUkVHSVNUUllfVVJJX1BBVEg9JChncmVwIHJlZ2lzdHJ5LXVyaSAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCgogICAgaWYgW1sgLXogIiRSRUdJU1RSWV9VUklfUEFUSCIgXV07IHRoZW4KICAgICAgIyBJZiBSRUdJU1RSWV9VUklfUEFUSCBpcyBudWxsIG9yIGVtcHR5LCB1c2UgdGhlIGRlZmF1bHQgdmFsdWUKICAgICAgUkVHSVNUUllfVVJJX1BBVEg9JERFRkFVTFRfUkVHX1VSSQogICAgICBlY2hvICJSRUdJU1RSWV9VUklfUEFUSCB3YXMgZW1wdHkuIFVzaW5nIGRlZmF1bHQ6ICRSRUdJU1RSWV9VUklfUEFUSCIKICAgIGZpCiAgICAKICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggY29udGFpbnMgJy8nLCBleHRyYWN0IHRoZSBVUkkgcGFydAogICAgaWYgW1sgJFJFR0lTVFJZX1VSSV9QQVRIID09ICoiLyIqIF1dOyB0aGVuCiAgICAgIFJFR0lTVFJZX1VSST0kKGVjaG8gIiRSRUdJU1RSWV9VUklfUEFUSCIgfCBjdXQgLWQnLycgLWYxKQogICAgZWxzZQogICAgICBSRUdJU1RSWV9VUkk9JFJFR0lTVFJZX1VSSV9QQVRICiAgICBmaQogIAogICAgUkVHSVNUUllfVVNFUk5BTUU9JChncmVwIHJlZ2lzdHJ5LXVzZXIgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgUkVHSVNUUllfUEFTU1dPUkQ9JChncmVwIHJlZ2lzdHJ5LXBhc3N3ZCAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICBpZiBbWyAtbiAiJFJFR0lTVFJZX1VTRVJOQU1FIiAmJiAtbiAiJFJFR0lTVFJZX1BBU1NXT1JEIiBdXTsgdGhlbgogICAgICBkb2NrZXIgbG9naW4gLXUgJFJFR0lTVFJZX1VTRVJOQU1FIC1wICRSRUdJU1RSWV9QQVNTV09SRCAkUkVHSVNUUllfVVJJCiAgICBlbHNlCiAgICAgIGVjaG8gIldhcm5pbmc6IHRoZSByZWdpc3RyeSdzIHVzZXJuYW1lIGFuZCBwYXNzd29yZCBhcmUgaW52YWxpZCwgU2tpcHBpbmcgRG9ja2VyIGxvZ2luLiIKICAgIGZpCgogICAgZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tcm0gLXAgODAwMDo4MDAwIC1wIDgwMDE6ODAwMSAtcCA4MDAyOjgwMDIgLXYgL2hvbWUvdm13YXJlL21vZGVsX3JlcG9zaXRvcnk6L21vZGVscyAkUkVHSVNUUllfVVJJX1BBVEgvbnZpZGlhL3RyaXRvbnNlcnZlcjoyMy
4xMC1weTMgdHJpdG9uc2VydmVyIC0tbW9kZWwtcmVwb3NpdG9yeT0vbW9kZWxzIC0tbW9kZWwtY29udHJvbC1tb2RlPXBvbGwKCi0gcGF0aDogL29wdC9kbHZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1cHBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgcHJvdG9jb2xzCiAgICBzZXRfcHJveHkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU0
9OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLXogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQoKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUF9QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUFNfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCgogICAgICBpZiAhIGdyZXAgLXEgJ2h0dHBfcHJveHknIC9ldGMvZW52aXJvbm1lbnQ7IHRoZW4KICAgICAgICBlY2hvICJleHBvcnQgaHR0cF9wcm94eT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBodHRwc19wcm94eT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgbm9fcHJveHk9bG9jYWxob3N0LDEyNy4wLjAuMSIgPj4gL2V0Yy9lbnZpcm9ubWVudAogICAgICAgIHNvdXJjZSAvZXRjL2Vudmlyb25tZW50CiAgICAgIGZpCiAgICAgIAogICAgICAjIENvbmZpZ3VyZSBEb2NrZXIgdG8gdXNlIGEgcHJveHkKICAgICAgbWtkaXIgLXAgL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kCiAgICAgIGVjaG8gIltTZXJ2aWNlXQogICAgICBFbnZpcm9ubWVudD1cIkhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIkhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiTk9fUFJPWFk9bG9jYWxob3N0LDEyNy4wLjAuMVwiIiA+IC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZC9wcm94eS5jb25mCiAgICAgIHN5c3RlbWN0bCBkYWVtb24tcmVsb2FkCiAgICAgIHN5c3RlbWN0bCByZXN0YXJ0IGRvY2tlcgoKICAgICAgZWNobyAiSW5mbzogZG9ja2VyIGFuZCBzeXN0ZW0gZW52aXJvbm1lbnQgYXJlIG5vdyBjb25maWd1cmVkIHRvIHVzZSB0aGUgcHJveHkgc2V0dGluZ3MiCiAgICB9

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          docker run -d --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/vmware/model_repository:/models $REGISTRY_URI_PATH/nvidia/tritonserver:23.10-py3 tritonserver --model-repository=/models --model-control-mode=poll
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            mkdir -p /etc/systemd/system/docker.service.d
            echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf
            systemctl daemon-reload
            systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
    • Image one-liner. Encode it in base64 format.
      docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver:ngc_image_tag tritonserver --model-repository=/models --model-control-mode=poll

      For example, for tritonserver:23.10-py3, provide the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tcm0gLXA4MDAwOjgwMDAgLXA4MDAxOjgwMDEgLXA4MDAyOjgwMDIgLXYgL2hvbWUvdm13YXJlL21vZGVsX3JlcG9zaXRvcnk6L21vZGVscyBudmNyLmlvL252aWRpYS90cml0b25zZXJ2ZXI6MjMuMTAtcHkzIHRyaXRvbnNlcnZlciAtLW1vZGVsLXJlcG9zaXRvcnk9L21vZGVscyAtLW1vZGVsLWNvbnRyb2wtbW9kZT1wb2xs

      which corresponds to the following script in plain-text format:

      docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver:23.10-py3 tritonserver --model-repository=/models --model-control-mode=poll
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Provide values for the properties required for a disconnected environment as needed.

See OVF Properties of Deep Learning VMs.
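
The templates above use ngc_image_tag as a placeholder that you must replace before encoding. The following sketch shows one way to do both steps together. render_and_encode is a hypothetical helper name. It reads the template on standard input and assumes that the tag contains no / or & characters, which sed would misread.

```shell
#!/bin/bash
# Sketch: fill in the ngc_image_tag placeholder, then base64-encode the result.
# render_and_encode is a hypothetical helper name.
render_and_encode() {
  local tag=$1
  # Replace every occurrence of the placeholder, encode, and strip line wraps.
  sed "s/ngc_image_tag/${tag}/g" | base64 | tr -d '\n'
}

TEMPLATE='docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver:ngc_image_tag tritonserver --model-repository=/models --model-control-mode=poll'

ENCODED=$(printf '%s' "$TEMPLATE" | render_and_encode "23.10-py3")

# Decode to confirm that the placeholder was replaced before you deploy.
printf '%s' "$ENCODED" | base64 --decode
```

For the full cloud-init template, save it to a file and redirect that file to the function instead, for example render_and_encode 23.10-py3 < cloud-init.tmpl, where cloud-init.tmpl is a hypothetical file name.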

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.
  • Triton Inference Server container.

    To verify that the Triton Inference Server container is running, run the sudo docker ps -a and sudo docker logs container_id commands.

The model repository for the Triton Inference Server is in /home/vmware/model_repository. Initially, the model repository is empty and the initial log of the Triton Inference Server instance shows that no model is loaded.

Create a Model Repository

To load your model for model inference, perform these steps:

  1. Create the model repository for your model.

    See the NVIDIA Triton Inference Server Model Repository documentation.

  2. Copy the model repository to /home/vmware/model_repository so that the Triton Inference Server can load it.
    sudo cp -r path_to_your_created_model_repository/* /home/vmware/model_repository/
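
A Triton model repository holds one directory per model, and each model directory holds one numeric subdirectory per model version. The following sketch creates that skeleton in a scratch location. scaffold_model_repo is a hypothetical helper name, and simple_sequence is only an example model name.

```shell
#!/bin/bash
# Sketch: create the directory skeleton of a Triton model repository.
# scaffold_model_repo is a hypothetical helper; it does not create any model files.
scaffold_model_repo() {
  local repo_root=$1
  local model_name=$2
  local version=${3:-1}
  # Each model has its own directory with one numeric subdirectory per version.
  mkdir -p "$repo_root/$model_name/$version"
  echo "Place the model file in: $repo_root/$model_name/$version/"
  echo "Place config.pbtxt (if the model needs one) in: $repo_root/$model_name/"
}

# Build the skeleton in a scratch directory before copying it to the VM path.
REPO_ROOT=$(mktemp -d)/model_repository
scaffold_model_repo "$REPO_ROOT" "simple_sequence"
```

After you add the model file and any configuration, copy the contents of the scratch repository to /home/vmware/model_repository as described in step 2.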
    

Send Model Inference Requests

  1. Verify that the Triton Inference Server is healthy and models are ready by running this command in the deep learning VM console.
    curl -v localhost:8000/v2/health/ready
  2. Send a request to the model by running this command on the deep learning VM.
     curl -v localhost:8000/v2/models/simple_sequence
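Because the server needs some time to start and to poll the model repository, a readiness check is often wrapped in a retry loop. The following is a minimal sketch, assuming the standard Triton HTTP readiness endpoint /v2/health/ready on port 8000; the function name wait_for_http_ok and its retry defaults are illustrative and not part of the deep learning VM image.

```shell
# Poll a URL until it answers with HTTP 200 or the attempts run out.
# Arguments: URL, number of attempts (default 30), delay in seconds (default 2).
wait_for_http_ok() {
  local url=$1
  local attempts=${2:-30}
  local delay=${3:-2}
  local i=1
  local code
  while [ "$i" -le "$attempts" ]; do
    # -w '%{http_code}' prints only the status code; the response body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
    if [ "$code" = "200" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example call (assumed endpoint):
#   wait_for_http_ok http://localhost:8000/v2/health/ready 30 2 && echo "Triton is ready"
```

The function returns a non-zero status if the server never reports readiness, so it can gate later steps in a script.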

For more information on using the Triton Inference Server, see the NVIDIA Triton Inference Server Model Repository documentation.

NVIDIA RAG

You can use a deep learning VM to build Retrieval Augmented Generation (RAG) solutions with a Llama2 model.

See the NVIDIA RAG Applications Docker Compose documentation (requires specific account permissions).

Table 6. NVIDIA RAG Container Image
Component Description
Container images and models rag-app-text-chatbot.yaml in the NVIDIA sample RAG pipeline.

For information on the NVIDIA RAG container applications that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs

To deploy an NVIDIA RAG workload, you must set the OVF properties for the deep learning virtual machine in the following way:
  • Enter a cloud-init script. Encode it in base64 format.

    For example, for version 24.03 of NVIDIA RAG, provide the following script:

    #cloud-config
    write_files:
    - path: /opt/dlvm/dl_app.sh
      permissions: '0755'
      content: |
        #!/bin/bash
        set -eu
        source /opt/dlvm/utils.sh
        trap 'error_exit "Unexpected error occurs at dl workload"' ERR
        set_proxy "http" "https"
    
        cat <<EOF > /opt/dlvm/config.json
        {
          "_comment": "This provides default support for RAG: TensorRT inference, llama2-13b model, and H100x2 GPU",
          "rag": {
            "org_name": "cocfwga8jq2c",
            "org_team_name": "no-team",
            "rag_repo_name": "nvidia/paif",
            "llm_repo_name": "nvidia/nim",
            "embed_repo_name": "nvidia/nemo-retriever",
            "rag_name": "rag-docker-compose",
            "rag_version": "24.03",
            "embed_name": "nv-embed-qa",
            "embed_type": "NV-Embed-QA",
            "embed_version": "4",
            "inference_type": "trt",
            "llm_name": "llama2-13b-chat",
            "llm_version": "h100x2_fp16_24.02",
            "num_gpu": "2",
            "hf_token": "huggingface token to pull llm model, update when using vllm inference",
            "hf_repo": "huggingface llm model repository, update when using vllm inference"
          }
        }
        EOF
        CONFIG_JSON=$(cat "/opt/dlvm/config.json")
        INFERENCE_TYPE=$(echo "${CONFIG_JSON}" | jq -r '.rag.inference_type')
        if [ "${INFERENCE_TYPE}" = "trt" ]; then
          required_vars=("ORG_NAME" "ORG_TEAM_NAME" "RAG_REPO_NAME" "LLM_REPO_NAME" "EMBED_REPO_NAME" "RAG_NAME" "RAG_VERSION" "EMBED_NAME" "EMBED_TYPE" "EMBED_VERSION" "LLM_NAME" "LLM_VERSION" "NUM_GPU")
        elif [ "${INFERENCE_TYPE}" = "vllm" ]; then
          required_vars=("ORG_NAME" "ORG_TEAM_NAME" "RAG_REPO_NAME" "LLM_REPO_NAME" "EMBED_REPO_NAME" "RAG_NAME" "RAG_VERSION" "EMBED_NAME" "EMBED_TYPE" "EMBED_VERSION" "LLM_NAME" "NUM_GPU" "HF_TOKEN" "HF_REPO")
        else
          error_exit "Inference type '${INFERENCE_TYPE}' is not recognized. No action will be taken."
        fi
        for index in "${!required_vars[@]}"; do
          key="${required_vars[$index]}"
          jq_query=".rag.${key,,} | select (.!=null)"
          value=$(echo "${CONFIG_JSON}" | jq -r "${jq_query}")
          if [[ -z "${value}" ]]; then 
            error_exit "${key} is required but not set."
          else
            eval ${key}=\""${value}"\"
          fi
        done
    
        RAG_URI="${RAG_REPO_NAME}/${RAG_NAME}:${RAG_VERSION}"
        EMBED_MODEL_URI="${EMBED_REPO_NAME}/${EMBED_NAME}:${EMBED_VERSION}"
    
        NGC_CLI_VERSION="3.41.2"
        NGC_CLI_URL="https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/${NGC_CLI_VERSION}/files/ngccli_linux.zip"
    
        mkdir -p /opt/data
        cd /opt/data
    
        if [ ! -f .file_downloaded ]; then
          # clean up
          rm -rf compose.env ${RAG_NAME}* ${LLM_NAME}* ngc* ${EMBED_NAME}* *.json .file_downloaded
    
          # install ngc-cli
          wget --content-disposition ${NGC_CLI_URL} -O ngccli_linux.zip && unzip ngccli_linux.zip
          export PATH=`pwd`/ngc-cli:${PATH}
    
          APIKEY=""
          REG_URI="nvcr.io"
    
          if [[ "$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')" == *"${REG_URI}"* ]]; then
            APIKEY=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          fi
    
          if [ -z "${APIKEY}" ]; then
              error_exit "No APIKEY found"
          fi
    
          # config ngc-cli
          mkdir -p ~/.ngc
    
          cat << EOF > ~/.ngc/config
          [CURRENT]
          apikey = ${APIKEY}
          format_type = ascii
          org = ${ORG_NAME}
          team = ${ORG_TEAM_NAME}
          ace = no-ace
        EOF
    
          # ngc docker login
          docker login nvcr.io -u \$oauthtoken -p ${APIKEY}
    
          # dockerhub login for general components, e.g. minio
          DOCKERHUB_URI=$(grep registry-2-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          DOCKERHUB_USERNAME=$(grep registry-2-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          DOCKERHUB_PASSWORD=$(grep registry-2-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
    
          if [[ -n "${DOCKERHUB_USERNAME}" && -n "${DOCKERHUB_PASSWORD}" ]]; then
            docker login -u ${DOCKERHUB_USERNAME} -p ${DOCKERHUB_PASSWORD}
          else
            echo "Warning: DockerHub not login"
          fi
    
          # get RAG files
          ngc registry resource download-version ${RAG_URI}
    
          # get llm model
          if [ "${INFERENCE_TYPE}" = "trt" ]; then
            LLM_MODEL_URI="${LLM_REPO_NAME}/${LLM_NAME}:${LLM_VERSION}"
            ngc registry model download-version ${LLM_MODEL_URI}
            chmod -R o+rX ${LLM_NAME}_v${LLM_VERSION}
            LLM_MODEL_FOLDER="/opt/data/${LLM_NAME}_v${LLM_VERSION}"
          elif [ "${INFERENCE_TYPE}" = "vllm" ]; then
            pip install huggingface_hub
            huggingface-cli login --token ${HF_TOKEN}
            huggingface-cli download --resume-download ${HF_REPO}/${LLM_NAME} --local-dir ${LLM_NAME} --local-dir-use-symlinks False
            LLM_MODEL_FOLDER="/opt/data/${LLM_NAME}"
            cat << EOF > ${LLM_MODEL_FOLDER}/model_config.yaml 
            engine:
              model: /model-store
              enforce_eager: false
              max_context_len_to_capture: 8192
              max_num_seqs: 256
              dtype: float16
              tensor_parallel_size: ${NUM_GPU}
              gpu_memory_utilization: 0.8
        EOF
            chmod -R o+rX ${LLM_MODEL_FOLDER}
            python3 -c "import yaml, json, sys; print(json.dumps(yaml.safe_load(sys.stdin.read())))" < "${RAG_NAME}_v${RAG_VERSION}/rag-app-text-chatbot.yaml"> rag-app-text-chatbot.json
            jq '.services."nemollm-inference".image = "nvcr.io/nvidia/nim/nim_llm:24.02-day0" |
                .services."nemollm-inference".command = "nim_vllm --model_name ${MODEL_NAME} --model_config /model-store/model_config.yaml" |
                .services."nemollm-inference".ports += ["8000:8000"] |
                .services."nemollm-inference".expose += ["8000"]' rag-app-text-chatbot.json > temp.json && mv temp.json rag-app-text-chatbot.json
            python3 -c "import yaml, json, sys; print(yaml.safe_dump(json.load(sys.stdin), default_flow_style=False, sort_keys=False))" < rag-app-text-chatbot.json > "${RAG_NAME}_v${RAG_VERSION}/rag-app-text-chatbot.yaml"
          fi
    
          # get embedding models
          ngc registry model download-version ${EMBED_MODEL_URI}
          chmod -R o+rX ${EMBED_NAME}_v${EMBED_VERSION}
    
          # config compose.env
          cat << EOF > compose.env
          export MODEL_DIRECTORY="${LLM_MODEL_FOLDER}"
          export MODEL_NAME=${LLM_NAME}
          export NUM_GPU=${NUM_GPU}
          export APP_CONFIG_FILE=/dev/null
          export EMBEDDING_MODEL_DIRECTORY="/opt/data/${EMBED_NAME}_v${EMBED_VERSION}"
          export EMBEDDING_MODEL_NAME=${EMBED_TYPE}
          export EMBEDDING_MODEL_CKPT_NAME="${EMBED_TYPE}-${EMBED_VERSION}.nemo"
        EOF
    
          touch .file_downloaded
        fi
    
        # start NGC RAG
        docker compose -f ${RAG_NAME}_v${RAG_VERSION}/docker-compose-vectordb.yaml up -d pgvector
        source compose.env; docker compose -f ${RAG_NAME}_v${RAG_VERSION}/rag-app-text-chatbot.yaml up -d
    
    - path: /opt/dlvm/utils.sh
      permissions: '0755'
      content: |
        #!/bin/bash
        error_exit() {
          echo "Error: $1" >&2
          vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
          exit 1
        }
    
        check_protocol() {
          local proxy_url=$1
          shift
          local supported_protocols=("$@")
          if [[ -n "${proxy_url}" ]]; then
            local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
            if [ -z "$protocol" ]; then
              echo "No specific protocol provided. Skipping protocol check."
              return 0
            fi
            local protocol_included=false
            for var in "${supported_protocols[@]}"; do
              if [[ "${protocol}" == "${var}" ]]; then
                protocol_included=true
                break
              fi
            done
            if [[ "${protocol_included}" == false ]]; then
              error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
            fi
          fi
        }
    
        # $@: list of supported protocols
        set_proxy() {
          local supported_protocols=("$@")
    
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
    
          HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
          HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
          if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
            echo "Info: The config-json was parsed, but no proxy settings were found."
            return 0
          fi
    
          check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
          check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
    
          if ! grep -q 'http_proxy' /etc/environment; then
            echo "export http_proxy=${HTTP_PROXY_URL}
            export https_proxy=${HTTPS_PROXY_URL}
            export HTTP_PROXY=${HTTP_PROXY_URL}
            export HTTPS_PROXY=${HTTPS_PROXY_URL}
            export no_proxy=localhost,127.0.0.1" >> /etc/environment
            source /etc/environment
          fi
          
          # Configure Docker to use a proxy
          mkdir -p /etc/systemd/system/docker.service.d
          echo "[Service]
          Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
          Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
          Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf
          systemctl daemon-reload
          systemctl restart docker
    
          echo "Info: docker and system environment are now configured to use the proxy settings"
        }
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Provide values for the properties required for a disconnected environment as needed.

See OVF Properties of Deep Learning VMs.

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.

    To track deployment progress, run tail -f /var/log/dl.log.

  • Sample chatbot Web application that you can access at http://dl_vm_ip:3001/orgs/nvidia/models/text-qa-chatbot

    You can upload your own knowledge base.

Assign a Static IP Address to a Deep Learning VM in VMware Private AI Foundation with NVIDIA

By default, the deep learning VM images are configured with DHCP address assignment. If you want to deploy a deep learning VM with a static IP address directly on a vSphere cluster, you must add network configuration code to the cloud-init script.

On vSphere with Tanzu, IP address assignment is determined by the network configuration for the Supervisor in NSX.

Procedure

  1. Create a cloud-init script in plain-text format for the DL workload you plan to use.
  2. Add the following additional code to the cloud-init script.
    #cloud-config
    <instructions_for_your_DL_workload>
    
    manage_etc_hosts: true
     
    write_files:
      - path: /etc/netplan/50-cloud-init.yaml
        permissions: '0600'
        content: |
          network:
            version: 2
            renderer: networkd
            ethernets:
              ens33:
                dhcp4: false # disable DHCP4
                addresses: [x.x.x.x/x]  # Set the static IP address and mask
                routes:
                    - to: default
                      via: x.x.x.x # Configure gateway
                nameservers:
                  addresses: [x.x.x.x, x.x.x.x] # Provide the DNS server addresses. Separate multiple DNS server addresses with commas.
     
    runcmd:
      - netplan apply
  3. Encode the resulting cloud-init script in base64 format.
  4. Set the base64-encoded cloud-init script as the value of the user-data OVF parameter of the deep learning VM image.
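Steps 3 and 4 can be sketched as follows. This is an illustration only: the file name cloud-init.yaml and the stand-in script content are assumptions, and the -w 0 option that turns off line wrapping is specific to GNU coreutils base64.

```shell
# Write a small stand-in cloud-init script; replace it with your real script.
workdir=$(mktemp -d)
cat > "$workdir/cloud-init.yaml" <<'EOF'
#cloud-config
runcmd:
  - netplan apply
EOF

# Encode without line wrapping so that the result is a single-line value
# that can be pasted into the user-data OVF property.
user_data=$(base64 -w 0 "$workdir/cloud-init.yaml")

# Sanity check: the decoded value must match the original file byte for byte.
if printf '%s' "$user_data" | base64 --decode | cmp -s - "$workdir/cloud-init.yaml"; then
  echo "round trip OK"
fi

# Print the encoded value.
printf '%s\n' "$user_data"
```

Checking the round trip before deployment catches truncated or wrapped values, which cloud-init would otherwise fail to parse at first boot.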

Example: Assigning a Static IP Address to a CUDA Sample Workload

For an example deep learning VM with a CUDA Sample DL workload:

Deep Learning VM Element Example Value
DL workload image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
IP address 10.199.118.245
Subnet prefix /25
Gateway 10.199.118.253
DNS servers
  • 10.142.7.1
  • 10.132.7.1

you provide the following cloud-init code in base64 format:

I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBkb2NrZXIgcnVuIC1kIG52Y3IuaW8vbnZpZGlhL2s4cy9jdWRhLXNhbXBsZTp2ZWN0b3JhZGQtY3VkYTExLjcuMS11Ymk4CgptYW5hZ2VfZXRjX2hvc3RzOiB0cnVlCiAKd3JpdGVfZmlsZXM6CiAgLSBwYXRoOiAvZXRjL25ldHBsYW4vNTAtY2xvdWQtaW5pdC55YW1sCiAgICBwZXJtaXNzaW9uczogJzA2MDAnCiAgICBjb250ZW50OiB8CiAgICAgIG5ldHdvcms6CiAgICAgICAgdmVyc2lvbjogMgogICAgICAgIHJlbmRlcmVyOiBuZXR3b3JrZAogICAgICAgIGV0aGVybmV0czoKICAgICAgICAgIGVuczMzOgogICAgICAgICAgICBkaGNwNDogZmFsc2UgIyBkaXNhYmxlIERIQ1A0CiAgICAgICAgICAgIGFkZHJlc3NlczogWzEwLjE5OS4xMTguMjQ1LzI1XSAgIyBTZXQgdGhlIHN0YXRpYyBJUCBhZGRyZXNzIGFuZCBtYXNrCiAgICAgICAgICAgIHJvdXRlczoKICAgICAgICAgICAgICAgIC0gdG86IGRlZmF1bHQKICAgICAgICAgICAgICAgICAgdmlhOiAxMC4xOTkuMTE4LjI1MyAjIENvbmZpZ3VyZSBnYXRld2F5CiAgICAgICAgICAgIG5hbWVzZXJ2ZXJzOgogICAgICAgICAgICAgIGFkZHJlc3NlczogWzEwLjE0Mi43LjEsIDEwLjEzMi43LjFdICMgUHJvdmlkZSB0aGUgRE5TIHNlcnZlciBhZGRyZXNzLiBTZXBhcmF0ZSBtdWxpdHBsZSBETlMgc2VydmVyIGFkZHJlc3NlcyB3aXRoIGNvbW1hcy4KIApydW5jbWQ6CiAgLSBuZXRwbGFuIGFwcGx5

which corresponds to the following script in plain-text format:

#cloud-config
write_files:
- path: /opt/dlvm/dl_app.sh
  permissions: '0755'
  content: |
    #!/bin/bash
    docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8

manage_etc_hosts: true
 
write_files:
  - path: /etc/netplan/50-cloud-init.yaml
    permissions: '0600'
    content: |
      network:
        version: 2
        renderer: networkd
        ethernets:
          ens33:
            dhcp4: false # disable DHCP4
            addresses: [10.199.118.245/25]  # Set the static IP address and mask
            routes:
                - to: default
                  via: 10.199.118.253 # Configure gateway
            nameservers:
              addresses: [10.142.7.1, 10.132.7.1] # Provide the DNS server address. Separate mulitple DNS server addresses with commas.
 
runcmd:
  - netplan apply
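
Before you encode a script like this one, it can help to confirm that the gateway lies in the same subnet as the static address, because a gateway outside the subnet is not directly reachable. The following sketch checks this with shell arithmetic; the helper names ip_to_int and same_subnet are hypothetical and are not part of the deep learning VM image.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The function body runs in a subshell so that the IFS change stays local.
ip_to_int() (
  IFS=.
  set -- $1   # split the address on "." into $1..$4
  echo "$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))"
)

# Return 0 if the gateway ($2) is in the subnet of the CIDR address ($1).
same_subnet() {
  local addr=${1%/*}     # part before the slash
  local prefix=${1#*/}   # part after the slash
  local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$addr") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
}

# For the example values, both addresses fall in 10.199.118.128/25.
if same_subnet 10.199.118.245/25 10.199.118.253; then
  echo "gateway is inside the subnet"
fi
```

The sketch assumes octets without leading zeros, because shell arithmetic would otherwise interpret them as octal numbers.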

Configure a Deep Learning VM with a Proxy Server

To connect your deep learning VM to the Internet in a disconnected environment where Internet access is over a proxy server, you must provide the proxy server details in the config.json file in the virtual machine.

Procedure

  1. Create a JSON file with the properties for the proxy server.
    Proxy server that does not require authentication
    {  
      "http_proxy": "protocol://ip-address-or-fqdn:port",
      "https_proxy": "protocol://ip-address-or-fqdn:port"
    }
    Proxy server that requires authentication
    {  
      "http_proxy": "protocol://username:password@ip-address-or-fqdn:port",
      "https_proxy": "protocol://username:password@ip-address-or-fqdn:port"
    }

    where:

    • protocol is the communication protocol used by the proxy server, such as http or https.
    • username and password are the credentials for authentication to the proxy server. If the proxy server does not require authentication, skip these parameters.
    • ip-address-or-fqdn is the IP address or host name of the proxy server.
    • port is the port number on which the proxy server listens for incoming requests.
  2. Encode the resulting JSON code in base64 format.
  3. When you deploy the deep learning VM image, add the encoded value to the config-json OVF property.
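
Steps 1 through 3 can be sketched in the shell as follows. The proxy address proxy.example.com:3128 is a hypothetical placeholder for a proxy server that does not require authentication, and the -w 0 option assumes GNU coreutils base64.

```shell
# Hypothetical proxy endpoint; replace it with the address of your proxy server.
proxy_url="http://proxy.example.com:3128"

# Build the JSON document with the two proxy properties.
config_json=$(printf '{\n  "http_proxy": "%s",\n  "https_proxy": "%s"\n}' "$proxy_url" "$proxy_url")

# Encode the JSON as a single-line base64 value for the config-json OVF property.
config_json_b64=$(printf '%s' "$config_json" | base64 -w 0)

printf '%s\n' "$config_json_b64"
```

Decoding the value with base64 --decode before deployment is a quick way to confirm that the JSON survived the encoding unchanged.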