The deep learning virtual machine images delivered as part of VMware Private AI Foundation with NVIDIA are preconfigured with popular ML libraries, frameworks, and toolkits, and are optimized and validated by NVIDIA and VMware for GPU acceleration in a VMware Cloud Foundation environment.

Data scientists can use the deep learning virtual machines provisioned from these images for AI prototyping, fine-tuning, validation, and inference.

The software stack for running AI applications on top of NVIDIA GPUs is validated in advance. As a result, you can start AI development directly, without spending time installing and validating the compatibility of operating systems, software libraries, ML frameworks, toolkits, and GPU drivers.

What Does a Deep Learning VM Image Contain?

The initial deep learning virtual machine image contains the following software. For information on the component versions in each deep learning VM image release, see VMware Deep Learning VM Release Notes.

Software Component Category Software Component
Embedded
  • Canonical Ubuntu
  • NVIDIA Container Toolkit
  • Docker Community Engine
Can be pre-installed automatically

Content Library for Deep Learning VM Images

Deep learning VM images are delivered as vSphere VM templates, hosted and published by VMware in a content library. You can use these images to deploy deep learning VMs by using the vSphere Client or VMware Aria Automation.

The content library with deep learning VM images for VMware Private AI Foundation with NVIDIA is available at the https://packages.vmware.com/dl-vm/lib.json URL. In a connected environment, you create a subscribed content library that points to this URL. In a disconnected environment, you create a local content library and upload to it the images that you download from the central URL.

OVF Properties of Deep Learning VMs

When you deploy a deep learning VM, you must fill in custom VM properties to automate the configuration of the Linux operating system, the deployment of the vGPU guest driver, and the deployment and configuration of NGC containers for the DL workloads.

Category Parameter Description
Base OS Properties instance-id Required. A unique instance ID for the VM instance.

An instance ID uniquely identifies an instance. When an instance ID changes, cloud-init treats the instance as a new instance and runs the cloud-init process again.

hostname Required. The host name of the appliance.
public-keys If provided, the instance populates the default user's SSH authorized_keys with this value.
user-data

A set of scripts or other metadata that is inserted into the VM at provisioning time.

This property holds the actual cloud-init script. This value must be base64 encoded.

password Required. The password for the default vmware user account.

vGPU Driver Installation

vgpu-license Required. The NVIDIA vGPU client configuration token. The token is saved in the /etc/nvidia/ClientConfigToken/client_configuration_token.tok file.
nvidia-portal-api-key

Required in a connected environment. The API key you downloaded from the NVIDIA Licensing Portal. The key is required for vGPU guest driver installation.

vgpu-fallback-version The version of the vGPU guest driver to fall back to if the version of the vGPU guest driver cannot be determined by using the entered license API key.
vgpu-url

Required in a disconnected environment. The URL to download the vGPU guest driver from.

DL Workload Automation registry-uri Required in a disconnected environment or if you plan to use a private container registry to avoid downloading images from the Internet. The URI of a private container registry with the deep learning workload container images.

Required if you are referring to a private registry in user-data or image-oneliner.

registry-user Required if you are using a private container registry that requires basic authentication.
registry-passwd Required if you are using a private container registry that requires basic authentication.
registry-2-uri Required if you are using a second private container registry that is based on Docker and requires basic authentication.
registry-2-user Required if you are using a second private container registry.
registry-2-passwd Required if you are using a second private container registry.
image-oneliner A one-line bash command that is run at VM provisioning. This value must be base64 encoded.

You can use this property to specify the DL workload container you want to deploy, such as PyTorch or TensorFlow. See Deep Learning Workloads in VMware Private AI Foundation with NVIDIA.

Note: If both user-data and image-oneliner are provided, the value of user-data is used.
docker-compose-uri URI of the Docker compose file. Required if you need a Docker compose file to start the DL workload container. This value must be base64 encoded.
config-json Configuration file for multiple container registry login operations when using a Docker compose file. This value must be base64 encoded.
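As an illustration of how these properties fit together, the following shell sketch assembles ovftool-style `--prop:` arguments for a deployment. All values (instance ID, host name, password, the cloud-init payload, and the commented-out deployment target) are placeholder assumptions, and ovftool itself is not invoked here:

```shell
#!/bin/sh
# Sketch: assemble ovftool-style --prop: arguments for a deep learning VM.
# All values below are placeholders; adapt them to your environment.

# The user-data property must be base64 encoded; a minimal cloud-init
# payload is encoded here as an example.
USER_DATA=$(printf '%s' '#cloud-config
runcmd:
  - nvidia-smi' | base64 | tr -d '\n')

set -- \
  --prop:instance-id=dlvm-example-01 \
  --prop:hostname=dlvm-example-01 \
  --prop:password=ExamplePassw0rd \
  --prop:user-data="$USER_DATA"

# A real deployment would append the OVF source and a vi:// target, e.g.:
#   ovftool "$@" dlvm.ova 'vi://administrator@vsphere.local@vcenter.example.com/'
printf '%s\n' "$@"
```

The same property values can be entered manually in the deployment wizard of the vSphere Client instead of being passed on a command line.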

Assign a Static IP Address to a Deep Learning VM in VMware Private AI Foundation with NVIDIA

By default, the deep learning VM images are configured with DHCP address assignment. To deploy a deep learning VM with a static IP address directly on a vSphere cluster, you must add code for the static network configuration to the cloud-init script.

On vSphere with Tanzu, IP address assignment is determined by the network configuration for the Supervisor in NSX.

Procedure

  1. Create a cloud-init script in plain-text format for the DL workload you plan to use.
  2. Add the following additional code to the cloud-init script.
    #cloud-config
    <instructions_for_your_DL_workload>
    
    manage_etc_hosts: true
     
    write_files:
      - path: /etc/netplan/50-cloud-init.yaml
        permissions: '0600'
        content: |
          network:
            version: 2
            renderer: networkd
            ethernets:
              ens33:
                dhcp4: false # disable DHCP4
                addresses: [x.x.x.x/x]  # Set the static IP address and mask
                routes:
                    - to: default
                      via: x.x.x.x # Configure gateway
                nameservers:
                  addresses: [x.x.x.x, x.x.x.x] # Provide the DNS server addresses. Separate multiple DNS server addresses with commas.
     
    runcmd:
      - netplan apply
  3. Encode the resulting cloud-init script in base64 format.
  4. Set the base64-encoded cloud-init script as the value of the user-data OVF property of the deep learning VM image.
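The encoding steps can be sketched as follows. The file name cloud-init.yaml and the minimal script content are assumptions for illustration, and the `-w0` flag assumes GNU coreutils:

```shell
#!/bin/sh
# Sketch of the encoding steps: base64-encode a cloud-init script for the
# user-data OVF property. A minimal script is created here only for illustration.
cat > cloud-init.yaml <<'EOF'
#cloud-config
runcmd:
  - netplan apply
EOF

# -w0 (GNU coreutils) keeps the encoded output on a single line, which is
# convenient for pasting into the user-data OVF property.
base64 -w0 cloud-init.yaml > user-data.b64

# Sanity check: decoding must reproduce the original script byte for byte.
base64 -d user-data.b64 | cmp -s - cloud-init.yaml && echo "round-trip OK"
```

On systems without GNU coreutils, piping the output of `base64` through `tr -d '\n'` achieves the same single-line result.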

Example: Assigning a Static IP Address to a CUDA Sample Workload

For an example deep learning VM with a CUDA Sample DL workload:

Deep Learning VM Element Example Value
DL workload image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
IP address 10.199.118.245
Subnet prefix /25
Gateway 10.199.118.253
DNS servers
  • 10.142.7.1
  • 10.132.7.1

you provide the following cloud-init code:

I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBkb2NrZXIgcnVuIC1kIG52Y3IuaW8vbnZpZGlhL2s4cy9jdWRhLXNhbXBsZTp2ZWN0b3JhZGQtY3VkYTExLjcuMS11Ymk4CgptYW5hZ2VfZXRjX2hvc3RzOiB0cnVlCiAKd3JpdGVfZmlsZXM6CiAgLSBwYXRoOiAvZXRjL25ldHBsYW4vNTAtY2xvdWQtaW5pdC55YW1sCiAgICBwZXJtaXNzaW9uczogJzA2MDAnCiAgICBjb250ZW50OiB8CiAgICAgIG5ldHdvcms6CiAgICAgICAgdmVyc2lvbjogMgogICAgICAgIHJlbmRlcmVyOiBuZXR3b3JrZAogICAgICAgIGV0aGVybmV0czoKICAgICAgICAgIGVuczMzOgogICAgICAgICAgICBkaGNwNDogZmFsc2UgIyBkaXNhYmxlIERIQ1A0CiAgICAgICAgICAgIGFkZHJlc3NlczogWzEwLjE5OS4xMTguMjQ1LzI1XSAgIyBTZXQgdGhlIHN0YXRpYyBJUCBhZGRyZXNzIGFuZCBtYXNrCiAgICAgICAgICAgIHJvdXRlczoKICAgICAgICAgICAgICAgIC0gdG86IGRlZmF1bHQKICAgICAgICAgICAgICAgICAgdmlhOiAxMC4xOTkuMTE4LjI1MyAjIENvbmZpZ3VyZSBnYXRld2F5CiAgICAgICAgICAgIG5hbWVzZXJ2ZXJzOgogICAgICAgICAgICAgIGFkZHJlc3NlczogWzEwLjE0Mi43LjEsIDEwLjEzMi43LjFdICMgUHJvdmlkZSB0aGUgRE5TIHNlcnZlciBhZGRyZXNzLiBTZXBhcmF0ZSBtdWxpdHBsZSBETlMgc2VydmVyIGFkZHJlc3NlcyB3aXRoIGNvbW1hcy4KIApydW5jbWQ6CiAgLSBuZXRwbGFuIGFwcGx5

which corresponds to the following script in plain-text format:

#cloud-config
write_files:
- path: /opt/dlvm/dl_app.sh
  permissions: '0755'
  content: |
    #!/bin/bash
    docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8

manage_etc_hosts: true
 
write_files:
  - path: /etc/netplan/50-cloud-init.yaml
    permissions: '0600'
    content: |
      network:
        version: 2
        renderer: networkd
        ethernets:
          ens33:
            dhcp4: false # disable DHCP4
            addresses: [10.199.118.245/25]  # Set the static IP address and mask
            routes:
                - to: default
                  via: 10.199.118.253 # Configure gateway
            nameservers:
              addresses: [10.142.7.1, 10.132.7.1] # Provide the DNS server addresses. Separate multiple DNS server addresses with commas.
 
runcmd:
  - netplan apply