The VM service in the Supervisor in vSphere with Tanzu enables data scientists and DevOps engineers to deploy and run deep learning VMs by using the Kubernetes API.

As a data scientist or DevOps engineer, you use kubectl to deploy a deep learning VM on the namespace configured by the cloud administrator.

Prerequisites

Verify with the cloud administrator that all prerequisites are in place for the AI-ready infrastructure.

Procedure

  1. Log in to the Supervisor control plane.
    kubectl vsphere login --server=SUPERVISOR-CONTROL-PLANE-IP-ADDRESS-or-FQDN --vsphere-username USERNAME
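    For example, with a hypothetical Supervisor address and user name (replace both with values from your environment):

    kubectl vsphere login --server=192.0.2.10 --vsphere-username devops@vsphere.local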
  2. Verify that all required VM resources, such as VM classes and VM images, are in place on the namespace.

    See View VM Resources Available on a Namespace in vSphere with Tanzu.
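    For example, you can list the VM classes and VM images associated with a namespace by using commands along the following lines. The namespace name is illustrative; use the one configured by your cloud administrator.

    kubectl get vmclass -n example-dl-vm-namespace
    kubectl get vmimage -n example-dl-vm-namespace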

  3. Prepare the YAML file for the deep learning VM.

    Use the vm-operator-api and set the OVF properties as a ConfigMap object.

    For example, you can create a YAML specification file example-dl-vm.yaml for an example deep learning VM running PyTorch.

    apiVersion: vmoperator.vmware.com/v1alpha1
    kind: VirtualMachine
    metadata:
      name: example-dl-vm
      namespace: vpaif-ns
      labels:
        app: example-dl-app
    spec:
      className: gpu-a30
      imageName: vmi-xxxxxxxxxxxxx
      powerState: poweredOn
      storageClass: tanzu-storage-policy
      vmMetadata:
        configMapName: example-dl-vm-config
        transport: OvfEnv
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: example-dl-vm-config
      namespace: vpaif-ns
    data:
      user-data: I2Nsb3VkLWNvbmZpZwogICAgd3JpdGVfZmlsZXM6CiAgICAtIHBhdGg6IC9vcHQvZGx2bS9kbF9hcHAuc2gKICAgICAgcGVybWlzc2lvbnM6ICcwNzU1JwogICAgICBjb250ZW50OiB8CiAgICAgICAgIyEvYmluL2Jhc2gKICAgICAgICBkb2NrZXIgcnVuIC1kIC1wIDg4ODg6ODg4OCBudmNyLmlvL252aWRpYS9weXRvcmNoOjIzLjEwLXB5MyAvdXNyL2xvY2FsL2Jpbi9qdXB5dGVyIGxhYiAtLWFsbG93LXJvb3QgLS1pcD0qIC0tcG9ydD04ODg4IC0tbm8tYnJvd3NlciAtLU5vdGVib29rQXBwLnRva2VuPScnIC0tTm90ZWJvb2tBcHAuYWxsb3dfb3JpZ2luPScqJyAtLW5vdGVib29rLWRpcj0vd29ya3NwYWNl
      vgpu-license: NVIDIA-client-configuration-token
      nvidia-portal-api-key: API-key-from-NVIDIA-licensing-portal
      password: password-for-vmware-user
    Note: user-data is the base64-encoded value of the following cloud-init code:
    #cloud-config
        write_files:
        - path: /opt/dlvm/dl_app.sh
          permissions: '0755'
          content: |
            #!/bin/bash
            docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch:23.10-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
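    If you change the cloud-init code, you can regenerate the base64 value yourself, for example with the base64 utility on Linux. The file name is illustrative:

    base64 -w 0 cloud-init.txt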
    ---
    apiVersion: vmoperator.vmware.com/v1alpha1
    kind: VirtualMachineService
    metadata:
      name: example-dl-vm
      namespace: vpaif-ns
    spec:
      ports:
      - name: ssh
        port: 22
        protocol: TCP
        targetPort: 22
      - name: jupyterlab
        port: 8888
        protocol: TCP
        targetPort: 8888
      selector:
        app: example-dl-app
      type: LoadBalancer
  4. Switch to the context of the vSphere namespace created by the cloud administrator.

    For example, for a namespace called example-dl-vm-namespace.

    kubectl config use-context example-dl-vm-namespace
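    If you are not sure which contexts are available, you can list them first:

    kubectl config get-contexts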
  5. Deploy the deep learning VM.
    kubectl apply -f example-dl-vm.yaml
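    Optionally, you can first validate the manifest against the API server without creating any objects by using kubectl's standard dry-run option:

    kubectl apply -f example-dl-vm.yaml --dry-run=server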
  6. Verify that the VM has been created by running these commands.
    kubectl get vm -n example-dl-vm-namespace
    kubectl describe virtualmachine example-dl-vm
  7. Ping the IP address of the virtual machine assigned by the requested networking service.
    ping IP_address_returned_by_kubectl_describe

    To find the public address and the ports for accessing the deep learning VM, get the details of the load balancer service that has been created.

    kubectl get services
    NAME            TYPE           CLUSTER-IP              EXTERNAL-IP          PORT(S)                       AGE
    example-dl-vm   LoadBalancer   <internal-ip-address>   <public-IPaddress>   22:30473/TCP,8888:32180/TCP   9m40s
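    For example, assuming the image uses the vmware-user account referenced in the ConfigMap above, you can open an SSH session and reach the JupyterLab notebook at the external address. The address is a placeholder:

    ssh vmware-user@<public-IPaddress>
    # JupyterLab is available at http://<public-IPaddress>:8888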
    

The vGPU guest driver and the specified DL workload are installed the first time you start the deep learning VM.

You can examine the logs or open the JupyterLab notebook that comes with some of the images. See Deep Learning Workloads in VMware Private AI Foundation with NVIDIA.
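Because the DL workload in this example runs as a Docker container started by the cloud-init code, one way to examine its logs is from a shell inside the VM. A minimal sketch; the address and container ID are placeholders:

    ssh vmware-user@<public-IPaddress>
    sudo docker ps
    sudo docker logs --follow <container-ID>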