The VM service in the Supervisor in vSphere with Tanzu enables data scientists and DevOps engineers to deploy and run deep learning VMs by using the Kubernetes API.

As a data scientist or DevOps engineer, you use kubectl to deploy a deep learning VM on the namespace configured by the cloud administrator.

Prerequisites

Verify with the cloud administrator that all prerequisites are in place for the AI-ready infrastructure.

Procedure

  1. Log in to the Supervisor control plane.
    kubectl vsphere login --server=SUPERVISOR-CONTROL-PLANE-IP-ADDRESS-or-FQDN --vsphere-username USERNAME
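    For example, with a hypothetical Supervisor address and user name (replace both with values from your environment):

    kubectl vsphere login --server=192.0.2.10 --vsphere-username devops@vsphere.local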
  2. Verify that all required VM resources, such as VM classes and VM images, are in place on the namespace.

    See View VM Resources Available on a Namespace in vSphere with Tanzu.
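    For example, you can list the VM classes and VM images associated with a namespace by using commands along the following lines. The namespace name is illustrative; use the one configured by your cloud administrator.

    kubectl get vmclass -n example-dl-vm-namespace
    kubectl get vmimage -n example-dl-vm-namespace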

  3. Prepare the YAML file for the deep learning VM.

    Use the vm-operator-api and set the OVF properties as a ConfigMap object.

    For example, you can create a YAML specification file example-dl-vm.yaml for an example deep learning VM running PyTorch.

    apiVersion: vmoperator.vmware.com/v1alpha1
    kind: VirtualMachine
    metadata:
      name: example-dl-vm
      namespace: vpaif-ns
      labels:
        app: example-dl-app
    spec:
      className: gpu-a30
      imageName: vmi-xxxxxxxxxxxxx
      powerState: poweredOn
      storageClass: tanzu-storage-policy
      vmMetadata:
        configMapName: example-dl-vm-config
        transport: OvfEnv
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: example-dl-vm-config
      namespace: vpaif-ns
    data:
      user-data: I2Nsb3VkLWNvbmZpZwogICAgd3JpdGVfZmlsZXM6CiAgICAtIHBhdGg6IC9vcHQvZGx2bS9kbF9hcHAuc2gKICAgICAgcGVybWlzc2lvbnM6ICcwNzU1JwogICAgICBjb250ZW50OiB8CiAgICAgICAgIyEvYmluL2Jhc2gKICAgICAgICBkb2NrZXIgcnVuIC1kIC1wIDg4ODg6ODg4OCBudmNyLmlvL252aWRpYS9weXRvcmNoOjIzLjEwLXB5MyAvdXNyL2xvY2FsL2Jpbi9qdXB5dGVyIGxhYiAtLWFsbG93LXJvb3QgLS1pcD0qIC0tcG9ydD04ODg4IC0tbm8tYnJvd3NlciAtLU5vdGVib29rQXBwLnRva2VuPScnIC0tTm90ZWJvb2tBcHAuYWxsb3dfb3JpZ2luPScqJyAtLW5vdGVib29rLWRpcj0vd29ya3NwYWNl
      vgpu-license: NVIDIA-client-configuration-token
      nvidia-portal-api-key: API-key-from-NVIDIA-licensing-portal
      password: password-for-vmware-user
    Note: user-data is the base64-encoded value of the following cloud-init code:
    #cloud-config
        write_files:
        - path: /opt/dlvm/dl_app.sh
          permissions: '0755'
          content: |
            #!/bin/bash
            docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch:23.10-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
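    If you change the cloud-init code, you can regenerate the base64 value yourself, for example with the base64 utility on Linux. The file name is illustrative:

    base64 -w 0 cloud-init.txt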
    ---
    apiVersion: vmoperator.vmware.com/v1alpha1
    kind: VirtualMachineService
    metadata:
      name: example-dl-vm
      namespace: vpaif-ns
    spec:
      ports:
      - name: ssh
        port: 22
        protocol: TCP
        targetPort: 22
      - name: jupyterlab
        port: 8888
        protocol: TCP
        targetPort: 8888
      selector:
        app: example-dl-app
      type: LoadBalancer
  4. Switch to the context of the vSphere namespace created by the cloud administrator.

    For example, for a namespace called example-dl-vm-namespace.

    kubectl config use-context example-dl-vm-namespace
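    If you are not sure which contexts are available, you can list them first:

    kubectl config get-contexts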
  5. Deploy the deep learning VM.
    kubectl apply -f example-dl-vm.yaml
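    Optionally, you can first validate the manifest against the API server without creating any objects by using kubectl's standard dry-run option:

    kubectl apply -f example-dl-vm.yaml --dry-run=server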
  6. Verify that the VM has been created by running these commands.
    kubectl get vm -n example-dl-vm-namespace
    kubectl describe virtualmachine example-dl-vm
  7. Ping the IP address of the virtual machine assigned by the requested networking service.
    ping IP_address_returned_by_kubectl_describe

    To find the public address and the ports for accessing the deep learning VM, get the details of the load balancer service that has been created.

    kubectl get services
    NAME            TYPE           CLUSTER-IP              EXTERNAL-IP          PORT(S)                       AGE
    example-dl-vm   LoadBalancer   <internal-ip-address>   <public-IPaddress>   22:30473/TCP,8888:32180/TCP   9m40s
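    For example, assuming the image uses the vmware-user account referenced in the ConfigMap above, you can open an SSH session and reach the JupyterLab notebook at the external address. The address is a placeholder:

    ssh vmware-user@<public-IPaddress>
    # JupyterLab is available at http://<public-IPaddress>:8888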
    

The vGPU guest driver and the specified DL workload are installed the first time you start the deep learning VM.

You can examine the logs or open the JupyterLab notebook that comes with some of the images. See Deep Learning Workloads in VMware Private AI Foundation with NVIDIA.
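Because the DL workload in this example runs as a Docker container started by the cloud-init code, one way to examine its logs is from a shell inside the VM. A minimal sketch; the address and container ID are placeholders:

    ssh vmware-user@<public-IPaddress>
    sudo docker ps
    sudo docker logs --follow <container-ID>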