The VM Service on the Supervisor in vSphere IaaS control plane enables DevOps engineers to deploy and run Deep Learning VM instances by using the Kubernetes API.

As a DevOps engineer, use kubectl to deploy a Deep Learning VM instance on the namespace configured by the cloud administrator.

For information about the Deep Learning VM images in VMware Private AI Foundation with NVIDIA, see About the Deep Learning VM Images in VMware Private AI Foundation with NVIDIA.

Deploying a Deep Learning VM instance with NVIDIA RAG requires a vector database, for example, a PostgreSQL database with pgvector in VMware Data Services Manager. For information about deploying such a database and integrating it with the Deep Learning VM, see Deploy a Deep Learning VM with a RAG Workload.

Prerequisites

Verify with the cloud administrator that VMware Private AI Foundation with NVIDIA is deployed and configured. See Preparing VMware Cloud Foundation for Private AI Workload Deployment.

Procedure

  1. Log in to the Supervisor control plane.
    kubectl vsphere login --server=SUPERVISOR-CONTROL-PLANE-IP-ADDRESS-or-FQDN --vsphere-username USERNAME
  2. Verify that all required virtual machine resources, such as VM classes and VM images, are in place on the namespace.
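    For example, you can list these resources with kubectl. This is a sketch: the resource names follow the VM Service API and the printed details can vary across vSphere versions, and the namespace shown matches the sample spec in the next step.
    kubectl get virtualmachineclasses -n example-dl-vm-namespace
    kubectl get virtualmachineimages -n example-dl-vm-namespace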
  3. Prepare the YAML file for the deep learning virtual machine.

    Use the vm-operator-api, setting the OVF properties as a ConfigMap object. For information about the available OVF properties, see OVF Properties of Deep Learning VMs.

    For example, you can create a YAML specification example-dl-vm.yaml for a sample Deep Learning VM instance running PyTorch in a connected environment.

    apiVersion: vmoperator.vmware.com/v1alpha1
    kind: VirtualMachine
    metadata:
      name: example-dl-vm
      namespace: example-dl-vm-namespace
      labels:
        app: example-dl-app
    spec:
      className: gpu-a100
      imageName: vmi-xxxxxxxxxxxxx
      powerState: poweredOn
      storageClass: tanzu-storage-policy
      vmMetadata:
        configMapName: example-dl-vm-config
        transport: OvfEnv
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: example-dl-vm-config
      namespace: example-dl-vm-namespace
    data:
      user-data: I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCgogICAgREVGQVVMVF9SRUdfVVJJPSJudmNyLmlvIgogICAgUkVHSVNUUllfVVJJX1BBVEg9JChncmVwIHJlZ2lzdHJ5LXVyaSAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCgogICAgaWYgW1sgLXogIiRSRUdJU1RSWV9VUklfUEFUSCIgXV07IHRoZW4KICAgICAgIyBJZiBSRUdJU1RSWV9VUklfUEFUSCBpcyBudWxsIG9yIGVtcHR5LCB1c2UgdGhlIGRlZmF1bHQgdmFsdWUKICAgICAgUkVHSVNUUllfVVJJX1BBVEg9JERFRkFVTFRfUkVHX1VSSQogICAgICBlY2hvICJSRUdJU1RSWV9VUklfUEFUSCB3YXMgZW1wdHkuIFVzaW5nIGRlZmF1bHQ6ICRSRUdJU1RSWV9VUklfUEFUSCIKICAgIGZpCiAgICAKICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggY29udGFpbnMgJy8nLCBleHRyYWN0IHRoZSBVUkkgcGFydAogICAgaWYgW1sgJFJFR0lTVFJZX1VSSV9QQVRIID09ICoiLyIqIF1dOyB0aGVuCiAgICAgIFJFR0lTVFJZX1VSST0kKGVjaG8gIiRSRUdJU1RSWV9VUklfUEFUSCIgfCBjdXQgLWQnLycgLWYxKQogICAgZWxzZQogICAgICBSRUdJU1RSWV9VUkk9JFJFR0lTVFJZX1VSSV9QQVRICiAgICBmaQogIAogICAgUkVHSVNUUllfVVNFUk5BTUU9JChncmVwIHJlZ2lzdHJ5LXVzZXIgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgUkVHSVNUUllfUEFTU1dPUkQ9JChncmVwIHJlZ2lzdHJ5LXBhc3N3ZCAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICBpZiBbWyAtbiAiJFJFR0lTVFJZX1VTRVJOQU1FIiAmJiAtbiAiJFJFR0lTVFJZX1BBU1NXT1JEIiBdXTsgdGhlbgogICAgICBkb2NrZXIgbG9naW4gLXUgJFJFR0lTVFJZX1VTRVJOQU1FIC1wICRSRUdJU1RSWV9QQVNTV09SRCAkUkVHSVNUUllfVVJJCiAgICBlbHNlCiAgICAgIGVjaG8gIldhcm5pbmc6IHRoZSByZWdpc3RyeSdzIHVzZXJuYW1lIGFuZCBwYXNzd29yZCBhcmUgaW52YWxpZCwgU2tpcHBpbmcgRG9ja2VyIGxvZ2luLiIKICAgIGZpCgogICAgZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC1wIDg4ODg6ODg4OCAkUkVHSVNUUllfVVJJX1BBVEgvbnZpZGlhL3B5dG9yY2g6MjMuMTAtcHkzIC91c3IvbG9jYWwvYmluL2p1cHl0ZXIgbGFiIC0tYWxsb3ctcm9vdCAtLWlwPSogLS1wb3J0PTg4ODggLS1uby1icm93c2VyIC0tTm90ZWJvb2tBcHAudG9rZW49JycgLS1Ob3RlYm9va0FwcC5hbGxvd19vcmlnaW49JyonIC0tbm90ZWJvb2stZGlyPS93b3Jrc3BhY2UKCi0gcGF0aDogL29wdC9kbHZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1cHBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgcHJvdG9jb2xzCiAgICBzZXRfcHJveHkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLXogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQoKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUF9QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUFNfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCgogICAgICBpZiAhIGdyZXAgLXEgJ2h0dHBfcHJveHknIC9ldGMvZW52aXJvbm1lbnQ7IHRoZW4KICAgICAgICBlY2hvICJleHBvcnQgaHR0cF9wcm94eT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBodHRwc19wcm94eT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgbm9fcHJveHk9bG9jYWxob3N0LDEyNy4wLjAuMSIgPj4gL2V0Yy9lbnZpcm9ubWVudAogICAgICAgIHNvdXJjZSAvZXRjL2Vudmlyb25tZW50CiAgICAgIGZpCiAgICAgIAogICAgICAjIENvbmZpZ3VyZSBEb2NrZXIgdG8gdXNlIGEgcHJveHkKICAgICAgbWtkaXIgLXAgL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kCiAgICAgIGVjaG8gIltTZXJ2aWNlXQogICAgICBFbnZpcm9ubWVudD1cIkhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIkhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiTk9fUFJPWFk9bG9jYWxob3N0LDEyNy4wLjAuMVwiIiA+IC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZC9wcm94eS5jb25mCiAgICAgIHN5c3RlbWN0bCBkYWVtb24tcmVsb2FkCiAgICAgIHN5c3RlbWN0bCByZXN0YXJ0IGRvY2tlcgoKICAgICAgZWNobyAiSW5mbzogZG9ja2VyIGFuZCBzeXN0ZW0gZW52aXJvbm1lbnQgYXJlIG5vdyBjb25maWd1cmVkIHRvIHVzZSB0aGUgcHJveHkgc2V0dGluZ3MiCiAgICB9
      vgpu-license: NVIDIA-client-configuration-token
      nvidia-portal-api-key: API-key-from-NVIDIA-licensing-portal
      password: password-for-vmware-user
    Note: user-data is the base64-encoded value of the following cloud-init code:
    #cloud-config
    write_files:
    - path: /opt/dlvm/dl_app.sh
      permissions: '0755'
      content: |
        #!/bin/bash
        set -eu
        source /opt/dlvm/utils.sh
        trap 'error_exit "Unexpected error occurs at dl workload"' ERR
        set_proxy "http" "https" "socks5"
    
        DEFAULT_REG_URI="nvcr.io"
        REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
    
        if [[ -z "$REGISTRY_URI_PATH" ]]; then
          # If REGISTRY_URI_PATH is null or empty, use the default value
          REGISTRY_URI_PATH=$DEFAULT_REG_URI
          echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
        fi
        
        # If REGISTRY_URI_PATH contains '/', extract the URI part
        if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
          REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
        else
          REGISTRY_URI=$REGISTRY_URI_PATH
        fi
      
        REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
        REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
        if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
          docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
        else
          echo "Warning: the registry's username and password are invalid, Skipping Docker login."
        fi
    
        docker run -d --gpus all -p 8888:8888 $REGISTRY_URI_PATH/nvidia/pytorch:23.10-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
    
    - path: /opt/dlvm/utils.sh
      permissions: '0755'
      content: |
        #!/bin/bash
        error_exit() {
          echo "Error: $1" >&2
          vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
          exit 1
        }
    
        check_protocol() {
          local proxy_url=$1
          shift
          local supported_protocols=("$@")
          if [[ -n "${proxy_url}" ]]; then
            local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
            if [ -z "$protocol" ]; then
              echo "No specific protocol provided. Skipping protocol check."
              return 0
            fi
            local protocol_included=false
            for var in "${supported_protocols[@]}"; do
              if [[ "${protocol}" == "${var}" ]]; then
                protocol_included=true
                break
              fi
            done
            if [[ "${protocol_included}" == false ]]; then
              error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
            fi
          fi
        }
    
        # $@: list of supported protocols
        set_proxy() {
          local supported_protocols=("$@")
    
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
    
          HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
          HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
          if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
            echo "Info: The config-json was parsed, but no proxy settings were found."
            return 0
          fi
    
          check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
          check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
    
          if ! grep -q 'http_proxy' /etc/environment; then
            echo "export http_proxy=${HTTP_PROXY_URL}
            export https_proxy=${HTTPS_PROXY_URL}
            export HTTP_PROXY=${HTTP_PROXY_URL}
            export HTTPS_PROXY=${HTTPS_PROXY_URL}
            export no_proxy=localhost,127.0.0.1" >> /etc/environment
            source /etc/environment
          fi
          
          # Configure Docker to use a proxy
          mkdir -p /etc/systemd/system/docker.service.d
          echo "[Service]
          Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
          Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
          Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf
          systemctl daemon-reload
          systemctl restart docker
    
          echo "Info: docker and system environment are now configured to use the proxy settings"
        }
    ---
    apiVersion: vmoperator.vmware.com/v1alpha1
    kind: VirtualMachineService
    metadata:
      name: example-dl-vm
      namespace: example-dl-vm-namespace
    spec:
      ports:
      - name: ssh
        port: 22
        protocol: TCP
        targetPort: 22
      - name: jupyterlab
        port: 8888
        protocol: TCP
        targetPort: 8888
      selector:
        app: example-dl-app
      type: LoadBalancer
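
    If you modify the cloud-init code from the note above, re-encode it and replace the user-data value in the ConfigMap. A minimal sketch, assuming the plain-text script is saved as cloud-init.txt and GNU coreutils is available:
    # -w 0 disables line wrapping so the value stays on a single YAML line
    base64 -w 0 cloud-init.txt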
  4. Switch to the context of the vSphere namespace created by the cloud administrator.
    For example, for a namespace named example-dl-vm-namespace:
    kubectl config use-context example-dl-vm-namespace
  5. Deploy the deep learning virtual machine.
    kubectl apply -f example-dl-vm.yaml
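    If the specification is valid, kubectl confirms each of the three objects it creates, with output similar to the following (illustrative):
    virtualmachine.vmoperator.vmware.com/example-dl-vm created
    configmap/example-dl-vm-config created
    virtualmachineservice.vmoperator.vmware.com/example-dl-vm created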
  6. Verify that the virtual machine has been created by running these commands.
    kubectl get vm -n example-dl-vm-namespace
    kubectl describe virtualmachine example-dl-vm
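    The output lists the virtual machine together with its power state, similar to the following sketch (the printed columns vary across vm-operator versions):
    NAME            POWER-STATE   AGE
    example-dl-vm   poweredOn     5m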
  7. Ping the IP address of the virtual machine assigned by the requested network service.

    To get the public address and the ports for access to the deep learning virtual machine, get the details of the load balancer service that has been created.

    kubectl get services
    NAME            TYPE           CLUSTER-IP              EXTERNAL-IP          PORT(S)                       AGE
    example-dl-vm   LoadBalancer   <internal-ip-address>   <public-IPaddress>   22:30473/TCP,8888:32180/TCP   9m40s
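
    Because the JupyterLab container in this example publishes port 8888 and starts with an empty token, a quick reachability check is an HTTP request against the external IP (placeholder shown; substitute the EXTERNAL-IP value from the output above):
    curl http://<public-IPaddress>:8888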
    

Results

The vGPU guest driver and the specified DL workload are installed the first time you power on the deep learning virtual machine.
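
One way to confirm the setup, assuming you can reach the VM over SSH as the vmware user with the password set in the ConfigMap, is to check the driver and the workload container from inside the guest:

    # Verify that the NVIDIA vGPU guest driver is loaded
    nvidia-smi
    # Verify that the DL workload container started by cloud-init is running
    sudo docker ps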

What to do next