When you deploy a Deep Learning VM in vSphere IaaS control plane by using kubectl, or directly on a vSphere cluster, you must fill in the custom virtual machine properties.

For information about the Deep Learning VM images in VMware Private AI Foundation with NVIDIA, see About the Deep Learning VM Images in VMware Private AI Foundation with NVIDIA.

Deep Learning VM OVF Properties

When you deploy a Deep Learning VM, you must fill in custom virtual machine properties to automate the configuration of the Linux operating system, the deployment of the vGPU guest driver, and the deployment and configuration of NGC containers for the DL workloads.

The latest Deep Learning VM image has the following OVF properties:

The properties are grouped by category; for each parameter, the label shown in vSphere Client appears in parentheses, followed by the description.

Category: Base OS properties

  instance-id (Instance ID)
    Required. Unique instance ID for the virtual machine.

    An instance ID uniquely identifies an instance. When an instance ID changes, cloud-init treats the virtual machine as a new instance and runs the cloud-init process again.

  hostname (Hostname)
    Required. Host name of the appliance.

  seedfrom (URL to seed instance data from)
    Optional. URL to pull the user-data parameter value and metadata from.

  public-keys (SSH public key)
    If specified, the instance populates the default user's SSH authorized_keys with this value.

  user-data (Encoded user-data)
    A set of scripts or other metadata that is inserted into the virtual machine at provisioning time.

    This property holds the actual contents of the cloud-init script. This value must be base64 encoded.

  password (Default user password)
    Required. Password for the default vmware user account.

Category: vGPU driver installation

  vgpu-license (vGPU license)
    Required. NVIDIA vGPU client configuration token. The token is saved in the /etc/nvidia/ClientConfigToken/client_configuration_token.tok file.

  nvidia-portal-api-key (NVIDIA Portal API key)
    Required in a connected environment. API key downloaded from the NVIDIA Licensing Portal. The key is required for the vGPU guest driver installation.

  vgpu-host-driver-version (vGPU host driver version)
    Directly install this version of the vGPU guest driver.

  vgpu-url (URL for air-gapped vGPU downloads)
    Required in a disconnected environment. URL to download the vGPU guest driver from. For information about the required configuration of the local web server, see Preparing VMware Cloud Foundation for Private AI Workload Deployment.

Category: DL workload automation

  registry-uri (Registry URI)
    Required in a disconnected environment, or if you plan to use a private container registry to avoid downloading images from the Internet. URI of a private container registry that holds the deep learning workload container images.

    Required if you reference a private registry in user-data or image-oneliner.

  registry-user (Registry user name)
    Required if you use a private container registry that requires basic authentication.

  registry-passwd (Registry password)
    Required if you use a private container registry that requires basic authentication.

  registry-2-uri (Secondary registry URI)
    Required if you use a second, Docker-based private container registry that requires basic authentication.

    For example, when you deploy a Deep Learning VM with the NVIDIA RAG DL workload preinstalled, a pgvector image is downloaded from Docker Hub. You can use the registry-2- parameters to work around a pull rate limit for docker.io.

  registry-2-user (Secondary registry user name)
    Required if you use a second private container registry.

  registry-2-passwd (Secondary registry password)
    Required if you use a second private container registry.

  image-oneliner (Encoded one-line command)
    A one-line bash command that runs at virtual machine provisioning time. This value must be base64 encoded.

    You can use this property to specify the DL workload container that you want to deploy, such as PyTorch or TensorFlow. See Deep Learning Workloads in VMware Private AI Foundation with NVIDIA.

    Caution: Avoid using both user-data and image-oneliner.

  docker-compose-uri (Encoded Docker compose file)
    Required if a Docker compose file is needed to start the DL workload container. Contents of the docker-compose.yaml file that is inserted into the virtual machine at provisioning time, after the virtual machine starts with the GPU enabled. This value must be base64 encoded.

  config-json (Encoded config.json)
    Contents of a configuration file with additional details, such as proxy settings (http_proxy and https_proxy), whether DCGM Exporter metrics are exposed publicly (export_dcgm_to_public), and whether JupyterLab authentication is enabled (enable_jupyter_auth), as read by the cloud-init scripts in this document. This value must be base64 encoded.

  conda-environment-install (Conda environment install)
    Comma-separated list of Conda environments to install automatically after the virtual machine deployment completes.

    Available environments: pytorch2.3_py3.12, pytorch1.13.1_py3.10, tf2.16.1_py3.12, and tf1.15.5_py3.7.
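
The user-data and image-oneliner values must be base64 encoded before you set them. As a minimal sketch, assuming a Linux workstation with GNU coreutils and a cloud-init script saved as cloud-init.yaml (both file names are placeholders), the encoding could look like this:

  # Encode a cloud-init script for the user-data property (cloud-init.yaml is a placeholder name)
  base64 -w 0 cloud-init.yaml > user-data.b64

  # Encode a one-line command for the image-oneliner property
  echo -n "docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8" | base64 -w 0

  # Sanity check: decoding must reproduce the original script, starting with "#cloud-config"
  base64 --decode user-data.b64 | head -n 1

The same round trip applies to the docker-compose-uri and config-json properties.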

Deep Learning Workloads in VMware Private AI Foundation with NVIDIA

You can provision a deep learning virtual machine with a supported deep learning (DL) workload in addition to its embedded components. The DL workloads are downloaded from the NVIDIA NGC catalog, and are GPU-optimized and validated by NVIDIA and VMware by Broadcom.

For an overview of the deep learning VM images, see About the Deep Learning VM Images in VMware Private AI Foundation with NVIDIA.

CUDA Sample

You can use a deep learning VM with running CUDA samples to explore vector addition, gravitational n-body simulation, and other samples on a virtual machine. See the CUDA Samples page.

After it starts, the deep learning VM runs a CUDA sample workload to test the vGPU guest driver. You can examine the test output in the /var/log/dl.log file.

Table 1. CUDA Sample Container Image

Component: Container image
Description: nvcr.io/nvidia/k8s/cuda-sample:ngc_image_tag
  For example: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8

For information about the CUDA sample container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs

To deploy a CUDA sample workload, set the OVF properties for the deep learning VM as follows:
  • Use one of the following properties, which are specific to the CUDA sample image.
    • cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          set_proxy "http" "https" "socks5"
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
          
          deploy_dcgm_exporter
      
          echo "Info: running the vectoradd CUDA container"
          CUDA_SAMPLE_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/cuda-sample"
          CUDA_SAMPLE_VERSION="ngc_image_tag"
          docker run -d $CUDA_SAMPLE_IMAGE:$CUDA_SAMPLE_VERSION
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
        
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

      For example, for vectoradd-cuda11.7.1-ubi8, specify the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICBzZXRfcHJveHkgImh0dHAiICJodHRwcyIgInNvY2tzNSIKICAgIHRyYXAgJ2Vycm9yX2V4aXQgIlVuZXhwZWN0ZWQgZXJyb3Igb2NjdXJzIGF0IGRsIHdvcmtsb2FkIicgRVJSCiAgICBERUZBVUxUX1JFR19VUkk9Im52Y3IuaW8iCiAgICBSRUdJU1RSWV9VUklfUEFUSD0kKGdyZXAgcmVnaXN0cnktdXJpIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKCiAgICBpZiBbWyAteiAiJFJFR0lTVFJZX1VSSV9QQVRIIiBdXTsgdGhlbgogICAgICAjIElmIFJFR0lTVFJZX1VSSV9QQVRIIGlzIG51bGwgb3IgZW1wdHksIHVzZSB0aGUgZGVmYXVsdCB2YWx1ZQogICAgICBSRUdJU1RSWV9VUklfUEFUSD0kREVGQVVMVF9SRUdfVVJJCiAgICAgIGVjaG8gIlJFR0lTVFJZX1VSSV9QQVRIIHdhcyBlbXB0eS4gVXNpbmcgZGVmYXVsdDogJFJFR0lTVFJZX1VSSV9QQVRIIgogICAgZmkKICAgIAogICAgIyBJZiBSRUdJU1RSWV9VUklfUEFUSCBjb250YWlucyAnLycsIGV4dHJhY3QgdGhlIFVSSSBwYXJ0CiAgICBpZiBbWyAkUkVHSVNUUllfVVJJX1BBVEggPT0gKiIvIiogXV07IHRoZW4KICAgICAgUkVHSVNUUllfVVJJPSQoZWNobyAiJFJFR0lTVFJZX1VSSV9QQVRIIiB8IGN1dCAtZCcvJyAtZjEpCiAgICBlbHNlCiAgICAgIFJFR0lTVFJZX1VSST0kUkVHSVNUUllfVVJJX1BBVEgKICAgIGZpCiAgCiAgICBSRUdJU1RSWV9VU0VSTkFNRT0kKGdyZXAgcmVnaXN0cnktdXNlciAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICBSRUdJU1RSWV9QQVNTV09SRD0kKGdyZXAgcmVnaXN0cnktcGFzc3dkIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgIGlmIFtbIC1uICIkUkVHSVNUUllfVVNFUk5BTUUiICYmIC1uICIkUkVHSVNUUllfUEFTU1dPUkQiIF1dOyB0aGVuCiAgICAgIGRvY2tlciBsb2dpbiAtdSAkUkVHSVNUUllfVVNFUk5BTUUgLXAgJFJFR0lTVFJZX1BBU1NXT1JEICRSRUdJU1RSWV9VUkkKICAgIGVsc2UKICAgICAgZWNobyAiV2FybmluZzogdGhlIHJlZ2lzdHJ5J3MgdXNlcm5hbWUgYW5kIHBhc3N3b3JkIGFyZSBpbnZhbGlkLCBTa2lwcGluZyBEb2NrZXIgbG9naW4uIgogICAgZmkKICAgIAogICAgZGVwbG95X2RjZ21fZXhwb3J0ZXIKCiAgICBlY2hvICJJbmZvOiBydW5uaW5nIHRoZSB2ZWN0b3JhZGQgQ1VEQSBjb250YWluZXIiCiAgICBDVURBX1NBTVBMRV9JTUFHRT0iJFJFR0lTVFJZX1VSSV9QQVRIL252aWRpYS9rOHMvY3VkYS1zYW1wbGUiCiAgICBDVURBX1NBTVBMRV9WRVJTSU9OPSJ2ZWN0b3JhZGQtY3VkYTExLjcuMS11Ymk4IgogICAgZG9ja2VyIHJ1biAtZCAkQ1VEQV9TQU1QTEVfSU1BR0U6JENVREFfU0FNUExFX1ZFUlNJT04KCi0gcGF0aDogL29wdC9kbHZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1cHBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgc
HJvdG9jb2xzCiAgICBzZXRfcHJveHkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLXogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQogIAogICAgICBjaGVja19wcm90b2NvbCAiJHtIVFRQX1BST1hZX1VSTH0iICIke3N1cHBvcnRlZF9wcm90b2NvbHNbQF19IgogICAgICBjaGVja19wcm90b2NvbCAiJHtIVFRQU19QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKCiAgICAgIGlmICEgZ3JlcCAtcSAnaHR0cF9wcm94eScgL2V0Yy9lbnZpcm9ubWVudDsgdGhlbgogICAgICAgIHN1ZG8gYmFzaCAtYyAnZWNobyAiZXhwb3J0IGh0dHBfcHJveHk9JHtIVFRQX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgaHR0cHNfcHJveHk9JHtIVFRQU19QUk9YWV9VUkx9CiAgICAgICAgZXhwb3J0IEhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUFNfUFJPWFk9JHtIVFRQU19QUk9YWV9VUkx9CiAgICAgICAgZXhwb3J0IG5vX3Byb3h5PWxvY2FsaG9zdCwxMjcuMC4wLjEiID4+IC9ldGMvZW52aXJvbm1lbnQnCiAgICAgICAgc291cmNlIC9ldGMvZW52aXJvbm1lbnQKICAgICAgZmkKICAgICAgCiAgICAgICMgQ29uZmlndXJlIERvY2tlciB0byB1c2UgYSBwcm94eQogICAgICBzdWRvIG1rZGlyIC1wIC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZAogICAgICBzdWRvIGJhc2ggLWMgJ2VjaG8gIltTZXJ2aWNlXQogICAgICBFbnZpcm9ubWVudD1cIkhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIkhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiTk9fUFJPWFk9bG9jYWxob3N0LDEyNy4wLjAuMVwiIiA+IC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZC9wcm94eS5jb25mJwogICAgICBzdWRvIHN5c3RlbWN0bCBkYWVtb24tcmVsb2FkCiAgICAgIHN1ZG8gc3lzdGVtY3RsIHJlc3RhcnQgZG9ja2VyCgogICAgICBlY2hvICJJbmZvOiBkb2NrZXIgYW5kIHN5c3RlbSBlbnZpcm9ubWVudCBhcmUgbm93IGNvbmZpZ3VyZWQgdG8gdXNlIHRoZSBwcm94eSBzZXR0aW5ncyIKICAgIH0KCiAgICBkZXBsb3lfZGNnbV9leHBvcnRlcigpIHsKICAgICAgQ09ORklHX0pTT05fQkFTRTY0PSQoZ3JlcCAnY29uZmlnLWpzb24nIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgICAgQ09ORklHX0pTT049JChlY2hvICR7Q09ORklHX0pTT05fQkFTRTY0fSB8IGJhc2U2NCAtLWRlY29kZSkKICAgICAgRENHTV9FWFBPUlRfUFVCTElDPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5leHBvcnRfZGNnbV90b19wdWJsaWMgLy8gZW1wdHknKQoKICAgICAgRENHTV9FWFBPUlRFUl9JTUFHRT0iJFJFR0lTVFJZX1VSSV9QQVRIL252aWRpYS9rOHMvZGNnbS1leHBvcnRlciIKICAgICAgRENHTV9FWFBPUlRFUl9WRVJTSU9OPSIzLjIuNS0zLjEuOC11YnVudHUyMi4wNCIKICAgICAgaWYgWyAteiAiJHtEQ0dNX0VYUE9SVF9QVUJMSUN9IiBdIHx8IFsgIiR7RENHTV9FWFBPUlRfUFVCTElDfSIgIT0gInRydWUiIF07IHRoZW4KICAgICAgICBlY2hvICJJbmZvOiBsYXVuY2hpbmcgRENHTSBFeHBvcnRlciB0byBjb2xsZWN0IHZHUFUgbWV0cmljcywgbGlzdGVuaW5nIG9ubHkgb24gbG9jYWxob3N0ICgxMjcuMC4wLjE6OTQwMCkiCiAgICAgICAgZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tY2FwLWFkZCBTWVNfQURNSU4gLXAgMTI3LjAuMC4xOjk0MDA6OTQwMCAkRENHTV9FWFBPUlRFUl9JTUFHRTokRENHTV9FWFBPUlRFUl9WRVJTSU9OCiAgICAgIGVsc2UKICAgICAgICBlY2hvICJJbmZvOiBsYXVuY2hpbmcgRENHTSBFeHBvcnRlciB0byBjb2xsZWN0IHZHUFUgbWV0cmljcywgZXhwb3NlZCBvbiBhbGwgbmV0d29yayBpbnRlcmZhY2VzICgwLjAuMC4wOjk0MDApIgogICAgICAgIGRvY2tlciBydW4gLWQgLS1ncHVzIGFsbCAtLWNhcC1hZGQgU1lTX0FETUlOIC1wIDk0MDA6OTQwMCAkRENHTV9FWFBPUlRFUl9JTUFHRTokRENHTV9FWFBPUlRFUl9WRVJTSU9OCiAgICAgIGZpCiAgICB9

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          set_proxy "http" "https" "socks5"
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
          
          deploy_dcgm_exporter
      
          echo "Info: running the vectoradd CUDA container"
          CUDA_SAMPLE_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/cuda-sample"
          CUDA_SAMPLE_VERSION="vectoradd-cuda11.7.1-ubi8"
          docker run -d $CUDA_SAMPLE_IMAGE:$CUDA_SAMPLE_VERSION
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
        
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
    • One-liner image. Encode it in base64 format.
      docker run -d nvcr.io/nvidia/k8s/cuda-sample:ngc_image_tag

      For example, for vectoradd-cuda11.7.1-ubi8, specify the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCBudmNyLmlvL252aWRpYS9rOHMvY3VkYS1zYW1wbGU6dmVjdG9yYWRkLWN1ZGExMS43LjEtdWJpOA==

      which corresponds to the following script in plain-text format:

      docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Specify values for the properties required for a disconnected environment, as needed.

See Deep Learning VM OVF Properties.

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed and that the license is allocated, run the following command:

    nvidia-smi -q | grep -i license
  • cloud-init script logs in /var/log/dl.log. For a quick way to verify the workload output, see the sketch after this list.
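
As a quick check that the sample workload itself completed, the minimal sketch below, assuming SSH access to the VM and the vectoradd-cuda11.7.1-ubi8 image tag used in this section, inspects the container and the cloud-init log. The expected "Test PASSED" line comes from the CUDA vectorAdd sample, not from this documentation:

  # Find the vectoradd container started by /opt/dlvm/dl_app.sh and read its output
  CONTAINER_ID=$(sudo docker ps -aq --filter "ancestor=nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8")
  sudo docker logs "$CONTAINER_ID"    # the vectorAdd sample is expected to print "Test PASSED"

  # Review the cloud-init workload log referenced above
  tail -n 50 /var/log/dl.log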

PyTorch

You can use a deep learning VM with a PyTorch library to explore conversational AI, natural language processing (NLP), and other types of AI models on a virtual machine. See the PyTorch page.

After it starts, the deep learning VM starts a JupyterLab instance with the PyTorch packages installed and configured.

Table 2. PyTorch Container Image

Component: Container image
Description: nvcr.io/nvidia/pytorch-pb24h1:ngc_image_tag
  For example: nvcr.io/nvidia/pytorch-pb24h1:24.03.02-py3

For information about the PyTorch container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs

To deploy a PyTorch workload, set the OVF properties for the deep learning VM as follows:
  • Use one of the following properties, which are specific to the PyTorch image.
    • cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
          enableJupyterAuth=$(echo "${CONFIG_JSON}" | jq -r '.enable_jupyter_auth // empty')
      
          if [ -z "${enableJupyterAuth}" ] || [ "${enableJupyterAuth}" == true ]; then
            # Generate a random jupyter token
            TOKEN=$(python3 -c "import secrets; print(secrets.token_hex(32))")
            # Set the token to guestinfo
            vmtoolsd --cmd "info-set guestinfo.dlworkload.jupyterlab.token $TOKEN"
            echo "Info: JupyterLab notebook access token, $TOKEN"
          else
            TOKEN=""
          fi
      
          echo "Info: running the PyTorch container"
          PYTORCH_IMAGE="$REGISTRY_URI_PATH/nvidia/pytorch-pb24h1"
          PYTORCH_VERSION="ngc_image_tag"
          docker run -d --gpus all -p 8888:8888 $PYTORCH_IMAGE:$PYTORCH_VERSION /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

      For example, for pytorch-pb24h1:24.03.02-py3, specify the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCgogICAgREVGQVVMVF9SRUdfVVJJPSJudmNyLmlvIgogICAgUkVHSVNUUllfVVJJX1BBVEg9JChncmVwIHJlZ2lzdHJ5LXVyaSAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCgogICAgaWYgW1sgLXogIiRSRUdJU1RSWV9VUklfUEFUSCIgXV07IHRoZW4KICAgICAgIyBJZiBSRUdJU1RSWV9VUklfUEFUSCBpcyBudWxsIG9yIGVtcHR5LCB1c2UgdGhlIGRlZmF1bHQgdmFsdWUKICAgICAgUkVHSVNUUllfVVJJX1BBVEg9JERFRkFVTFRfUkVHX1VSSQogICAgICBlY2hvICJSRUdJU1RSWV9VUklfUEFUSCB3YXMgZW1wdHkuIFVzaW5nIGRlZmF1bHQ6ICRSRUdJU1RSWV9VUklfUEFUSCIKICAgIGZpCiAgICAKICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggY29udGFpbnMgJy8nLCBleHRyYWN0IHRoZSBVUkkgcGFydAogICAgaWYgW1sgJFJFR0lTVFJZX1VSSV9QQVRIID09ICoiLyIqIF1dOyB0aGVuCiAgICAgIFJFR0lTVFJZX1VSST0kKGVjaG8gIiRSRUdJU1RSWV9VUklfUEFUSCIgfCBjdXQgLWQnLycgLWYxKQogICAgZWxzZQogICAgICBSRUdJU1RSWV9VUkk9JFJFR0lTVFJZX1VSSV9QQVRICiAgICBmaQogIAogICAgUkVHSVNUUllfVVNFUk5BTUU9JChncmVwIHJlZ2lzdHJ5LXVzZXIgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgUkVHSVNUUllfUEFTU1dPUkQ9JChncmVwIHJlZ2lzdHJ5LXBhc3N3ZCAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICBpZiBbWyAtbiAiJFJFR0lTVFJZX1VTRVJOQU1FIiAmJiAtbiAiJFJFR0lTVFJZX1BBU1NXT1JEIiBdXTsgdGhlbgogICAgICBkb2NrZXIgbG9naW4gLXUgJFJFR0lTVFJZX1VTRVJOQU1FIC1wICRSRUdJU1RSWV9QQVNTV09SRCAkUkVHSVNUUllfVVJJCiAgICBlbHNlCiAgICAgIGVjaG8gIldhcm5pbmc6IHRoZSByZWdpc3RyeSdzIHVzZXJuYW1lIGFuZCBwYXNzd29yZCBhcmUgaW52YWxpZCwgU2tpcHBpbmcgRG9ja2VyIGxvZ2luLiIKICAgIGZpCgogICAgZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC1wIDg4ODg6ODg4OCAkUkVHSVNUUllfVVJJX1BBVEgvbnZpZGlhL3B5dG9yY2g6MjMuMTAtcHkzIC91c3IvbG9jYWwvYmluL2p1cHl0ZXIgbGFiIC0tYWxsb3ctcm9vdCAtLWlwPSogLS1wb3J0PTg4ODggLS1uby1icm93c2VyIC0tTm90ZWJvb2tBcHAudG9rZW49JycgLS1Ob3RlYm9va0FwcC5hbGxvd19vcmlnaW49JyonIC0tbm90ZWJvb2stZGlyPS93b3Jrc3BhY2UKCi0gcGF0aDogL29wdC9kbHZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1cHBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgcHJvdG9jb2xzCiAgICBzZXRfcHJve
HkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLXogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQoKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUF9QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUFNfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCgogICAgICBpZiAhIGdyZXAgLXEgJ2h0dHBfcHJveHknIC9ldGMvZW52aXJvbm1lbnQ7IHRoZW4KICAgICAgICBlY2hvICJleHBvcnQgaHR0cF9wcm94eT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBodHRwc19wcm94eT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgbm9fcHJveHk9bG9jYWxob3N0LDEyNy4wLjAuMSIgPj4gL2V0Yy9lbnZpcm9ubWVudAogICAgICAgIHNvdXJjZSAvZXRjL2Vudmlyb25tZW50CiAgICAgIGZpCiAgICAgIAogICAgICAjIENvbmZpZ3VyZSBEb2NrZXIgdG8gdXNlIGEgcHJveHkKICAgICAgbWtkaXIgLXAgL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kCiAgICAgIGVjaG8gIltTZXJ2aWNlXQogICAgICBFbnZpcm9ubWVudD1cIkhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIkhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiTk9fUFJPWFk9bG9jYWxob3N0LDEyNy4wLjAuMVwiIiA+IC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZC9wcm94eS5jb25mCiAgICAgIHN5c3RlbWN0bCBkYWVtb24tcmVsb2FkCiAgICAgIHN5c3RlbWN0bCByZXN0YXJ0IGRvY2tlcgoKICAgICAgZWNobyAiSW5mbzogZG9ja2VyIGFuZCBzeXN0ZW0gZW52aXJvbm1lbnQgYXJlIG5vdyBjb25maWd1cmVkIHRvIHVzZSB0aGUgcHJveHkgc2V0dGluZ3MiCiAgICB9

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
          enableJupyterAuth=$(echo "${CONFIG_JSON}" | jq -r '.enable_jupyter_auth // empty')
      
          if [ -z "${enableJupyterAuth}" ] || [ "${enableJupyterAuth}" == true ]; then
            # Generate a random jupyter token
            TOKEN=$(python3 -c "import secrets; print(secrets.token_hex(32))")
            # Set the token to guestinfo
            vmtoolsd --cmd "info-set guestinfo.dlworkload.jupyterlab.token $TOKEN"
            echo "Info: JupyterLab notebook access token, $TOKEN"
          else
            TOKEN=""
          fi
      
          echo "Info: running the PyTorch container"
          PYTORCH_IMAGE="$REGISTRY_URI_PATH/nvidia/pytorch-pb24h1"
          PYTORCH_VERSION="24.03.02-py3"
          docker run -d --gpus all -p 8888:8888 $PYTORCH_IMAGE:$PYTORCH_VERSION /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
    • One-liner image. Encode it in base64 format.
      docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch-pb24h1:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace

      For example, for pytorch-pb24h1:24.03.02-py3, specify the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtcCA4ODg4Ojg4ODggbnZjci5pby9udmlkaWEvcHl0b3JjaC1wYjI0aDE6MjQuMDMuMDItcHkzIC91c3IvbG9jYWwvYmluL2p1cHl0ZXIgbGFiIC0tYWxsb3ctcm9vdCAtLWlwPSogLS1wb3J0PTg4ODggLS1uby1icm93c2VyIC0tTm90ZWJvb2tBcHAudG9rZW49JycgLS1Ob3RlYm9va0FwcC5hbGxvd19vcmlnaW49JyonIC0tbm90ZWJvb2stZGlyPS93b3Jrc3BhY2U=

      which corresponds to the following script in plain-text format:

      docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch-pb24h1:24.03.02-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Specify values for the properties required for a disconnected environment, as needed.

See Deep Learning VM OVF Properties.
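
The cloud-init script above reads the optional enable_jupyter_auth key from the config-json property. As a minimal sketch, assuming you want to turn off JupyterLab token authentication and route traffic through a proxy (the proxy URL is a placeholder), the config.json contents could be:

  {
    "enable_jupyter_auth": false,
    "http_proxy": "http://proxy.example.com:3128",
    "https_proxy": "http://proxy.example.com:3128"
  }

Encode the file and paste the result into the config-json property; the keys match what the scripts above read with jq:

  base64 -w 0 config.json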

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, run the nvidia-smi command.

  • cloud-init script logs in /var/log/dl.log.
  • PyTorch container.

    To verify that the PyTorch container is running, run the sudo docker ps -a and sudo docker logs container_id commands.

  • JupyterLab instance accessible at http://dl_vm_ip:8888. To retrieve the access token, see the sketch after this list.

    In the JupyterLab terminal, verify that the following functionality is available in the notebook:

    • To verify that JupyterLab can access the vGPU resource, run nvidia-smi.
    • To verify that the PyTorch-related packages are installed, run pip show.
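
When enable_jupyter_auth is left at its default, dl_app.sh publishes the generated token to guestinfo, as shown in the script above. A minimal sketch for retrieving it from inside the guest, assuming VMware Tools is running, is:

  # Read the token that dl_app.sh stored with info-set (key taken from the script above)
  vmtoolsd --cmd "info-get guestinfo.dlworkload.jupyterlab.token"

  # Then open http://dl_vm_ip:8888/?token=<token_value> in a browser,
  # where <token_value> is the output of the command above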

TensorFlow

You can use a Deep Learning VM with a TensorFlow library to explore conversational AI, natural language processing (NLP), and other types of AI models on a virtual machine. See the TensorFlow page.

After it starts, the deep learning VM starts a JupyterLab instance with the TensorFlow packages installed and configured.

Table 3. TensorFlow Container Image

Component: Container image
Description: nvcr.io/nvidia/tensorflow-pb24h1:ngc_image_tag
  For example: nvcr.io/nvidia/tensorflow-pb24h1:24.03.02-tf2-py3

For information about the TensorFlow container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs

To deploy a TensorFlow workload, set the OVF properties for the deep learning VM as follows:
  • Use one of the following properties, which are specific to the TensorFlow image.
    • cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
          enableJupyterAuth=$(echo "${CONFIG_JSON}" | jq -r '.enable_jupyter_auth // empty')
      
          if [ -z "${enableJupyterAuth}" ] || [ "${enableJupyterAuth}" == true ]; then
            # Generate a random jupyter token
            TOKEN=$(python3 -c "import secrets; print(secrets.token_hex(32))")
            # Set the token to guestinfo
            vmtoolsd --cmd "info-set guestinfo.dlworkload.jupyterlab.token $TOKEN"
            echo "Info: JupyterLab notebook access token, $TOKEN"
          else
            TOKEN=""
          fi
      
          echo "Info: running the Tensorflow container"    
          TENSORFLOW_IMAGE="$REGISTRY_URI_PATH/nvidia/tensorflow-pb24h1"
          TENSORFLOW_VERSION="ngc_image_tag"
          docker run -d --gpus all -p 8888:8888 $TENSORFLOW_IMAGE:$TENSORFLOW_VERSION /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
          
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

      For example, for tensorflow-pb24h1:24.03.02-tf2-py3, specify the base64 encoding of the following plain-text script:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
          enableJupyterAuth=$(echo "${CONFIG_JSON}" | jq -r '.enable_jupyter_auth // empty')
      
          if [ -z "${enableJupyterAuth}" ] || [ "${enableJupyterAuth}" == true ]; then
            # Generate a random jupyter token
            TOKEN=$(python3 -c "import secrets; print(secrets.token_hex(32))")
            # Set the token to guestinfo
            vmtoolsd --cmd "info-set guestinfo.dlworkload.jupyterlab.token $TOKEN"
            echo "Info: JupyterLab notebook access token, $TOKEN"
          else
            TOKEN=""
          fi
      
          echo "Info: running the Tensorflow container"    
          TENSORFLOW_IMAGE="$REGISTRY_URI_PATH/nvidia/tensorflow-pb24h1"
          TENSORFLOW_VERSION="24.03.02-tf2-py3"
          docker run -d --gpus all -p 8888:8888 $TENSORFLOW_IMAGE:$TENSORFLOW_VERSION /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
          
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
    • One-liner image. Encode it in base64 format, as shown in the encoding sketch after this list.
      docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow-pb24h1:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace

      For example, for tensorflow-pb24h1:24.03.02-tf2-py3, specify the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtcCA4ODg4Ojg4ODggbnZjci5pby9udmlkaWEvdGVuc29yZmxvdy1wYjI0aDE6MjQuMDMuMDItdGYyLXB5MyAvdXNyL2xvY2FsL2Jpbi9qdXB5dGVyIGxhYiAtLWFsbG93LXJvb3QgLS1pcD0qIC0tcG9ydD04ODg4IC0tbm8tYnJvd3NlciAtLU5vdGVib29rQXBwLnRva2VuPScnIC0tTm90ZWJvb2tBcHAuYWxsb3dfb3JpZ2luPScqJyAtLW5vdGVib29rLWRpcj0vd29ya3NwYWNl

      which corresponds to the following script in plain-text format:

      docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow-pb24h1:24.03.02-tf2-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Specify values for the properties required for a disconnected environment as needed.

Vedere Proprietà OVF di Deep Learning VM.
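
Either property must reach the VM base64-encoded. A minimal encoding sketch on a Linux workstation, assuming the script is saved as cloud-init.yaml (the file names are illustrative):

    # Encode without line wrapping so the value fits into a single OVF property
    base64 -w 0 cloud-init.yaml > user-data.b64

    # Round-trip check: decoding must reproduce the original script exactly
    base64 --decode user-data.b64 | diff - cloud-init.yaml && echo "round-trip OK"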

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the virtual machine over SSH and run the nvidia-smi command.

  • cloud-init script logs in /var/log/dl.log.
  • TensorFlow container.

    To verify that the TensorFlow container is running, run the sudo docker ps -a and sudo docker logs container_id commands.

  • JupyterLab instance that you can access at http://dl_vm_ip:8888; a sketch for reading the access token back from guestinfo follows this list.

    In the JupyterLab terminal, verify that the following capabilities are available in the notebook:

    • To verify that JupyterLab can access the vGPU resource, run nvidia-smi.
    • To verify that the TensorFlow-related packages are installed, run pip show.
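
When Jupyter authentication is enabled, the dl_app.sh script publishes the generated token in guestinfo. A minimal sketch for reading it back from inside the VM over SSH (dl_vm_ip is a placeholder):

    # Read the token that dl_app.sh stored with "info-set"
    TOKEN=$(vmtoolsd --cmd "info-get guestinfo.dlworkload.jupyterlab.token")
    echo "http://dl_vm_ip:8888/lab?token=${TOKEN}"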

DCGM Exporter

You can use a deep learning VM with Data Center GPU Manager (DCGM) Exporter to monitor GPU health and obtain metrics on the GPUs used by a DL workload, by using NVIDIA DCGM, Prometheus, and Grafana.

See the DCGM Exporter page.

On a Deep Learning VM instance, run the DCGM Exporter container alongside a DL workload that performs the AI operations. After the Deep Learning VM starts, DCGM Exporter is ready to collect vGPU metrics and to export the data to another application for further monitoring and visualization. You can run the monitored DL workload as part of the cloud-init process or from the command line after the virtual machine starts.
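
For example, once the exporter container is up you can spot-check its scrape endpoint from inside the VM. A sketch; the metric names come from the exporter's default counter set:

    # With export_dcgm_to_public unset, the exporter listens on 127.0.0.1 only
    curl -s http://127.0.0.1:9400/metrics | grep -E 'DCGM_FI_DEV_(GPU_UTIL|FB_USED)'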

Table 4. DCGM Exporter Container Image
Component Description
Container image
nvcr.io/nvidia/k8s/dcgm-exporter:ngc_image_tag

For example:

nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04

For information about the DCGM Exporter container images supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs. To deploy a DCGM Exporter workload, you must set the OVF properties for the deep learning VM as follows:
  • Use one of the following properties, specific to the DCGM Exporter image.
    • cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          echo "Info: running the DCGM Export container"
          deploy_dcgm_exporter
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="ngc_image_tag"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

      For example, for a deep learning VM with a preinstalled dcgm-exporter:3.2.5-3.1.8-ubuntu22.04 DCGM Exporter instance, specify the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCiAgICAKICAgIERFRkFVTFRfUkVHX1VSST0ibnZjci5pbyIKICAgIFJFR0lTVFJZX1VSSV9QQVRIPSQoZ3JlcCByZWdpc3RyeS11cmkgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQoKICAgIGlmIFtbIC16ICIkUkVHSVNUUllfVVJJX1BBVEgiIF1dOyB0aGVuCiAgICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggaXMgbnVsbCBvciBlbXB0eSwgdXNlIHRoZSBkZWZhdWx0IHZhbHVlCiAgICAgIFJFR0lTVFJZX1VSSV9QQVRIPSRERUZBVUxUX1JFR19VUkkKICAgICAgZWNobyAiUkVHSVNUUllfVVJJX1BBVEggd2FzIGVtcHR5LiBVc2luZyBkZWZhdWx0OiAkUkVHSVNUUllfVVJJX1BBVEgiCiAgICBmaQogICAgCiAgICAjIElmIFJFR0lTVFJZX1VSSV9QQVRIIGNvbnRhaW5zICcvJywgZXh0cmFjdCB0aGUgVVJJIHBhcnQKICAgIGlmIFtbICRSRUdJU1RSWV9VUklfUEFUSCA9PSAqIi8iKiBdXTsgdGhlbgogICAgICBSRUdJU1RSWV9VUkk9JChlY2hvICIkUkVHSVNUUllfVVJJX1BBVEgiIHwgY3V0IC1kJy8nIC1mMSkKICAgIGVsc2UKICAgICAgUkVHSVNUUllfVVJJPSRSRUdJU1RSWV9VUklfUEFUSAogICAgZmkKICAKICAgIFJFR0lTVFJZX1VTRVJOQU1FPSQoZ3JlcCByZWdpc3RyeS11c2VyIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgIFJFR0lTVFJZX1BBU1NXT1JEPSQoZ3JlcCByZWdpc3RyeS1wYXNzd2QgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgaWYgW1sgLW4gIiRSRUdJU1RSWV9VU0VSTkFNRSIgJiYgLW4gIiRSRUdJU1RSWV9QQVNTV09SRCIgXV07IHRoZW4KICAgICAgZG9ja2VyIGxvZ2luIC11ICRSRUdJU1RSWV9VU0VSTkFNRSAtcCAkUkVHSVNUUllfUEFTU1dPUkQgJFJFR0lTVFJZX1VSSQogICAgZWxzZQogICAgICBlY2hvICJXYXJuaW5nOiB0aGUgcmVnaXN0cnkncyB1c2VybmFtZSBhbmQgcGFzc3dvcmQgYXJlIGludmFsaWQsIFNraXBwaW5nIERvY2tlciBsb2dpbi4iCiAgICBmaQoKICAgIGVjaG8gIkluZm86IHJ1bm5pbmcgdGhlIERDR00gRXhwb3J0IGNvbnRhaW5lciIKICAgIGRlcGxveV9kY2dtX2V4cG9ydGVyCgotIHBhdGg6IC9vcHQvZGx2bS91dGlscy5zaAogIHBlcm1pc3Npb25zOiAnMDc1NScKICBjb250ZW50OiB8CiAgICAjIS9iaW4vYmFzaAogICAgZXJyb3JfZXhpdCgpIHsKICAgICAgZWNobyAiRXJyb3I6ICQxIiA+JjIKICAgICAgdm10b29sc2QgLS1jbWQgImluZm8tc2V0IGd1ZXN0aW5mby52bXNlcnZpY2UuYm9vdHN0cmFwLmNvbmRpdGlvbiBmYWxzZSwgRExXb3JrbG9hZEZhaWx1cmUsICQxIgogICAgICBleGl0IDEKICAgIH0KCiAgICBjaGVja19wcm90b2NvbCgpIHsKICAgICAgbG9jYWwgcHJveHlfdXJsPSQxCiAgICAgIHNoaWZ0CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCiAgICAgIGlmIFtbIC1uICIke3Byb3h5X3VybH0iIF1dOyB0aGVuCiAgICAgICAgbG9jYWwgcHJvdG9jb2w9JChlY2hvICIke3Byb3h5X3VybH0iIHwgYXdrIC1GICc6Ly8nICd7aWYgKE5GID4gMSkgcHJpbnQgJDE7IGVsc2UgcHJpbnQgIiJ9JykKICAgICAgICBpZiBbIC16ICIkcHJvdG9jb2wiIF07IHRoZW4KICAgICAgICAgIGVjaG8gIk5vIHNwZWNpZmljIHByb3RvY29sIHByb3ZpZGVkLiBTa2lwcGluZyBwcm90b2NvbCBjaGVjay4iCiAgICAgICAgICByZXR1cm4gMAogICAgICAgIGZpCiAgICAgICAgbG9jYWwgcHJvdG9jb2xfaW5jbHVkZWQ9ZmFsc2UKICAgICAgICBmb3IgdmFyIGluICIke3N1cHBvcnRlZF9wcm90b2NvbHNbQF19IjsgZG8KICAgICAgICAgIGlmIFtbICIke3Byb3RvY29sfSIgPT0gIiR7dmFyfSIgXV07IHRoZW4KICAgICAgICAgICAgcHJvdG9jb2xfaW5jbHVkZWQ9dHJ1ZQogICAgICAgICAgICBicmVhawogICAgICAgICAgZmkKICAgICAgICBkb25lCiAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2xfaW5jbHVkZWR9IiA9PSBmYWxzZSBdXTsgdGhlbgogICAgICAgICAgZXJyb3JfZXhpdCAiVW5zdXBwb3J0ZWQgcHJvdG9jb2w6ICR7cHJvdG9jb2x9LiBTdXBwb3J0ZWQgcHJvdG9jb2xzIGFyZTogJHtzdXBwb3J0ZWRfcHJvdG9jb2xzWypdfSIKICAgICAgICBmaQogICAgICBmaQogICAgfQoKICAgICMgJEA6IGxpc3Qgb2Ygc3VwcG9ydGVkIHByb3RvY29scwogICAgc2V0X3Byb3h5KCkgewogICAgICBsb2NhbCBzdXBwb3J0ZWRfcHJvdG9jb2xzPSgiJEAiKQoKICAgICAgQ09ORklHX0pTT05fQkFTRTY0PSQoZ3JlcCAnY29uZmlnLWpzb24nIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKI
CAgICAgQ09ORklHX0pTT049JChlY2hvICR7Q09ORklHX0pTT05fQkFTRTY0fSB8IGJhc2U2NCAtLWRlY29kZSkKCiAgICAgIEhUVFBfUFJPWFlfVVJMPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5odHRwX3Byb3h5IC8vIGVtcHR5JykKICAgICAgSFRUUFNfUFJPWFlfVVJMPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5odHRwc19wcm94eSAvLyBlbXB0eScpCiAgICAgIGlmIFtbICQ/IC1uZSAwIHx8ICgteiAiJHtIVFRQX1BST1hZX1VSTH0iICYmIC16ICIke0hUVFBTX1BST1hZX1VSTH0iKSBdXTsgdGhlbgogICAgICAgIGVjaG8gIkluZm86IFRoZSBjb25maWctanNvbiB3YXMgcGFyc2VkLCBidXQgbm8gcHJveHkgc2V0dGluZ3Mgd2VyZSBmb3VuZC4iCiAgICAgICAgcmV0dXJuIDAKICAgICAgZmkKCiAgICAgIGNoZWNrX3Byb3RvY29sICIke0hUVFBfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCiAgICAgIGNoZWNrX3Byb3RvY29sICIke0hUVFBTX1BST1hZX1VSTH0iICIke3N1cHBvcnRlZF9wcm90b2NvbHNbQF19IgoKICAgICAgaWYgISBncmVwIC1xICdodHRwX3Byb3h5JyAvZXRjL2Vudmlyb25tZW50OyB0aGVuCiAgICAgICAgc3VkbyBiYXNoIC1jICdlY2hvICJleHBvcnQgaHR0cF9wcm94eT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBodHRwc19wcm94eT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgbm9fcHJveHk9bG9jYWxob3N0LDEyNy4wLjAuMSIgPj4gL2V0Yy9lbnZpcm9ubWVudCcKICAgICAgICBzb3VyY2UgL2V0Yy9lbnZpcm9ubWVudAogICAgICBmaQogICAgICAKICAgICAgIyBDb25maWd1cmUgRG9ja2VyIHRvIHVzZSBhIHByb3h5CiAgICAgIHN1ZG8gbWtkaXIgLXAgL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kCiAgICAgIHN1ZG8gYmFzaCAtYyAnZWNobyAiW1NlcnZpY2VdCiAgICAgIEVudmlyb25tZW50PVwiSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiSFRUUFNfUFJPWFk9JHtIVFRQU19QUk9YWV9VUkx9XCIKICAgICAgRW52aXJvbm1lbnQ9XCJOT19QUk9YWT1sb2NhbGhvc3QsMTI3LjAuMC4xXCIiID4gL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kL3Byb3h5LmNvbmYnCiAgICAgIHN1ZG8gc3lzdGVtY3RsIGRhZW1vbi1yZWxvYWQKICAgICAgc3VkbyBzeXN0ZW1jdGwgcmVzdGFydCBkb2NrZXIKCgogICAgICBlY2hvICJJbmZvOiBkb2NrZXIgYW5kIHN5c3RlbSBlbnZpcm9ubWVudCBhcmUgbm93IGNvbmZpZ3VyZWQgdG8gdXNlIHRoZSBwcm94eSBzZXR0aW5ncyIKICAgIH0KCiAgICBkZXBsb3lfZGNnbV9leHBvcnRlcigpIHsKICAgICAgQ09ORklHX0pTT05fQkFTRTY0PSQoZ3JlcCAnY29uZmlnLWpzb24nIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgICAgQ09ORklHX0pTT049JChlY2hvICR7Q09ORklHX0pTT05fQkFTRTY0fSB8IGJhc2U2NCAtLWRlY29kZSkKICAgICAgRENHTV9FWFBPUlRfUFVCTElDPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5leHBvcnRfZGNnbV90b19wdWJsaWMgLy8gZW1wdHknKQoKICAgICAgRENHTV9FWFBPUlRFUl9JTUFHRT0iJFJFR0lTVFJZX1VSSV9QQVRIL252aWRpYS9rOHMvZGNnbS1leHBvcnRlciIKICAgICAgRENHTV9FWFBPUlRFUl9WRVJTSU9OPSIzLjIuNS0zLjEuOC11YnVudHUyMi4wNCIKICAgICAgaWYgWyAteiAiJHtEQ0dNX0VYUE9SVF9QVUJMSUN9IiBdIHx8IFsgIiR7RENHTV9FWFBPUlRfUFVCTElDfSIgIT0gInRydWUiIF07IHRoZW4KICAgICAgICBlY2hvICJJbmZvOiBsYXVuY2hpbmcgRENHTSBFeHBvcnRlciB0byBjb2xsZWN0IHZHUFUgbWV0cmljcywgbGlzdGVuaW5nIG9ubHkgb24gbG9jYWxob3N0ICgxMjcuMC4wLjE6OTQwMCkiCiAgICAgICAgZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tY2FwLWFkZCBTWVNfQURNSU4gLXAgMTI3LjAuMC4xOjk0MDA6OTQwMCAkRENHTV9FWFBPUlRFUl9JTUFHRTokRENHTV9FWFBPUlRFUl9WRVJTSU9OCiAgICAgIGVsc2UKICAgICAgICBlY2hvICJJbmZvOiBsYXVuY2hpbmcgRENHTSBFeHBvcnRlciB0byBjb2xsZWN0IHZHUFUgbWV0cmljcywgZXhwb3NlZCBvbiBhbGwgbmV0d29yayBpbnRlcmZhY2VzICgwLjAuMC4wOjk0MDApIgogICAgICAgIGRvY2tlciBydW4gLWQgLS1ncHVzIGFsbCAtLWNhcC1hZGQgU1lTX0FETUlOIC1wIDk0MDA6OTQwMCAkRENHTV9FWFBPUlRFUl9JTUFHRTokRENHTV9FWFBPUlRFUl9WRVJTSU9OCiAgICAgIGZpCiAgICB9
      which corresponds to the following script in plain-text format:
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          echo "Info: running the DCGM Export container"
          deploy_dcgm_exporter
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
      Note: In the cloud-init script, you can also add the instructions for running the DL workload whose GPU performance you want to measure with DCGM Exporter.
    • One-liner image. Encode it in base64 format.
      docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:ngc_image_tag

      For example, for dcgm-exporter:3.2.5-3.1.8-ubuntu22.04, specify the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tY2FwLWFkZCBTWVNfQURNSU4gLS1ybSAtcCA5NDAwOjk0MDAgbnZjci5pby9udmlkaWEvazhzL2RjZ20tZXhwb3J0ZXI6My4yLjUtMy4xLjgtdWJ1bnR1MjIuMDQ=

      which corresponds to the following script in plain-text format:

      docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Specify values for the properties required for a disconnected environment as needed.

See Deep Learning VM OVF Properties. A sketch of building the config-json property follows.
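
The export_dcgm_to_public switch read by deploy_dcgm_exporter travels in the config-json OVF property as base64-encoded JSON. A sketch of building that value; the proxy URL is illustrative, and the keys are the ones the scripts above read with jq:

    cat > config.json << 'EOF'
    {
      "export_dcgm_to_public": "true",
      "http_proxy": "http://proxy.example.com:3128",
      "https_proxy": "http://proxy.example.com:3128"
    }
    EOF
    # Paste the single-line output into the config-json property
    base64 -w 0 config.json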

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the virtual machine over SSH and run the nvidia-smi command.

  • cloud-init script logs in /var/log/dl.log.
  • DCGM Exporter that you can access at http://dl_vm_ip:9400.

Next, run a DL workload on the deep learning VM, and visualize the data on another virtual machine by using Prometheus at http://visualization_vm_ip:9090 and Grafana at http://visualization_vm_ip:3000.

Running a DL Workload on the Deep Learning VM

Run the DL workload for which you want to collect vGPU metrics and export the data to another application for further monitoring and visualization.

  1. Log in to the deep learning VM as vmware over SSH.
  2. Run the container for the DL workload, pulling it from the NVIDIA NGC catalog or from a local container registry.

    For example, run the following command to run the tensorflow-pb24h1:24.03.02-tf2-py3 image from NVIDIA NGC:

    docker run -d --gpus all -p 8888:8888 nvcr.io/nvidia/tensorflow-pb24h1:24.03.02-tf2-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
  3. Start using the DL workload for AI development. An optional vGPU visibility check follows this procedure.
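
An optional check that the workload container actually sees the vGPU; take the container ID from the docker ps output:

    sudo docker ps --format '{{.ID}} {{.Image}}'
    sudo docker exec container_id nvidia-smi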

Installing Prometheus and Grafana

You can visualize and monitor the vGPU metrics from the DCGM Exporter virtual machine on a virtual machine that runs Prometheus and Grafana.

  1. Create a visualization virtual machine with Docker Community Engine installed.
  2. Connect to the virtual machine over SSH and create a YAML file for Prometheus.
    $ cat > prometheus.yml << EOF
    global:
      scrape_interval: 15s
      external_labels:
        monitor: 'codelab-monitor'
    scrape_configs:
      - job_name: 'dcgm'
        scrape_interval: 5s
        metrics_path: /metrics
        static_configs:
          - targets: ['dl_vm_with_dcgm_exporter_ip:9400']
    EOF
    
  3. Create a data path.
    $ mkdir grafana_data prometheus_data && chmod 777 grafana_data prometheus_data
    
  4. Create a Docker Compose file to install Prometheus and Grafana.
    $ cat > compose.yaml << EOF
    services:
      prometheus:
        image: prom/prometheus:v2.47.2
        container_name: "prometheus0"
        restart: always
        ports:
          - "9090:9090"
        volumes:
          - "./prometheus.yml:/etc/prometheus/prometheus.yml"
          - "./prometheus_data:/prometheus"
      grafana:
        image: grafana/grafana:10.2.0-ubuntu
        container_name: "grafana0"
        ports:
          - "3000:3000"
        restart: always
        volumes:
          - "./grafana_data:/var/lib/grafana"
    EOF
    
  5. Start the Prometheus and Grafana containers. An optional health check follows this procedure.
    $ sudo docker compose up -d
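
An optional health check after the containers start; the endpoints are Prometheus's standard readiness probe and Grafana's health API:

    # Prometheus answers with a short ready message once it can serve queries
    curl -s http://localhost:9090/-/ready
    # Grafana returns a small JSON document that reports the database state
    curl -s http://localhost:3000/api/health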
    

Viewing vGPU Metrics in Prometheus

You can access Prometheus at http://visualization-vm-ip:9090. You can view the following vGPU information in the Prometheus UI:

Information UI Section
Raw vGPU metrics from the deep learning VM Status > Targets

To view the raw vGPU metrics from the deep learning VM, click the endpoint entry.

Graph expressions
  1. On the main navigation bar, click the Graph tab.
  2. Enter an expression and click Execute. An example query follows.
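
For example, you can enter the following query on the Graph tab, or run it over the Prometheus HTTP API as sketched below. DCGM_FI_DEV_GPU_UTIL is one of DCGM Exporter's default metric names, and visualization-vm-ip is a placeholder:

    curl -s 'http://visualization-vm-ip:9090/api/v1/query' \
      --data-urlencode 'query=avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)'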

For more information about using Prometheus, see the Prometheus documentation.

Viewing Metrics in Grafana

Set Prometheus as a data source for Grafana, and visualize the vGPU metrics from the deep learning VM on a dashboard.

  1. Log in to Grafana at http://visualization-vm-ip:3000 by using the default user name admin and password admin.
  2. Add Prometheus as the first data source, connecting to visualization-vm-ip on port 9090 (a scripted alternative follows this procedure).
  3. Create a dashboard with the vGPU metrics.
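
Step 2 can also be scripted against Grafana's HTTP API instead of the UI. A sketch, using the default credentials from step 1 (visualization-vm-ip is a placeholder):

    curl -s -X POST http://admin:admin@visualization-vm-ip:3000/api/datasources \
      -H 'Content-Type: application/json' \
      -d '{"name":"Prometheus","type":"prometheus","url":"http://visualization-vm-ip:9090","access":"proxy","isDefault":true}'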

For more information about configuring a dashboard by using a Prometheus data source, see the Grafana documentation.

Triton Inference Server

You can use a Deep Learning VM with Triton Inference Server to load a model repository and serve inference requests.

See the Triton Inference Server page.
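
The cloud-init script below mounts /home/vmware/model_repository into the container as /models. A sketch of the layout Triton expects there, plus a readiness probe against its HTTP port; the model and version names are illustrative:

    # One directory per model, with numbered version subdirectories and a config.pbtxt
    mkdir -p /home/vmware/model_repository/simple_model/1
    # Place the model file, for example model.onnx, under .../simple_model/1/

    # After the container starts, an HTTP 200 response here means the server is ready
    curl -v http://dl_vm_ip:8000/v2/health/ready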

Table 5. Triton Inference Server Container Image
Component Description
Container image
nvcr.io/nvidia/tritonserver-pb24h1:ngc_image_tag

For example:

nvcr.io/nvidia/tritonserver-pb24h1:24.03.02-py3

For information about the Triton Inference Server container images supported for Deep Learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs. To deploy a Triton Inference Server workload, you must set the OVF properties for the Deep Learning VM as follows:
  • Use one of the following properties, specific to the Triton Inference Server image.
    • cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          echo "Info: running the Triton Inference Server container"
          TRITON_IMAGE="$REGISTRY_URI_PATH/nvidia/tritonserver-pb24h1"
          TRITON_VERSION="24.03.02-py3"
          docker run -d --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/vmware/model_repository:/models $TRITON_IMAGE:$TRITON_VERSION tritonserver --model-repository=/models --model-control-mode=poll
          
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

      For example, for tritonserver-pb24h1:24.03.02-py3, specify the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCgogICAgREVGQVVMVF9SRUdfVVJJPSJudmNyLmlvIgogICAgUkVHSVNUUllfVVJJX1BBVEg9JChncmVwIHJlZ2lzdHJ5LXVyaSAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCgogICAgaWYgW1sgLXogIiRSRUdJU1RSWV9VUklfUEFUSCIgXV07IHRoZW4KICAgICAgIyBJZiBSRUdJU1RSWV9VUklfUEFUSCBpcyBudWxsIG9yIGVtcHR5LCB1c2UgdGhlIGRlZmF1bHQgdmFsdWUKICAgICAgUkVHSVNUUllfVVJJX1BBVEg9JERFRkFVTFRfUkVHX1VSSQogICAgICBlY2hvICJSRUdJU1RSWV9VUklfUEFUSCB3YXMgZW1wdHkuIFVzaW5nIGRlZmF1bHQ6ICRSRUdJU1RSWV9VUklfUEFUSCIKICAgIGZpCiAgICAKICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggY29udGFpbnMgJy8nLCBleHRyYWN0IHRoZSBVUkkgcGFydAogICAgaWYgW1sgJFJFR0lTVFJZX1VSSV9QQVRIID09ICoiLyIqIF1dOyB0aGVuCiAgICAgIFJFR0lTVFJZX1VSST0kKGVjaG8gIiRSRUdJU1RSWV9VUklfUEFUSCIgfCBjdXQgLWQnLycgLWYxKQogICAgZWxzZQogICAgICBSRUdJU1RSWV9VUkk9JFJFR0lTVFJZX1VSSV9QQVRICiAgICBmaQogIAogICAgUkVHSVNUUllfVVNFUk5BTUU9JChncmVwIHJlZ2lzdHJ5LXVzZXIgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgUkVHSVNUUllfUEFTU1dPUkQ9JChncmVwIHJlZ2lzdHJ5LXBhc3N3ZCAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICBpZiBbWyAtbiAiJFJFR0lTVFJZX1VTRVJOQU1FIiAmJiAtbiAiJFJFR0lTVFJZX1BBU1NXT1JEIiBdXTsgdGhlbgogICAgICBkb2NrZXIgbG9naW4gLXUgJFJFR0lTVFJZX1VTRVJOQU1FIC1wICRSRUdJU1RSWV9QQVNTV09SRCAkUkVHSVNUUllfVVJJCiAgICBlbHNlCiAgICAgIGVjaG8gIldhcm5pbmc6IHRoZSByZWdpc3RyeSdzIHVzZXJuYW1lIGFuZCBwYXNzd29yZCBhcmUgaW52YWxpZCwgU2tpcHBpbmcgRG9ja2VyIGxvZ2luLiIKICAgIGZpCgogICAgZGVwbG95X2RjZ21fZXhwb3J0ZXIKCiAgICBlY2hvICJJbmZvOiBydW5uaW5nIHRoZSBUcml0b24gSW5mZXJlbmNlIFNlcnZlciBjb250YWluZXIiCiAgICBUUklUT05fSU1BR0U9IiRSRUdJU1RSWV9VUklfUEFUSC9udmlkaWEvdHJpdG9uc2VydmVyLXBiMjRoMSIKICAgIFRSSVRPTl9WRVJTSU9OPSIyNC4wMy4wMi1weTMiCiAgICBkb2NrZXIgcnVuIC1kIC0tZ3B1cyBhbGwgLXAgODAwMDo4MDAwIC1wIDgwMDE6ODAwMSAtcCA4MDAyOjgwMDIgLXYgL2hvbWUvdm13YXJlL21vZGVsX3JlcG9zaXRvcnk6L21vZGVscyAkVFJJVE9OX0lNQUdFOiRUUklUT05fVkVSU0lPTiB0cml0b25zZXJ2ZXIgLS1tb2RlbC1yZXBvc2l0b3J5PS9tb2RlbHMgLS1tb2RlbC1jb250cm9sLW1vZGU9cG9sbAogICAgCi0gcGF0aDogL29wdC9kbHZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1c
HBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgcHJvdG9jb2xzCiAgICBzZXRfcHJveHkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLXogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQoKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUF9QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUFNfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCgogICAgICBpZiAhIGdyZXAgLXEgJ2h0dHBfcHJveHknIC9ldGMvZW52aXJvbm1lbnQ7IHRoZW4KICAgICAgICBzdWRvIGJhc2ggLWMgJ2VjaG8gImV4cG9ydCBodHRwX3Byb3h5PSR7SFRUUF9QUk9YWV9VUkx9CiAgICAgICAgZXhwb3J0IGh0dHBzX3Byb3h5PSR7SFRUUFNfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQX1BST1hZPSR7SFRUUF9QUk9YWV9VUkx9CiAgICAgICAgZXhwb3J0IEhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBub19wcm94eT1sb2NhbGhvc3QsMTI3LjAuMC4xIiA+PiAvZXRjL2Vudmlyb25tZW50JwogICAgICAgIHNvdXJjZSAvZXRjL2Vudmlyb25tZW50CiAgICAgIGZpCiAgICAgIAogICAgICAjIENvbmZpZ3VyZSBEb2NrZXIgdG8gdXNlIGEgcHJveHkKICAgICAgc3VkbyBta2RpciAtcCAvZXRjL3N5c3RlbWQvc3lzdGVtL2RvY2tlci5zZXJ2aWNlLmQKICAgICAgc3VkbyBiYXNoIC1jICdlY2hvICJbU2VydmljZV0KICAgICAgRW52aXJvbm1lbnQ9XCJIVFRQX1BST1hZPSR7SFRUUF9QUk9YWV9VUkx9XCIKICAgICAgRW52aXJvbm1lbnQ9XCJIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIk5PX1BST1hZPWxvY2FsaG9zdCwxMjcuMC4wLjFcIiIgPiAvZXRjL3N5c3RlbWQvc3lzdGVtL2RvY2tlci5zZXJ2aWNlLmQvcHJveHkuY29uZicKICAgICAgc3VkbyBzeXN0ZW1jdGwgZGFlbW9uLXJlbG9hZAogICAgICBzdWRvIHN5c3RlbWN0bCByZXN0YXJ0IGRvY2tlcgoKICAgICAgZWNobyAiSW5mbzogZG9ja2VyIGFuZCBzeXN0ZW0gZW52aXJvbm1lbnQgYXJlIG5vdyBjb25maWd1cmVkIHRvIHVzZSB0aGUgcHJveHkgc2V0dGluZ3MiCiAgICB9CgogICAgZGVwbG95X2RjZ21fZXhwb3J0ZXIoKSB7CiAgICAgIENPTkZJR19KU09OX0JBU0U2ND0kKGdyZXAgJ2NvbmZpZy1qc29uJyAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICAgIENPTkZJR19KU09OPSQoZWNobyAke0NPTkZJR19KU09OX0JBU0U2NH0gfCBiYXNlNjQgLS1kZWNvZGUpCiAgICAgIERDR01fRVhQT1JUX1BVQkxJQz0kKGVjaG8gIiR7Q09ORklHX0pTT059IiB8IGpxIC1yICcuZXhwb3J0X2RjZ21fdG9fcHVibGljIC8vIGVtcHR5JykKCiAgICAgIERDR01fRVhQT1JURVJfSU1BR0U9IiRSRUdJU1RSWV9VUklfUEFUSC9udmlkaWEvazhzL2RjZ20tZXhwb3J0ZXIiCiAgICAgIERDR01fRVhQT1JURVJfVkVSU0lPTj0iMy4yLjUtMy4xLjgtdWJ1bnR1MjIuMDQiCiAgICAgIGlmIFsgLXogIiR7RENHTV9FWFBPUlRfUFVCTElDfSIgXSB8fCBbICIke0RDR01fRVhQT1JUX1BVQkxJQ30iICE9ICJ0cnVlIiBdOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogbGF1bmNoaW5nIERDR00gRXhwb3J0ZXIgdG8gY29sbGVjdCB2R1BVIG1ldHJpY3MsIGxpc3RlbmluZyBvbmx5IG9uIGxvY2FsaG9zdCAoMTI3LjAuMC4xOjk0MDApIgogICAgICAgIGRvY2tlciBydW4gLWQgLS1ncHVzIGFsbCAtLWNhcC1hZGQgU1lTX0FETUlOIC1wIDEyNy4wLjAuMTo5NDAwOjk0MDAgJERDR01fRVhQT1JURVJfSU1BR0U6JERDR01fRVhQT1JURVJfVkVSU0lPTgogICAgICBlbHNlCiAgICAgICAgZWNobyAiSW5mbzogbGF1bmNoaW5nIERDR00gRXhwb3J0ZXIgdG8gY29sbGVjdCB2R1BVIG1ldHJpY3MsIGV4cG9zZWQgb24gYWxsIG5ldHdvcmsgaW50ZXJmYWNlcyAoMC4wLjAuMDo5NDAwKSIKICAgICAgICBkb2NrZXIgcnVuIC1kIC0tZ3B1cyBhbGwg
LS1jYXAtYWRkIFNZU19BRE1JTiAtcCA5NDAwOjk0MDAgJERDR01fRVhQT1JURVJfSU1BR0U6JERDR01fRVhQT1JURVJfVkVSU0lPTgogICAgICBmaQogICAgfQ==

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          echo "Info: running the Triton Inference Server container"
          TRITON_IMAGE="$REGISTRY_URI_PATH/nvidia/tritonserver-pb24h1"
          TRITON_VERSION="24.03.02-py3"
          docker run -d --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/vmware/model_repository:/models $TRITON_IMAGE:$TRITON_VERSION tritonserver --model-repository=/models --model-control-mode=poll
          
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
    • One-liner image encoded in base64 format
      docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver-pb24h1:ngc_image_tag tritonserver --model-repository=/models --model-control-mode=poll

      For example, for tritonserver:24.03.02-py3, specify the following script in base64 format (a sketch for producing such an encoding follows this list):

      ZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tcm0gLXA4MDAwOjgwMDAgLXA4MDAxOjgwMDEgLXA4MDAyOjgwMDIgLXYgL2hvbWUvdm13YXJlL21vZGVsX3JlcG9zaXRvcnk6L21vZGVscyBudmNyLmlvL252aWRpYS90cml0b25zZXJ2ZXItcGIyNGgxOjI0LjAzLjAyLXB5MyB0cml0b25zZXJ2ZXIgLS1tb2RlbC1yZXBvc2l0b3J5PS9tb2RlbHMgLS1tb2RlbC1jb250cm9sLW1vZGU9cG9sbA==

      which corresponds to the following script in plain-text format:

      docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver-pb24h1:24.03.02-py3 tritonserver --model-repository=/models --model-control-mode=poll
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Specify values for the properties required for a disconnected environment, as needed.
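
To produce the base64 value for the one-liner yourself, you can encode the plain-text command on any Linux host. A minimal sketch using GNU coreutils (-w 0 keeps the output on a single line):

    echo -n 'docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver-pb24h1:24.03.02-py3 tritonserver --model-repository=/models --model-control-mode=poll' | base64 -w 0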

See Deep Learning VM OVF Properties.

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the virtual machine over SSH and run the nvidia-smi command.

  • cloud-init script logs in /var/log/dl.log.
  • Triton Inference Server container.

    To verify that the Triton Inference Server container is running, run the docker ps -a and docker logs container_id commands.

The model repository for Triton Inference Server is located in /home/vmware/model_repository. Initially, the model repository is empty, and the initial log of the Triton Inference Server instance shows that no models are loaded.
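
You can also ask Triton directly which models it has loaded. A minimal check from inside the VM, using the model repository index endpoint of Triton's HTTP API (the response lists the repository contents, which are initially empty):

    curl -s -X POST localhost:8000/v2/repository/index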

Creating a Model Repository

To load a model for model inference, perform the following steps:

  1. Create the model repository for your model.

    See the NVIDIA Triton Inference Server model repository documentation.

  2. Copy the model repository to /home/vmware/model_repository so that Triton Inference Server can load it.
    cp -r path_to_your_created_model_repository/* /home/vmware/model_repository/
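
    A minimal layout sketch for the steps above, assuming a hypothetical ONNX model named simple_sequence (the model name, version directory, and file names are illustrative; config.pbtxt declares the backend and the input/output tensors):

      mkdir -p /home/vmware/model_repository/simple_sequence/1
      cp your_model.onnx /home/vmware/model_repository/simple_sequence/1/model.onnx
      cp config.pbtxt /home/vmware/model_repository/simple_sequence/config.pbtxt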
    

Sending Model Inference Requests

  1. Verify that Triton Inference Server is healthy and that the models are ready by running this command in the Deep Learning VM console.
    curl -v localhost:8000/v2/health/ready
  2. Send a request to the model by running this command in the Deep Learning VM (a sketch of a full inference call follows this list).
    curl -v localhost:8000/v2/models/simple_sequence
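
An actual inference call goes to the model's /infer endpoint with a JSON payload. A minimal sketch, assuming hypothetical tensor names, shapes, and data types (the real values must match the model's config.pbtxt):

    curl -s -X POST localhost:8000/v2/models/simple_sequence/infer \
      -H 'Content-Type: application/json' \
      -d '{"inputs": [{"name": "INPUT0", "shape": [1, 16], "datatype": "INT32", "data": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}]}'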

For more information about using Triton Inference Server, see the NVIDIA Triton Inference Server model repository documentation.

NVIDIA RAG

You can use a Deep Learning VM to build Retrieval-Augmented Generation (RAG) solutions with a Llama model; the 24.08 sample below uses llama3-8b-instruct.

See the NVIDIA RAG Applications Docker Compose documentation (specific account permissions are required).

Table 6. NVIDIA RAG Container Image
Component Description
Container images and models docker-compose-nim-ms.yaml and rag-app-multiturn-chatbot/docker-compose.yaml in the sample NVIDIA RAG pipeline.

For information about the NVIDIA RAG container applications supported for Deep Learning VM, see the VMware Deep Learning VM Release Notes.

Required Inputs

To deploy an NVIDIA RAG workload, you must set the OVF properties for the Deep Learning VM as follows:
  • Enter a cloud-init script. Encode it in base64 format.

    For example, for NVIDIA RAG version 24.08, specify the following script, shown here in plain-text format:

#cloud-config
write_files:
- path: /opt/dlvm/dl_app.sh
  permissions: '0755'
  content: |
    #!/bin/bash
    set -eu
    source /opt/dlvm/utils.sh
    trap 'error_exit "Unexpected error occurs at dl workload"' ERR
    set_proxy "http" "https"
    
    sudo mkdir -p /opt/data/
    sudo chown vmware:vmware /opt/data
    sudo chmod -R 775 /opt/data
    cd /opt/data/

    cat <<EOF > /opt/data/config.json
    {
      "_comment_1": "This provides default support for RAG v24.08: llama3-8b-instruct model",
      "_comment_2": "Update llm_ms_gpu_id: specifies the GPU device ID to make available to the inference server when using multiple GPU",
      "_comment_3": "Update embedding_ms_gpu_id: specifies the GPU ID used for embedding model processing when using multiple GPU",
      "rag": {
        "org_name": "nvidia",
        "org_team_name": "aiworkflows",
        "rag_name": "ai-chatbot-docker-workflow",
        "rag_version": "24.08",
        "rag_app": "rag-app-multiturn-chatbot",
        "nim_model_profile": "auto",
        "llm_ms_gpu_id": "0",
        "embedding_ms_gpu_id": "0",
        "model_directory": "model-cache",
        "ngc_cli_version": "3.41.2"
      }
    }
    EOF

    CONFIG_JSON=$(cat "/opt/data/config.json")
    required_vars=("ORG_NAME" "ORG_TEAM_NAME" "RAG_NAME" "RAG_VERSION" "RAG_APP" "NIM_MODEL_PROFILE" "LLM_MS_GPU_ID" "EMBEDDING_MS_GPU_ID" "MODEL_DIRECTORY" "NGC_CLI_VERSION")

    # Extract rag values from /opt/data/config.json
    for index in "${!required_vars[@]}"; do
      key="${required_vars[$index]}"
      jq_query=".rag.${key,,} | select (.!=null)"
      value=$(echo "${CONFIG_JSON}" | jq -r "${jq_query}")
      if [[ -z "${value}" ]]; then 
        error_exit "${key} is required but not set."
      else
        eval ${key}=\""${value}"\"
      fi
    done

    # Read parameters from config-json to connect DSM PGVector on RAG
    CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
    CONFIG_JSON_PGVECTOR=$(echo "${CONFIG_JSON_BASE64}" | base64 -d)
    PGVECTOR_VALUE=$(echo ${CONFIG_JSON_PGVECTOR} | jq -r '.rag.pgvector')
    if [[ -n "${PGVECTOR_VALUE}" && "${PGVECTOR_VALUE}" != "null" ]]; then
      echo "Info: extract DSM PGVector parameters from config-json in XML"
      POSTGRES_USER=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $4}')
      POSTGRES_PASSWORD=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $5}')
      POSTGRES_HOST_IP=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $6}')
      POSTGRES_PORT_NUMBER=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $7}')
      POSTGRES_DB=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $8}')

      for var in POSTGRES_USER POSTGRES_PASSWORD POSTGRES_HOST_IP POSTGRES_PORT_NUMBER POSTGRES_DB; do
        if [ -z "${!var}" ]; then
          error_exit "${var} is not set."
        fi
      done
    fi

    gpu_info=$(nvidia-smi -L)
    echo "Info: the detected GPU info, $gpu_info"
    if [[ ${NIM_MODEL_PROFILE} == "auto" ]]; then 
      case "${gpu_info}" in
        *A100*)
          NIM_MODEL_PROFILE="751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c"
          echo "Info: GPU type A100 detected. Setting tensorrt_llm-A100-fp16-tp1-throughput as the default NIM model profile."
          ;;
        *H100*)
          NIM_MODEL_PROFILE="cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1"
          echo "Info: GPU type H100 detected. Setting tensorrt_llm-H100-fp16-tp1-throughput as the default NIM model profile."
          ;;
        *L40S*)
          NIM_MODEL_PROFILE="d8dd8af82e0035d7ca50b994d85a3740dbd84ddb4ed330e30c509e041ba79f80"
          echo "Info: GPU type L40S detected. Setting tensorrt_llm-L40S-fp16-tp1-throughput as the default NIM model profile."
          ;;
        *)
          NIM_MODEL_PROFILE="8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d"
          echo "Info: No supported GPU type detected (A100, H100, L40S). Setting vllm as the default NIM model profile."
          ;;
      esac
    else
      echo "Info: using the NIM model profile provided by the user, $NIM_MODEL_PROFILE"
    fi

    RAG_URI="${ORG_NAME}/${ORG_TEAM_NAME}/${RAG_NAME}:${RAG_VERSION}"
    RAG_FOLDER="${RAG_NAME}_v${RAG_VERSION}"
    NGC_CLI_URL="https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/${NGC_CLI_VERSION}/files/ngccli_linux.zip"

    if [ ! -f .initialize ]; then
      # clean up
      rm -rf compose.env ngc* ${RAG_NAME}* ${MODEL_DIRECTORY}* .initialize

      # install ngc-cli
      wget --content-disposition ${NGC_CLI_URL} -O ngccli_linux.zip && unzip -q ngccli_linux.zip
      export PATH=`pwd`/ngc-cli:${PATH}

      APIKEY=""
      DEFAULT_REG_URI="nvcr.io"

      REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      if [[ -z "${REGISTRY_URI_PATH}" ]]; then
        REGISTRY_URI_PATH=${DEFAULT_REG_URI}
        echo "Info: registry uri was empty. Using default: ${REGISTRY_URI_PATH}"
      fi

      if [[ "$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')" == *"${DEFAULT_REG_URI}"* ]]; then
        APIKEY=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      fi

      if [ -z "${APIKEY}" ]; then
          error_exit "No APIKEY found"
      fi

      # config ngc-cli
      mkdir -p ~/.ngc

      cat << EOF > ~/.ngc/config
      [CURRENT]
      apikey = ${APIKEY}
      format_type = ascii
      org = ${ORG_NAME}
      team = ${ORG_TEAM_NAME}
      ace = no-ace
    EOF
      
      # Extract registry URI if path contains '/'
      if [[ ${REGISTRY_URI_PATH} == *"/"* ]]; then
        REGISTRY_URI=$(echo "${REGISTRY_URI_PATH}" | cut -d'/' -f1)
      else
        REGISTRY_URI=${REGISTRY_URI_PATH}
      fi

      REGISTRY_USER=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')

      # Docker login if credentials are provided
      if [[ -n "${REGISTRY_USER}" && -n "${APIKEY}" ]]; then
        docker login -u ${REGISTRY_USER} -p ${APIKEY} ${REGISTRY_URI}
      else
        echo "Warning: the ${REGISTRY_URI} registry's username and password are invalid, Skipping Docker login."
      fi

      # DockerHub login for general components
      DOCKERHUB_URI=$(grep registry-2-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      DOCKERHUB_USERNAME=$(grep registry-2-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      DOCKERHUB_PASSWORD=$(grep registry-2-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')

      DOCKERHUB_URI=${DOCKERHUB_URI:-docker.io}
      if [[ -n "${DOCKERHUB_USERNAME}" && -n "${DOCKERHUB_PASSWORD}" ]]; then
        docker login -u ${DOCKERHUB_USERNAME} -p ${DOCKERHUB_PASSWORD} ${DOCKERHUB_URI}
      else
        echo "Warning: ${DOCKERHUB_URI} not logged in"
      fi

      # Download RAG files
      ngc registry resource download-version ${RAG_URI}

      mkdir -p /opt/data/${MODEL_DIRECTORY}

      # Update the docker-compose YAML files to correct the issue with GPU free/non-free status reporting
      /usr/bin/python3 -c "import yaml, json, sys; print(json.dumps(yaml.safe_load(sys.stdin.read())))" < "${RAG_FOLDER}/docker-compose-nim-ms.yaml"> docker-compose-nim-ms.json
      jq --arg profile "${NIM_MODEL_PROFILE}" \
         '.services."nemollm-inference".environment.NIM_MANIFEST_ALLOW_UNSAFE = "1" |
          .services."nemollm-inference".environment.NIM_MODEL_PROFILE = $profile |
          .services."nemollm-inference".deploy.resources.reservations.devices[0].device_ids = ["${LLM_MS_GPU_ID:-0}"] |
          del(.services."nemollm-inference".deploy.resources.reservations.devices[0].count)' docker-compose-nim-ms.json > temp.json && mv temp.json docker-compose-nim-ms.json
      /usr/bin/python3 -c "import yaml, json, sys; print(yaml.safe_dump(json.load(sys.stdin), default_flow_style=False, sort_keys=False))" < docker-compose-nim-ms.json > "${RAG_FOLDER}/docker-compose-nim-ms.yaml"
      rm -rf docker-compose-nim-ms.json

      # Update the docker-compose YAML files to configure PGVector as the default database
      /usr/bin/python3 -c "import yaml, json, sys; print(json.dumps(yaml.safe_load(sys.stdin.read())))" < "${RAG_FOLDER}/${RAG_APP}/docker-compose.yaml"> rag-app-multiturn-chatbot.json
      jq '.services."chain-server".environment.APP_VECTORSTORE_NAME = "pgvector" |
         .services."chain-server".environment.APP_VECTORSTORE_URL = "${POSTGRES_HOST_IP:-pgvector}:${POSTGRES_PORT_NUMBER:-5432}" |
         .services."chain-server".environment.POSTGRES_PASSWORD = "${POSTGRES_PASSWORD:-password}" |
         .services."chain-server".environment.POSTGRES_USER = "${POSTGRES_USER:-postgres}" |
         .services."chain-server".environment.POSTGRES_DB = "${POSTGRES_DB:-api}"' rag-app-multiturn-chatbot.json > temp.json && mv temp.json rag-app-multiturn-chatbot.json
      /usr/bin/python3 -c "import yaml, json, sys; print(yaml.safe_dump(json.load(sys.stdin), default_flow_style=False, sort_keys=False))" < rag-app-multiturn-chatbot.json > "${RAG_FOLDER}/${RAG_APP}/docker-compose.yaml"
      rm -rf rag-app-multiturn-chatbot.json

      # config compose.env
      cat << EOF > compose.env
      export MODEL_DIRECTORY="/opt/data/${MODEL_DIRECTORY}"
      export NGC_API_KEY=${APIKEY}
      export USERID=$(id -u)
      export LLM_MS_GPU_ID=${LLM_MS_GPU_ID}
      export EMBEDDING_MS_GPU_ID=${EMBEDDING_MS_GPU_ID}
    EOF

      if [[ -n "${PGVECTOR_VALUE}" && "${PGVECTOR_VALUE}" != "null" ]]; then 
        cat << EOF >> compose.env
        export POSTGRES_HOST_IP="${POSTGRES_HOST_IP}"
        export POSTGRES_PORT_NUMBER="${POSTGRES_PORT_NUMBER}"
        export POSTGRES_PASSWORD="${POSTGRES_PASSWORD}"
        export POSTGRES_USER="${POSTGRES_USER}"
        export POSTGRES_DB="${POSTGRES_DB}"
    EOF
      fi
    
      touch .initialize

      deploy_dcgm_exporter
    fi

    # start NGC RAG
    echo "Info: running the RAG application"
    source compose.env
    if [ -z "${PGVECTOR_VALUE}" ] || [ "${PGVECTOR_VALUE}" = "null" ]; then 
      echo "Info: running the pgvector container as the Vector Database"
      docker compose -f ${RAG_FOLDER}/${RAG_APP}/docker-compose.yaml --profile local-nim --profile pgvector up -d
    else
      echo "Info: using the provided DSM PGVector as the Vector Database"
      docker compose -f ${RAG_FOLDER}/${RAG_APP}/docker-compose.yaml --profile local-nim up -d
    fi
    
- path: /opt/dlvm/utils.sh
  permissions: '0755'
  content: |
    #!/bin/bash
    error_exit() {
      echo "Error: $1" >&2
      vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
      exit 1
    }

    check_protocol() {
      local proxy_url=$1
      shift
      local supported_protocols=("$@")
      if [[ -n "${proxy_url}" ]]; then
        local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
        if [ -z "$protocol" ]; then
          echo "No specific protocol provided. Skipping protocol check."
          return 0
        fi
        local protocol_included=false
        for var in "${supported_protocols[@]}"; do
          if [[ "${protocol}" == "${var}" ]]; then
            protocol_included=true
            break
          fi
        done
        if [[ "${protocol_included}" == false ]]; then
          error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
        fi
      fi
    }

    # $@: list of supported protocols
    set_proxy() {
      local supported_protocols=("$@")

      CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)

      HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
      HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
      if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
        echo "Info: The config-json was parsed, but no proxy settings were found."
        return 0
      fi

      check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
      check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"

      if ! grep -q 'http_proxy' /etc/environment; then
        sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
        export https_proxy=${HTTPS_PROXY_URL}
        export HTTP_PROXY=${HTTP_PROXY_URL}
        export HTTPS_PROXY=${HTTPS_PROXY_URL}
        export no_proxy=localhost,127.0.0.1" >> /etc/environment'
        source /etc/environment
      fi
      
      # Configure Docker to use a proxy
      sudo mkdir -p /etc/systemd/system/docker.service.d
      sudo bash -c 'echo "[Service]
      Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
      Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
      Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
      sudo systemctl daemon-reload
      sudo systemctl restart docker

      echo "Info: docker and system environment are now configured to use the proxy settings"
    }

    deploy_dcgm_exporter() {
      CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')

      DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
      DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
      if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
        echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
        docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
      else
        echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
        docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
      fi
    }

  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Specify values for the properties required for a disconnected environment, as needed.

See Deep Learning VM OVF Properties.

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the virtual machine over SSH and run the nvidia-smi command.

  • cloud-init script logs in /var/log/dl.log.

    To track the deployment progress, run tail -f /var/log/dl.log.

  • Sample chatbot web application, accessible at http://dl_vm_ip:3001.

    You can upload your own knowledge base.
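
The startup script also launches DCGM Exporter to collect vGPU metrics (see deploy_dcgm_exporter in the script above). A quick check from inside the VM, assuming the default localhost binding on port 9400 and one of DCGM Exporter's standard Prometheus metric names:

    curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL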

Assigning a Static IP Address to a Deep Learning VM in VMware Private AI Foundation with NVIDIA

By default, the Deep Learning VM images are configured with DHCP address assignment. If you want to deploy a Deep Learning VM with a static IP address directly on a vSphere cluster, you must add additional code to the cloud-init section.

In vSphere with Tanzu, IP address assignment is determined by the network configuration for the Supervisor in NSX.

Procedure

  1. Create a cloud-init script in plain-text format for the DL workload that you intend to use.
  2. Insert the following additional code into the cloud-init script.
    #cloud-config
    <instructions_for_your_DL_workload>
    
    manage_etc_hosts: true
     
    write_files:
      - path: /etc/netplan/50-cloud-init.yaml
        permissions: '0600'
        content: |
          network:
            version: 2
            renderer: networkd
            ethernets:
              ens33:
                dhcp4: false # disable DHCP4
                addresses: [x.x.x.x/x]  # Set the static IP address and mask
                routes:
                    - to: default
                      via: x.x.x.x # Configure gateway
                nameservers:
              addresses: [x.x.x.x, x.x.x.x] # Provide the DNS server address. Separate multiple DNS server addresses with commas.
     
    runcmd:
      - netplan apply
  3. Encode the resulting cloud-init script in base64 format (a sketch follows this procedure).
  4. Set the resulting base64-encoded cloud-init script as the value of the user-data OVF parameter of the Deep Learning VM image.
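
A minimal encoding sketch for steps 3 and 4, using GNU coreutils on a Linux host (the file names are illustrative; -w 0 keeps the output on a single line):

    base64 -w 0 cloud-init.yaml > user-data.b64
    # Optional sanity check: decode and compare with the original
    base64 -d user-data.b64 | diff - cloud-init.yaml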

Example: Assigning a Static IP Address to a CUDA Sample Workload

For a sample Deep Learning VM with a CUDA sample DL workload:

Deep Learning VM Element Example Value
DL workload image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
IP address 10.199.118.245
Subnet prefix /25
Gateway 10.199.118.253
DNS servers
  • 10.142.7.1
  • 10.132.7.1

you specify the following cloud-init code:

I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBkb2NrZXIgcnVuIC1kIG52Y3IuaW8vbnZpZGlhL2s4cy9jdWRhLXNhbXBsZTp2ZWN0b3JhZGQtY3VkYTExLjcuMS11Ymk4CgptYW5hZ2VfZXRjX2hvc3RzOiB0cnVlCiAKd3JpdGVfZmlsZXM6CiAgLSBwYXRoOiAvZXRjL25ldHBsYW4vNTAtY2xvdWQtaW5pdC55YW1sCiAgICBwZXJtaXNzaW9uczogJzA2MDAnCiAgICBjb250ZW50OiB8CiAgICAgIG5ldHdvcms6CiAgICAgICAgdmVyc2lvbjogMgogICAgICAgIHJlbmRlcmVyOiBuZXR3b3JrZAogICAgICAgIGV0aGVybmV0czoKICAgICAgICAgIGVuczMzOgogICAgICAgICAgICBkaGNwNDogZmFsc2UgIyBkaXNhYmxlIERIQ1A0CiAgICAgICAgICAgIGFkZHJlc3NlczogWzEwLjE5OS4xMTguMjQ1LzI1XSAgIyBTZXQgdGhlIHN0YXRpYyBJUCBhZGRyZXNzIGFuZCBtYXNrCiAgICAgICAgICAgIHJvdXRlczoKICAgICAgICAgICAgICAgIC0gdG86IGRlZmF1bHQKICAgICAgICAgICAgICAgICAgdmlhOiAxMC4xOTkuMTE4LjI1MyAjIENvbmZpZ3VyZSBnYXRld2F5CiAgICAgICAgICAgIG5hbWVzZXJ2ZXJzOgogICAgICAgICAgICAgIGFkZHJlc3NlczogWzEwLjE0Mi43LjEsIDEwLjEzMi43LjFdICMgUHJvdmlkZSB0aGUgRE5TIHNlcnZlciBhZGRyZXNzLiBTZXBhcmF0ZSBtdWxpdHBsZSBETlMgc2VydmVyIGFkZHJlc3NlcyB3aXRoIGNvbW1hcy4KIApydW5jbWQ6CiAgLSBuZXRwbGFuIGFwcGx5

which corresponds to the following script in plain-text format:

#cloud-config
write_files:
- path: /opt/dlvm/dl_app.sh
  permissions: '0755'
  content: |
    #!/bin/bash
    docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8

manage_etc_hosts: true
 
write_files:
  - path: /etc/netplan/50-cloud-init.yaml
    permissions: '0600'
    content: |
      network:
        version: 2
        renderer: networkd
        ethernets:
          ens33:
            dhcp4: false # disable DHCP4
            addresses: [10.199.118.245/25]  # Set the static IP address and mask
            routes:
                - to: default
                  via: 10.199.118.253 # Configure gateway
            nameservers:
              addresses: [10.142.7.1, 10.132.7.1] # Provide the DNS server address. Separate mulitple DNS server addresses with commas.
 
runcmd:
  - netplan apply
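
To confirm that the static address was applied after cloud-init runs, a quick check inside the VM (the interface name matches the example above):

    ip -4 addr show ens33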

Configuring a Deep Learning VM Instance with a Proxy Server

To connect your Deep Learning VM instance to the internet in a disconnected environment where internet access goes through a proxy server, you must provide the proxy server details in the config.json file on the virtual machine.

Procedure

  1. Create a JSON file with the properties for the proxy server.
    Proxy server that does not require authentication
    {  
      "http_proxy": "protocol://ip-address-or-fqdn:port",
      "https_proxy": "protocol://ip-address-or-fqdn:port"
    }
    Proxy server that requires authentication
    {  
      "http_proxy": "protocol://username:password@ip-address-or-fqdn:port",
      "https_proxy": "protocol://username:password@ip-address-or-fqdn:port"
    }

    where:

    • protocol is the communication protocol that the proxy server uses, for example, http or https.
    • username and password are the credentials for authenticating with the proxy server. If the proxy server does not require authentication, omit these parameters.
    • ip-address-or-fqdn is the IP address or host name of the proxy server.
    • port is the port number on which the proxy server listens for incoming requests.
  2. Encode the resulting JSON in base64 format.
  3. When you deploy the Deep Learning VM image, add the encoded value to the config-json OVF property.
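
A minimal sketch of steps 1 and 2 on a Linux host (the file name, proxy host name, and port are illustrative):

    # proxy.json created in step 1, with illustrative values:
    #   { "http_proxy": "http://proxy.example.com:3128",
    #     "https_proxy": "http://proxy.example.com:3128" }
    base64 -w 0 proxy.json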