When you deploy a deep learning VM in the vSphere IaaS control plane by using kubectl, or directly on a vSphere cluster, you must fill in the custom properties of the VM.

For more information about the deep learning VM images in VMware Private AI Foundation with NVIDIA, see About Deep Learning VM Images in VMware Private AI Foundation with NVIDIA.

OVF Properties of Deep Learning VMs

When you deploy a deep learning VM, you must fill in custom VM properties to automate the configuration of the Linux operating system, the deployment of the vGPU guest driver, and the deployment and configuration of NGC containers for the DL workloads.

The latest deep learning VM image has the following OVF properties:

Base OS Properties
  • instance-id (label in the vSphere Client: Instance ID). Required. A unique instance ID for the VM instance.

    An instance ID uniquely identifies an instance. When an instance ID changes, cloud-init treats the instance as a new instance and runs the cloud-init process again.

  • hostname (label: Hostname). Required. The host name of the appliance.
  • seedfrom (label: URL to seed instance data from). Optional. A URL to pull the value of the user-data parameter and the metadata from.
  • public-keys (label: SSH public key). If provided, the instance populates the default user's SSH authorized_keys with this value.
  • user-data (label: Encoded user-data). A set of scripts or other metadata that is inserted into the VM at provisioning time.

    This property is the actual contents of the cloud-init script. This value must be base64 encoded; see the encoding sketch after this table.

  • password (label: Default user's password). Required. The password for the default vmware user account.

vGPU Driver Installation
  • vgpu-license (label: vGPU license). Required. The NVIDIA vGPU client configuration token. The token is saved in the /etc/nvidia/ClientConfigToken/client_configuration_token.tok file.
  • nvidia-portal-api-key (label: NVIDIA Portal API key). Required in a connected environment. The API key that you downloaded from the NVIDIA Licensing Portal. The key is required for the vGPU guest driver installation.
  • vgpu-host-driver-version (label: vGPU host driver version). Install this version of the vGPU guest driver directly.
  • vgpu-url (label: URL for air-gapped vGPU downloads). Required in a disconnected environment. The URL to download the vGPU guest driver from. For information on the required configuration of the local Web server, see Preparing VMware Cloud Foundation for Private AI Workload Deployment.

DL Workload Automation
  • registry-uri (label: Registry URI). Required in a disconnected environment, or if you plan to use a private container registry to avoid downloading images from the Internet. The URI of a private container registry with the deep learning workload container images.

    Also required if you refer to a private registry in user-data or image-oneliner.

  • registry-user (label: Registry username). Required if you use a private container registry that requires basic authentication.
  • registry-passwd (label: Registry password). Required if you use a private container registry that requires basic authentication.
  • registry-2-uri (label: Secondary registry URI). Required if you use a second, Docker-based private container registry and it requires basic authentication.

    For example, when you deploy a deep learning VM with the NVIDIA RAG DL workload preinstalled, a pgvector image is downloaded from Docker Hub. You can use the registry-2- parameters to work around a pull rate limit for docker.io.

  • registry-2-user (label: Secondary registry username). Required if you use a second private container registry.
  • registry-2-passwd (label: Secondary registry password). Required if you use a second private container registry.
  • image-oneliner (label: Encoded one-line command). A one-line bash command that is run at VM provisioning time. This value must be base64 encoded.

    You can use this property to specify the DL workload container to deploy, such as PyTorch or TensorFlow. See Deep Learning Workloads in VMware Private AI Foundation with NVIDIA.

    Caution: Avoid using both user-data and image-oneliner.

  • docker-compose-uri (label: Encoded Docker Compose file). Required if you need a Docker Compose file to start the DL workload container. The contents of the docker-compose.yaml file that is inserted into the virtual machine at provisioning time, after the virtual machine starts with the GPU enabled. This value must be base64 encoded.
  • config-json (label: Encoded config.json file). The contents of a configuration file that adds details such as proxy settings and the export_dcgm_to_public and enable_jupyter_auth flags that the provisioning scripts in this section read. This value must be base64 encoded; see the encoding sketch after this table.
  • conda-environment-install (label: Conda environment install). A comma-separated list of Conda environments to install automatically at the end of the VM deployment, for example pytorch2.3_py3.12,tf2.16.1_py3.12.

    Available environments: pytorch2.3_py3.12, pytorch1.13.1_py3.10, tf2.16.1_py3.12, and tf1.15.5_py3.7.
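
Both the user-data and config-json values must be supplied base64 encoded. The following is a minimal sketch of producing them on a Linux workstation; the file names cloud-init.yaml and config.json and the proxy URL are placeholder assumptions, not values from this documentation. The config.json keys shown are the ones that the provisioning scripts later in this section read with jq: http_proxy, https_proxy, export_dcgm_to_public, and enable_jupyter_auth. A config.json of the following shape:

    {
      "http_proxy": "http://proxy.example.com:3128",
      "https_proxy": "http://proxy.example.com:3128",
      "export_dcgm_to_public": true,
      "enable_jupyter_auth": true
    }

can then be encoded, together with a cloud-init script, as follows:

    # -w 0 keeps the base64 output on a single line, as expected by the OVF properties
    base64 -w 0 cloud-init.yaml   # value for the user-data property
    base64 -w 0 config.json       # value for the config-json property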

Deep Learning Workloads in VMware Private AI Foundation with NVIDIA

You can provision a deep learning virtual machine with a supported deep learning (DL) workload in addition to its embedded components. The DL workloads are downloaded from the NVIDIA NGC catalog and are GPU-optimized and validated by NVIDIA and VMware by Broadcom.

For an overview of the deep learning VM images, see About Deep Learning VM Images in VMware Private AI Foundation with NVIDIA.

CUDA Sample

You can use a deep learning VM with running CUDA samples to explore vector addition, gravitational n-body simulation, or other samples on a VM. See the CUDA Samples page.

After the deep learning VM is launched, it runs a CUDA sample workload to test the vGPU guest driver. You can examine the test output in the /var/log/dl.log file.
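
For example, you can follow the provisioning log and then inspect the sample container output as follows; container_id is a placeholder taken from the docker ps output, and the vectorAdd sample is expected to print Test PASSED on success:

    tail -f /var/log/dl.log
    sudo docker ps -a
    sudo docker logs container_id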

Table 1. CUDA Samples Container Image
Container image:
nvcr.io/nvidia/k8s/cuda-sample:ngc_image_tag
For example:
nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8

For information on the CUDA sample container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs: To deploy a CUDA samples workload, set the OVF properties of the deep learning virtual machine in the following way:
  • Use one of the following properties, which are specific to the CUDA samples image.
    • cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          set_proxy "http" "https" "socks5"
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
          
          deploy_dcgm_exporter
      
          echo "Info: running the vectoradd CUDA container"
          CUDA_SAMPLE_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/cuda-sample"
          CUDA_SAMPLE_VERSION="ngc_image_tag"
          docker run -d $CUDA_SAMPLE_IMAGE:$CUDA_SAMPLE_VERSION
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
        
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

For example, for vectoradd-cuda11.7.1-ubi8, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICBzZXRfcHJveHkgImh0dHAiICJodHRwcyIgInNvY2tzNSIKICAgIHRyYXAgJ2Vycm9yX2V4aXQgIlVuZXhwZWN0ZWQgZXJyb3Igb2NjdXJzIGF0IGRsIHdvcmtsb2FkIicgRVJSCiAgICBERUZBVUxUX1JFR19VUkk9Im52Y3IuaW8iCiAgICBSRUdJU1RSWV9VUklfUEFUSD0kKGdyZXAgcmVnaXN0cnktdXJpIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKCiAgICBpZiBbWyAteiAiJFJFR0lTVFJZX1VSSV9QQVRIIiBdXTsgdGhlbgogICAgICAjIElmIFJFR0lTVFJZX1VSSV9QQVRIIGlzIG51bGwgb3IgZW1wdHksIHVzZSB0aGUgZGVmYXVsdCB2YWx1ZQogICAgICBSRUdJU1RSWV9VUklfUEFUSD0kREVGQVVMVF9SRUdfVVJJCiAgICAgIGVjaG8gIlJFR0lTVFJZX1VSSV9QQVRIIHdhcyBlbXB0eS4gVXNpbmcgZGVmYXVsdDogJFJFR0lTVFJZX1VSSV9QQVRIIgogICAgZmkKICAgIAogICAgIyBJZiBSRUdJU1RSWV9VUklfUEFUSCBjb250YWlucyAnLycsIGV4dHJhY3QgdGhlIFVSSSBwYXJ0CiAgICBpZiBbWyAkUkVHSVNUUllfVVJJX1BBVEggPT0gKiIvIiogXV07IHRoZW4KICAgICAgUkVHSVNUUllfVVJJPSQoZWNobyAiJFJFR0lTVFJZX1VSSV9QQVRIIiB8IGN1dCAtZCcvJyAtZjEpCiAgICBlbHNlCiAgICAgIFJFR0lTVFJZX1VSST0kUkVHSVNUUllfVVJJX1BBVEgKICAgIGZpCiAgCiAgICBSRUdJU1RSWV9VU0VSTkFNRT0kKGdyZXAgcmVnaXN0cnktdXNlciAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICBSRUdJU1RSWV9QQVNTV09SRD0kKGdyZXAgcmVnaXN0cnktcGFzc3dkIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgIGlmIFtbIC1uICIkUkVHSVNUUllfVVNFUk5BTUUiICYmIC1uICIkUkVHSVNUUllfUEFTU1dPUkQiIF1dOyB0aGVuCiAgICAgIGRvY2tlciBsb2dpbiAtdSAkUkVHSVNUUllfVVNFUk5BTUUgLXAgJFJFR0lTVFJZX1BBU1NXT1JEICRSRUdJU1RSWV9VUkkKICAgIGVsc2UKICAgICAgZWNobyAiV2FybmluZzogdGhlIHJlZ2lzdHJ5J3MgdXNlcm5hbWUgYW5kIHBhc3N3b3JkIGFyZSBpbnZhbGlkLCBTa2lwcGluZyBEb2NrZXIgbG9naW4uIgogICAgZmkKICAgIAogICAgZGVwbG95X2RjZ21fZXhwb3J0ZXIKCiAgICBlY2hvICJJbmZvOiBydW5uaW5nIHRoZSB2ZWN0b3JhZGQgQ1VEQSBjb250YWluZXIiCiAgICBDVURBX1NBTVBMRV9JTUFHRT0iJFJFR0lTVFJZX1VSSV9QQVRIL252aWRpYS9rOHMvY3VkYS1zYW1wbGUiCiAgICBDVURBX1NBTVBMRV9WRVJTSU9OPSJ2ZWN0b3JhZGQtY3VkYTExLjcuMS11Ymk4IgogICAgZG9ja2VyIHJ1biAtZCAkQ1VEQV9TQU1QTEVfSU1BR0U6JENVREFfU0FNUExFX1ZFUlNJT04KCi0gcGF0aDogL29wdC9kbHZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1cHBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgc
HJvdG9jb2xzCiAgICBzZXRfcHJveHkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLXogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQogIAogICAgICBjaGVja19wcm90b2NvbCAiJHtIVFRQX1BST1hZX1VSTH0iICIke3N1cHBvcnRlZF9wcm90b2NvbHNbQF19IgogICAgICBjaGVja19wcm90b2NvbCAiJHtIVFRQU19QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKCiAgICAgIGlmICEgZ3JlcCAtcSAnaHR0cF9wcm94eScgL2V0Yy9lbnZpcm9ubWVudDsgdGhlbgogICAgICAgIHN1ZG8gYmFzaCAtYyAnZWNobyAiZXhwb3J0IGh0dHBfcHJveHk9JHtIVFRQX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgaHR0cHNfcHJveHk9JHtIVFRQU19QUk9YWV9VUkx9CiAgICAgICAgZXhwb3J0IEhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUFNfUFJPWFk9JHtIVFRQU19QUk9YWV9VUkx9CiAgICAgICAgZXhwb3J0IG5vX3Byb3h5PWxvY2FsaG9zdCwxMjcuMC4wLjEiID4+IC9ldGMvZW52aXJvbm1lbnQnCiAgICAgICAgc291cmNlIC9ldGMvZW52aXJvbm1lbnQKICAgICAgZmkKICAgICAgCiAgICAgICMgQ29uZmlndXJlIERvY2tlciB0byB1c2UgYSBwcm94eQogICAgICBzdWRvIG1rZGlyIC1wIC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZAogICAgICBzdWRvIGJhc2ggLWMgJ2VjaG8gIltTZXJ2aWNlXQogICAgICBFbnZpcm9ubWVudD1cIkhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIkhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiTk9fUFJPWFk9bG9jYWxob3N0LDEyNy4wLjAuMVwiIiA+IC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZC9wcm94eS5jb25mJwogICAgICBzdWRvIHN5c3RlbWN0bCBkYWVtb24tcmVsb2FkCiAgICAgIHN1ZG8gc3lzdGVtY3RsIHJlc3RhcnQgZG9ja2VyCgogICAgICBlY2hvICJJbmZvOiBkb2NrZXIgYW5kIHN5c3RlbSBlbnZpcm9ubWVudCBhcmUgbm93IGNvbmZpZ3VyZWQgdG8gdXNlIHRoZSBwcm94eSBzZXR0aW5ncyIKICAgIH0KCiAgICBkZXBsb3lfZGNnbV9leHBvcnRlcigpIHsKICAgICAgQ09ORklHX0pTT05fQkFTRTY0PSQoZ3JlcCAnY29uZmlnLWpzb24nIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgICAgQ09ORklHX0pTT049JChlY2hvICR7Q09ORklHX0pTT05fQkFTRTY0fSB8IGJhc2U2NCAtLWRlY29kZSkKICAgICAgRENHTV9FWFBPUlRfUFVCTElDPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5leHBvcnRfZGNnbV90b19wdWJsaWMgLy8gZW1wdHknKQoKICAgICAgRENHTV9FWFBPUlRFUl9JTUFHRT0iJFJFR0lTVFJZX1VSSV9QQVRIL252aWRpYS9rOHMvZGNnbS1leHBvcnRlciIKICAgICAgRENHTV9FWFBPUlRFUl9WRVJTSU9OPSIzLjIuNS0zLjEuOC11YnVudHUyMi4wNCIKICAgICAgaWYgWyAteiAiJHtEQ0dNX0VYUE9SVF9QVUJMSUN9IiBdIHx8IFsgIiR7RENHTV9FWFBPUlRfUFVCTElDfSIgIT0gInRydWUiIF07IHRoZW4KICAgICAgICBlY2hvICJJbmZvOiBsYXVuY2hpbmcgRENHTSBFeHBvcnRlciB0byBjb2xsZWN0IHZHUFUgbWV0cmljcywgbGlzdGVuaW5nIG9ubHkgb24gbG9jYWxob3N0ICgxMjcuMC4wLjE6OTQwMCkiCiAgICAgICAgZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tY2FwLWFkZCBTWVNfQURNSU4gLXAgMTI3LjAuMC4xOjk0MDA6OTQwMCAkRENHTV9FWFBPUlRFUl9JTUFHRTokRENHTV9FWFBPUlRFUl9WRVJTSU9OCiAgICAgIGVsc2UKICAgICAgICBlY2hvICJJbmZvOiBsYXVuY2hpbmcgRENHTSBFeHBvcnRlciB0byBjb2xsZWN0IHZHUFUgbWV0cmljcywgZXhwb3NlZCBvbiBhbGwgbmV0d29yayBpbnRlcmZhY2VzICgwLjAuMC4wOjk0MDApIgogICAgICAgIGRvY2tlciBydW4gLWQgLS1ncHVzIGFsbCAtLWNhcC1hZGQgU1lTX0FETUlOIC1wIDk0MDA6OTQwMCAkRENHTV9FWFBPUlRFUl9JTUFHRTokRENHTV9FWFBPUlRFUl9WRVJTSU9OCiAgICAgIGZpCiAgICB9

which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          set_proxy "http" "https" "socks5"
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
          
          deploy_dcgm_exporter
      
          echo "Info: running the vectoradd CUDA container"
          CUDA_SAMPLE_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/cuda-sample"
          CUDA_SAMPLE_VERSION="vectoradd-cuda11.7.1-ubi8"
          docker run -d $CUDA_SAMPLE_IMAGE:$CUDA_SAMPLE_VERSION
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
        
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
    • One-line image. Encode it in base64 format, for example as shown in the sketch after this list.
      docker run -d nvcr.io/nvidia/k8s/cuda-sample:ngc_image_tag

For example, for vectoradd-cuda11.7.1-ubi8, provide the following command in base64 format:

      ZG9ja2VyIHJ1biAtZCBudmNyLmlvL252aWRpYS9rOHMvY3VkYS1zYW1wbGU6dmVjdG9yYWRkLWN1ZGExMS43LjEtdWJpOA==

which corresponds to the following command in plain-text format:

      docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Provide values for the properties required for a disconnected environment as needed.
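
The base64 values above can be reproduced with the base64 utility; the following is a minimal sketch, where echo -n avoids encoding a trailing newline:

    # Encode the one-line command
    echo -n 'docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8' | base64 -w 0
    # Verify by decoding the documented value back to plain text
    echo 'ZG9ja2VyIHJ1biAtZCBudmNyLmlvL252aWRpYS9rOHMvY3VkYS1zYW1wbGU6dmVjdG9yYWRkLWN1ZGExMS43LjEtdWJpOA==' | base64 --decode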

See OVF Properties of Deep Learning VMs.

Output
  • vGPU guest driver installation logs in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed and that the license is allocated, run the following command:

    nvidia-smi -q | grep -i license
  • cloud-init script logs in /var/log/dl.log.
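
The cloud-init script above also launches DCGM Exporter on port 9400. As a quick check, assuming the default localhost-only binding (that is, export_dcgm_to_public is not set to true), you can confirm inside the VM that vGPU metrics are being served:

    curl -s http://127.0.0.1:9400/metrics | head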

PyTorch

You can use a deep learning VM with a PyTorch library to explore conversational AI, natural language processing (NLP), and other types of AI models on a VM. See the PyTorch page.

After the deep learning VM is launched, it starts a JupyterLab instance with the PyTorch packages installed and configured.

Table 2. PyTorch Container Image
Container image:
nvcr.io/nvidia/pytorch-pb24h1:ngc_image_tag
For example:
nvcr.io/nvidia/pytorch-pb24h1:24.03.02-py3

For information on the PyTorch container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs: To deploy a PyTorch workload, set the OVF properties of the deep learning virtual machine in the following way:
  • Use one of the following properties, which are specific to the PyTorch image.
    • cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
          enableJupyterAuth=$(echo "${CONFIG_JSON}" | jq -r '.enable_jupyter_auth // empty')
      
          if [ -z "${enableJupyterAuth}" ] || [ "${enableJupyterAuth}" == true ]; then
            # Generate a random jupyter token
            TOKEN=$(python3 -c "import secrets; print(secrets.token_hex(32))")
            # Set the token to guestinfo
            vmtoolsd --cmd "info-set guestinfo.dlworkload.jupyterlab.token $TOKEN"
            echo "Info: JupyterLab notebook access token, $TOKEN"
          else
            TOKEN=""
          fi
      
          echo "Info: running the PyTorch container"
          PYTORCH_IMAGE="$REGISTRY_URI_PATH/nvidia/pytorch-pb24h1"
          PYTORCH_VERSION="ngc_image_tag"
          docker run -d --gpus all -p 8888:8888 $PYTORCH_IMAGE:$PYTORCH_VERSION /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

For example, for pytorch-pb24h1:24.03.02-py3, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCgogICAgREVGQVVMVF9SRUdfVVJJPSJudmNyLmlvIgogICAgUkVHSVNUUllfVVJJX1BBVEg9JChncmVwIHJlZ2lzdHJ5LXVyaSAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCgogICAgaWYgW1sgLXogIiRSRUdJU1RSWV9VUklfUEFUSCIgXV07IHRoZW4KICAgICAgIyBJZiBSRUdJU1RSWV9VUklfUEFUSCBpcyBudWxsIG9yIGVtcHR5LCB1c2UgdGhlIGRlZmF1bHQgdmFsdWUKICAgICAgUkVHSVNUUllfVVJJX1BBVEg9JERFRkFVTFRfUkVHX1VSSQogICAgICBlY2hvICJSRUdJU1RSWV9VUklfUEFUSCB3YXMgZW1wdHkuIFVzaW5nIGRlZmF1bHQ6ICRSRUdJU1RSWV9VUklfUEFUSCIKICAgIGZpCiAgICAKICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggY29udGFpbnMgJy8nLCBleHRyYWN0IHRoZSBVUkkgcGFydAogICAgaWYgW1sgJFJFR0lTVFJZX1VSSV9QQVRIID09ICoiLyIqIF1dOyB0aGVuCiAgICAgIFJFR0lTVFJZX1VSST0kKGVjaG8gIiRSRUdJU1RSWV9VUklfUEFUSCIgfCBjdXQgLWQnLycgLWYxKQogICAgZWxzZQogICAgICBSRUdJU1RSWV9VUkk9JFJFR0lTVFJZX1VSSV9QQVRICiAgICBmaQogIAogICAgUkVHSVNUUllfVVNFUk5BTUU9JChncmVwIHJlZ2lzdHJ5LXVzZXIgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgUkVHSVNUUllfUEFTU1dPUkQ9JChncmVwIHJlZ2lzdHJ5LXBhc3N3ZCAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICBpZiBbWyAtbiAiJFJFR0lTVFJZX1VTRVJOQU1FIiAmJiAtbiAiJFJFR0lTVFJZX1BBU1NXT1JEIiBdXTsgdGhlbgogICAgICBkb2NrZXIgbG9naW4gLXUgJFJFR0lTVFJZX1VTRVJOQU1FIC1wICRSRUdJU1RSWV9QQVNTV09SRCAkUkVHSVNUUllfVVJJCiAgICBlbHNlCiAgICAgIGVjaG8gIldhcm5pbmc6IHRoZSByZWdpc3RyeSdzIHVzZXJuYW1lIGFuZCBwYXNzd29yZCBhcmUgaW52YWxpZCwgU2tpcHBpbmcgRG9ja2VyIGxvZ2luLiIKICAgIGZpCgogICAgZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC1wIDg4ODg6ODg4OCAkUkVHSVNUUllfVVJJX1BBVEgvbnZpZGlhL3B5dG9yY2g6MjMuMTAtcHkzIC91c3IvbG9jYWwvYmluL2p1cHl0ZXIgbGFiIC0tYWxsb3ctcm9vdCAtLWlwPSogLS1wb3J0PTg4ODggLS1uby1icm93c2VyIC0tTm90ZWJvb2tBcHAudG9rZW49JycgLS1Ob3RlYm9va0FwcC5hbGxvd19vcmlnaW49JyonIC0tbm90ZWJvb2stZGlyPS93b3Jrc3BhY2UKCi0gcGF0aDogL29wdC9kbHZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1cHBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgcHJvdG9jb2xzCiAgICBzZXRfcHJve
HkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLXogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQoKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUF9QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUFNfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCgogICAgICBpZiAhIGdyZXAgLXEgJ2h0dHBfcHJveHknIC9ldGMvZW52aXJvbm1lbnQ7IHRoZW4KICAgICAgICBlY2hvICJleHBvcnQgaHR0cF9wcm94eT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBodHRwc19wcm94eT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgbm9fcHJveHk9bG9jYWxob3N0LDEyNy4wLjAuMSIgPj4gL2V0Yy9lbnZpcm9ubWVudAogICAgICAgIHNvdXJjZSAvZXRjL2Vudmlyb25tZW50CiAgICAgIGZpCiAgICAgIAogICAgICAjIENvbmZpZ3VyZSBEb2NrZXIgdG8gdXNlIGEgcHJveHkKICAgICAgbWtkaXIgLXAgL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kCiAgICAgIGVjaG8gIltTZXJ2aWNlXQogICAgICBFbnZpcm9ubWVudD1cIkhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIkhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiTk9fUFJPWFk9bG9jYWxob3N0LDEyNy4wLjAuMVwiIiA+IC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZC9wcm94eS5jb25mCiAgICAgIHN5c3RlbWN0bCBkYWVtb24tcmVsb2FkCiAgICAgIHN5c3RlbWN0bCByZXN0YXJ0IGRvY2tlcgoKICAgICAgZWNobyAiSW5mbzogZG9ja2VyIGFuZCBzeXN0ZW0gZW52aXJvbm1lbnQgYXJlIG5vdyBjb25maWd1cmVkIHRvIHVzZSB0aGUgcHJveHkgc2V0dGluZ3MiCiAgICB9

which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
          enableJupyterAuth=$(echo "${CONFIG_JSON}" | jq -r '.enable_jupyter_auth // empty')
      
          if [ -z "${enableJupyterAuth}" ] || [ "${enableJupyterAuth}" == true ]; then
            # Generate a random jupyter token
            TOKEN=$(python3 -c "import secrets; print(secrets.token_hex(32))")
            # Set the token to guestinfo
            vmtoolsd --cmd "info-set guestinfo.dlworkload.jupyterlab.token $TOKEN"
            echo "Info: JupyterLab notebook access token, $TOKEN"
          else
            TOKEN=""
          fi
      
          echo "Info: running the PyTorch container"
          PYTORCH_IMAGE="$REGISTRY_URI_PATH/nvidia/pytorch-pb24h1"
          PYTORCH_VERSION="24.03.02-py3"
          docker run -d --gpus all -p 8888:8888 $PYTORCH_IMAGE:$PYTORCH_VERSION /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
    • One-line image. Encode it in base64 format.
      docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch-pb24h1:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace

For example, for pytorch-pb24h1:24.03.02-py3, provide the following command in base64 format:

      ZG9ja2VyIHJ1biAtZCAtcCA4ODg4Ojg4ODggbnZjci5pby9udmlkaWEvcHl0b3JjaC1wYjI0aDE6MjQuMDMuMDItcHkzIC91c3IvbG9jYWwvYmluL2p1cHl0ZXIgbGFiIC0tYWxsb3ctcm9vdCAtLWlwPSogLS1wb3J0PTg4ODggLS1uby1icm93c2VyIC0tTm90ZWJvb2tBcHAudG9rZW49JycgLS1Ob3RlYm9va0FwcC5hbGxvd19vcmlnaW49JyonIC0tbm90ZWJvb2stZGlyPS93b3Jrc3BhY2U=

which corresponds to the following command in plain-text format:

      docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch-pb24h1:24.03.02-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • Provide values for the properties required for a disconnected environment as needed.

See OVF Properties of Deep Learning VMs.

Output
  • vGPU guest driver installation logs in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, run the nvidia-smi command.

  • cloud-init script logs in /var/log/dl.log.
  • PyTorch container.

    To verify that the PyTorch container is running, run the sudo docker ps -a and sudo docker logs container_id commands.

  • JupyterLab instance that you can access at http://dl_vm_ip:8888. To retrieve the access token when JupyterLab authentication is enabled, see the sketch after this list.

    In the JupyterLab terminal, verify that the following functionality is available in the notebook:

    • To verify that JupyterLab can access the vGPU resource, run nvidia-smi.
    • To verify that the PyTorch related packages are installed, run pip show.
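
When enable_jupyter_auth is unset or true, the cloud-init script above generates a random access token and publishes it through guestinfo. A minimal sketch for reading it back from inside the guest, using the info-get counterpart of the info-set call in the script:

    vmtoolsd --cmd "info-get guestinfo.dlworkload.jupyterlab.token"

You can then enter the token on the JupyterLab login page, or append it to the URL as http://dl_vm_ip:8888/?token=<token>.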

TensorFlow

You can use a deep learning VM with a TensorFlow library to explore conversational AI, NLP, and other types of AI models on a VM. See the TensorFlow page.

After the deep learning VM is launched, it starts a JupyterLab instance with the TensorFlow packages installed and configured.

Table 3. TensorFlow Container Image
Container image:
nvcr.io/nvidia/tensorflow-pb24h1:ngc_image_tag
For example:
nvcr.io/nvidia/tensorflow-pb24h1:24.03.02-tf2-py3

For information on the TensorFlow container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs: To deploy a TensorFlow workload, set the OVF properties of the deep learning virtual machine in the following way:
  • Use one of the following properties, which are specific to the TensorFlow image.
    • cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
          enableJupyterAuth=$(echo "${CONFIG_JSON}" | jq -r '.enable_jupyter_auth // empty')
      
          if [ -z "${enableJupyterAuth}" ] || [ "${enableJupyterAuth}" == true ]; then
            # Generate a random jupyter token
            TOKEN=$(python3 -c "import secrets; print(secrets.token_hex(32))")
            # Set the token to guestinfo
            vmtoolsd --cmd "info-set guestinfo.dlworkload.jupyterlab.token $TOKEN"
            echo "Info: JupyterLab notebook access token, $TOKEN"
          else
            TOKEN=""
          fi
      
          echo "Info: running the Tensorflow container"    
          TENSORFLOW_IMAGE="$REGISTRY_URI_PATH/nvidia/tensorflow-pb24h1"
          TENSORFLOW_VERSION="ngc_image_tag"
          docker run -d --gpus all -p 8888:8888 $TENSORFLOW_IMAGE:$TENSORFLOW_VERSION /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
          
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

      For example, for tensorflow-pb24h1:24.03.02-tf2-py3, encode the following plain-text script in base64 and provide it as the value:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
          enableJupyterAuth=$(echo "${CONFIG_JSON}" | jq -r '.enable_jupyter_auth // empty')
      
          if [ -z "${enableJupyterAuth}" ] || [ "${enableJupyterAuth}" == true ]; then
            # Generate a random jupyter token
            TOKEN=$(python3 -c "import secrets; print(secrets.token_hex(32))")
            # Set the token to guestinfo
            vmtoolsd --cmd "info-set guestinfo.dlworkload.jupyterlab.token $TOKEN"
            echo "Info: JupyterLab notebook access token, $TOKEN"
          else
            TOKEN=""
          fi
      
          echo "Info: running the Tensorflow container"    
          TENSORFLOW_IMAGE="$REGISTRY_URI_PATH/nvidia/tensorflow-pb24h1"
          TENSORFLOW_VERSION="24.03.02-tf2-py3"
          docker run -d --gpus all -p 8888:8888 $TENSORFLOW_IMAGE:$TENSORFLOW_VERSION /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
          
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
    • One-line image. Encode it in base64; a short encoding sketch follows this list.
      docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow-pb24h1:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace

      For example, for tensorflow-pb24h1:24.03.02-tf2-py3, provide the following one-liner in base64 format:

      ZG9ja2VyIHJ1biAtZCAtcCA4ODg4Ojg4ODggbnZjci5pby9udmlkaWEvdGVuc29yZmxvdy1wYjI0aDE6MjQuMDMuMDItdGYyLXB5MyAvdXNyL2xvY2FsL2Jpbi9qdXB5dGVyIGxhYiAtLWFsbG93LXJvb3QgLS1pcD0qIC0tcG9ydD04ODg4IC0tbm8tYnJvd3NlciAtLU5vdGVib29rQXBwLnRva2VuPScnIC0tTm90ZWJvb2tBcHAuYWxsb3dfb3JpZ2luPScqJyAtLW5vdGVib29rLWRpcj0vd29ya3NwYWNl

      which corresponds to the following plain-text script:

      docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow-pb24h1:24.03.02-tf2-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • If needed, provide the values of the properties required for a disconnected environment.
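
The OVF properties accept these values only in base64. A minimal encoding sketch on a Linux workstation, assuming the plain-text script is saved as cloud-init.yaml and the one-liner as one-liner.txt (both file names are only examples):

    # Encode a cloud-init script for the user-data OVF property; -w 0 disables
    # line wrapping so the output can be pasted as a single string.
    base64 -w 0 cloud-init.yaml

    # Encode a one-line image command saved in a file the same way.
    base64 -w 0 one-liner.txt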

See the OVF Properties of Deep Learning VMs section.

Output
  • vGPU guest driver installation logs in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • cloud-init script logs in /var/log/dl.log.
  • TensorFlow container.

    To verify that the TensorFlow container is running, run the sudo docker ps -a and sudo docker logs container_id commands.

  • JupyterLab instance that you can access at http://dl_vm_ip:8888.

    In the JupyterLab terminal, verify that the following functionality is available in the notebook (see also the verification sketch after this list):

    • To verify that JupyterLab can access the vGPU resource, run nvidia-smi.
    • To verify that the TensorFlow-related packages are installed, run pip show (for example, pip show tensorflow).
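
A minimal verification sketch, run over SSH on the deep learning VM (container_id varies per deployment; the guestinfo key is the one set by the cloud-init script above):

    # Confirm the TensorFlow container is running and inspect its logs.
    sudo docker ps -a
    sudo docker logs container_id

    # Read the generated JupyterLab access token back from guestinfo.
    vmtoolsd --cmd "info-get guestinfo.dlworkload.jupyterlab.token"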

DCGM Exporter

You can use a deep learning VM with the Data Center GPU Manager (DCGM) Exporter to monitor the health of, and get metrics from, the GPUs used by a DL workload, using NVIDIA DCGM, Prometheus, and Grafana.

See the DCGM Exporter page.

In a deep learning VM, you run the DCGM Exporter container together with a DL workload that performs AI operations. After the deep learning VM starts, the DCGM Exporter is ready to collect vGPU metrics and export the data to another application for further monitoring and visualization. You can run the monitored DL workload as part of the cloud-init process or from the command line after the virtual machine starts.
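
For example, after the exporter container starts, you can pull the raw metrics over HTTP; /metrics is the DCGM Exporter scrape endpoint, and the reachable address depends on the export_dcgm_to_public setting shown in the scripts below:

    # From inside the deep learning VM when the exporter listens on localhost only:
    curl http://127.0.0.1:9400/metrics

    # From another machine when export_dcgm_to_public is set to "true":
    curl http://dl_vm_ip:9400/metrics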

Table 4. DCGM Exporter Container Image
Component Description
Container image
nvcr.io/nvidia/k8s/dcgm-exporter:ngc_image_tag

For example:

nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04

For more information about the DCGM Exporter container images supported for deep learning VMs, see the VMware Deep Learning VM Release Notes.

Required inputs To deploy a DCGM Exporter workload, you must set the OVF properties of the deep learning virtual machine in the following way:
  • Use one of the following properties that are specific to the DCGM Exporter image.
    • cloud-init script. Encode it in base64.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          echo "Info: running the DCGM Export container"
          deploy_dcgm_exporter
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="ngc_image_tag"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

      For example, for a deep learning VM with a preinstalled dcgm-exporter:3.2.5-3.1.8-ubuntu22.04 DCGM Exporter instance, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCiAgICAKICAgIERFRkFVTFRfUkVHX1VSST0ibnZjci5pbyIKICAgIFJFR0lTVFJZX1VSSV9QQVRIPSQoZ3JlcCByZWdpc3RyeS11cmkgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQoKICAgIGlmIFtbIC16ICIkUkVHSVNUUllfVVJJX1BBVEgiIF1dOyB0aGVuCiAgICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggaXMgbnVsbCBvciBlbXB0eSwgdXNlIHRoZSBkZWZhdWx0IHZhbHVlCiAgICAgIFJFR0lTVFJZX1VSSV9QQVRIPSRERUZBVUxUX1JFR19VUkkKICAgICAgZWNobyAiUkVHSVNUUllfVVJJX1BBVEggd2FzIGVtcHR5LiBVc2luZyBkZWZhdWx0OiAkUkVHSVNUUllfVVJJX1BBVEgiCiAgICBmaQogICAgCiAgICAjIElmIFJFR0lTVFJZX1VSSV9QQVRIIGNvbnRhaW5zICcvJywgZXh0cmFjdCB0aGUgVVJJIHBhcnQKICAgIGlmIFtbICRSRUdJU1RSWV9VUklfUEFUSCA9PSAqIi8iKiBdXTsgdGhlbgogICAgICBSRUdJU1RSWV9VUkk9JChlY2hvICIkUkVHSVNUUllfVVJJX1BBVEgiIHwgY3V0IC1kJy8nIC1mMSkKICAgIGVsc2UKICAgICAgUkVHSVNUUllfVVJJPSRSRUdJU1RSWV9VUklfUEFUSAogICAgZmkKICAKICAgIFJFR0lTVFJZX1VTRVJOQU1FPSQoZ3JlcCByZWdpc3RyeS11c2VyIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgIFJFR0lTVFJZX1BBU1NXT1JEPSQoZ3JlcCByZWdpc3RyeS1wYXNzd2QgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgaWYgW1sgLW4gIiRSRUdJU1RSWV9VU0VSTkFNRSIgJiYgLW4gIiRSRUdJU1RSWV9QQVNTV09SRCIgXV07IHRoZW4KICAgICAgZG9ja2VyIGxvZ2luIC11ICRSRUdJU1RSWV9VU0VSTkFNRSAtcCAkUkVHSVNUUllfUEFTU1dPUkQgJFJFR0lTVFJZX1VSSQogICAgZWxzZQogICAgICBlY2hvICJXYXJuaW5nOiB0aGUgcmVnaXN0cnkncyB1c2VybmFtZSBhbmQgcGFzc3dvcmQgYXJlIGludmFsaWQsIFNraXBwaW5nIERvY2tlciBsb2dpbi4iCiAgICBmaQoKICAgIGVjaG8gIkluZm86IHJ1bm5pbmcgdGhlIERDR00gRXhwb3J0IGNvbnRhaW5lciIKICAgIGRlcGxveV9kY2dtX2V4cG9ydGVyCgotIHBhdGg6IC9vcHQvZGx2bS91dGlscy5zaAogIHBlcm1pc3Npb25zOiAnMDc1NScKICBjb250ZW50OiB8CiAgICAjIS9iaW4vYmFzaAogICAgZXJyb3JfZXhpdCgpIHsKICAgICAgZWNobyAiRXJyb3I6ICQxIiA+JjIKICAgICAgdm10b29sc2QgLS1jbWQgImluZm8tc2V0IGd1ZXN0aW5mby52bXNlcnZpY2UuYm9vdHN0cmFwLmNvbmRpdGlvbiBmYWxzZSwgRExXb3JrbG9hZEZhaWx1cmUsICQxIgogICAgICBleGl0IDEKICAgIH0KCiAgICBjaGVja19wcm90b2NvbCgpIHsKICAgICAgbG9jYWwgcHJveHlfdXJsPSQxCiAgICAgIHNoaWZ0CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCiAgICAgIGlmIFtbIC1uICIke3Byb3h5X3VybH0iIF1dOyB0aGVuCiAgICAgICAgbG9jYWwgcHJvdG9jb2w9JChlY2hvICIke3Byb3h5X3VybH0iIHwgYXdrIC1GICc6Ly8nICd7aWYgKE5GID4gMSkgcHJpbnQgJDE7IGVsc2UgcHJpbnQgIiJ9JykKICAgICAgICBpZiBbIC16ICIkcHJvdG9jb2wiIF07IHRoZW4KICAgICAgICAgIGVjaG8gIk5vIHNwZWNpZmljIHByb3RvY29sIHByb3ZpZGVkLiBTa2lwcGluZyBwcm90b2NvbCBjaGVjay4iCiAgICAgICAgICByZXR1cm4gMAogICAgICAgIGZpCiAgICAgICAgbG9jYWwgcHJvdG9jb2xfaW5jbHVkZWQ9ZmFsc2UKICAgICAgICBmb3IgdmFyIGluICIke3N1cHBvcnRlZF9wcm90b2NvbHNbQF19IjsgZG8KICAgICAgICAgIGlmIFtbICIke3Byb3RvY29sfSIgPT0gIiR7dmFyfSIgXV07IHRoZW4KICAgICAgICAgICAgcHJvdG9jb2xfaW5jbHVkZWQ9dHJ1ZQogICAgICAgICAgICBicmVhawogICAgICAgICAgZmkKICAgICAgICBkb25lCiAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2xfaW5jbHVkZWR9IiA9PSBmYWxzZSBdXTsgdGhlbgogICAgICAgICAgZXJyb3JfZXhpdCAiVW5zdXBwb3J0ZWQgcHJvdG9jb2w6ICR7cHJvdG9jb2x9LiBTdXBwb3J0ZWQgcHJvdG9jb2xzIGFyZTogJHtzdXBwb3J0ZWRfcHJvdG9jb2xzWypdfSIKICAgICAgICBmaQogICAgICBmaQogICAgfQoKICAgICMgJEA6IGxpc3Qgb2Ygc3VwcG9ydGVkIHByb3RvY29scwogICAgc2V0X3Byb3h5KCkgewogICAgICBsb2NhbCBzdXBwb3J0ZWRfcHJvdG9jb2xzPSgiJEAiKQoKICAgICAgQ09ORklHX0pTT05fQkFTRTY0PSQoZ3JlcCAnY29uZmlnLWpzb24nIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKI
CAgICAgQ09ORklHX0pTT049JChlY2hvICR7Q09ORklHX0pTT05fQkFTRTY0fSB8IGJhc2U2NCAtLWRlY29kZSkKCiAgICAgIEhUVFBfUFJPWFlfVVJMPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5odHRwX3Byb3h5IC8vIGVtcHR5JykKICAgICAgSFRUUFNfUFJPWFlfVVJMPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5odHRwc19wcm94eSAvLyBlbXB0eScpCiAgICAgIGlmIFtbICQ/IC1uZSAwIHx8ICgteiAiJHtIVFRQX1BST1hZX1VSTH0iICYmIC16ICIke0hUVFBTX1BST1hZX1VSTH0iKSBdXTsgdGhlbgogICAgICAgIGVjaG8gIkluZm86IFRoZSBjb25maWctanNvbiB3YXMgcGFyc2VkLCBidXQgbm8gcHJveHkgc2V0dGluZ3Mgd2VyZSBmb3VuZC4iCiAgICAgICAgcmV0dXJuIDAKICAgICAgZmkKCiAgICAgIGNoZWNrX3Byb3RvY29sICIke0hUVFBfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCiAgICAgIGNoZWNrX3Byb3RvY29sICIke0hUVFBTX1BST1hZX1VSTH0iICIke3N1cHBvcnRlZF9wcm90b2NvbHNbQF19IgoKICAgICAgaWYgISBncmVwIC1xICdodHRwX3Byb3h5JyAvZXRjL2Vudmlyb25tZW50OyB0aGVuCiAgICAgICAgc3VkbyBiYXNoIC1jICdlY2hvICJleHBvcnQgaHR0cF9wcm94eT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBodHRwc19wcm94eT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgbm9fcHJveHk9bG9jYWxob3N0LDEyNy4wLjAuMSIgPj4gL2V0Yy9lbnZpcm9ubWVudCcKICAgICAgICBzb3VyY2UgL2V0Yy9lbnZpcm9ubWVudAogICAgICBmaQogICAgICAKICAgICAgIyBDb25maWd1cmUgRG9ja2VyIHRvIHVzZSBhIHByb3h5CiAgICAgIHN1ZG8gbWtkaXIgLXAgL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kCiAgICAgIHN1ZG8gYmFzaCAtYyAnZWNobyAiW1NlcnZpY2VdCiAgICAgIEVudmlyb25tZW50PVwiSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiSFRUUFNfUFJPWFk9JHtIVFRQU19QUk9YWV9VUkx9XCIKICAgICAgRW52aXJvbm1lbnQ9XCJOT19QUk9YWT1sb2NhbGhvc3QsMTI3LjAuMC4xXCIiID4gL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kL3Byb3h5LmNvbmYnCiAgICAgIHN1ZG8gc3lzdGVtY3RsIGRhZW1vbi1yZWxvYWQKICAgICAgc3VkbyBzeXN0ZW1jdGwgcmVzdGFydCBkb2NrZXIKCgogICAgICBlY2hvICJJbmZvOiBkb2NrZXIgYW5kIHN5c3RlbSBlbnZpcm9ubWVudCBhcmUgbm93IGNvbmZpZ3VyZWQgdG8gdXNlIHRoZSBwcm94eSBzZXR0aW5ncyIKICAgIH0KCiAgICBkZXBsb3lfZGNnbV9leHBvcnRlcigpIHsKICAgICAgQ09ORklHX0pTT05fQkFTRTY0PSQoZ3JlcCAnY29uZmlnLWpzb24nIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgICAgQ09ORklHX0pTT049JChlY2hvICR7Q09ORklHX0pTT05fQkFTRTY0fSB8IGJhc2U2NCAtLWRlY29kZSkKICAgICAgRENHTV9FWFBPUlRfUFVCTElDPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5leHBvcnRfZGNnbV90b19wdWJsaWMgLy8gZW1wdHknKQoKICAgICAgRENHTV9FWFBPUlRFUl9JTUFHRT0iJFJFR0lTVFJZX1VSSV9QQVRIL252aWRpYS9rOHMvZGNnbS1leHBvcnRlciIKICAgICAgRENHTV9FWFBPUlRFUl9WRVJTSU9OPSIzLjIuNS0zLjEuOC11YnVudHUyMi4wNCIKICAgICAgaWYgWyAteiAiJHtEQ0dNX0VYUE9SVF9QVUJMSUN9IiBdIHx8IFsgIiR7RENHTV9FWFBPUlRfUFVCTElDfSIgIT0gInRydWUiIF07IHRoZW4KICAgICAgICBlY2hvICJJbmZvOiBsYXVuY2hpbmcgRENHTSBFeHBvcnRlciB0byBjb2xsZWN0IHZHUFUgbWV0cmljcywgbGlzdGVuaW5nIG9ubHkgb24gbG9jYWxob3N0ICgxMjcuMC4wLjE6OTQwMCkiCiAgICAgICAgZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tY2FwLWFkZCBTWVNfQURNSU4gLXAgMTI3LjAuMC4xOjk0MDA6OTQwMCAkRENHTV9FWFBPUlRFUl9JTUFHRTokRENHTV9FWFBPUlRFUl9WRVJTSU9OCiAgICAgIGVsc2UKICAgICAgICBlY2hvICJJbmZvOiBsYXVuY2hpbmcgRENHTSBFeHBvcnRlciB0byBjb2xsZWN0IHZHUFUgbWV0cmljcywgZXhwb3NlZCBvbiBhbGwgbmV0d29yayBpbnRlcmZhY2VzICgwLjAuMC4wOjk0MDApIgogICAgICAgIGRvY2tlciBydW4gLWQgLS1ncHVzIGFsbCAtLWNhcC1hZGQgU1lTX0FETUlOIC1wIDk0MDA6OTQwMCAkRENHTV9FWFBPUlRFUl9JTUFHRTokRENHTV9FWFBPUlRFUl9WRVJTSU9OCiAgICAgIGZpCiAgICB9
      which corresponds to the following plain-text script:
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          echo "Info: running the DCGM Export container"
          deploy_dcgm_exporter
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
      Note: You can also add to the cloud-init script the instructions for running the DL workload whose GPU performance you want to measure with the DCGM Exporter.
    • One-line image. Encode it in base64.
      docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:ngc_image_tag-ubuntu22.04

      For example, for dcgm-exporter:3.2.5-3.1.8-ubuntu22.04, provide the following one-liner in base64 format:

      ZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tY2FwLWFkZCBTWVNfQURNSU4gLS1ybSAtcCA5NDAwOjk0MDAgbnZjci5pby9udmlkaWEvazhzL2RjZ20tZXhwb3J0ZXI6My4yLjUtMy4xLjgtdWJ1bnR1MjIuMDQ=

      which corresponds to the following plain-text script:

      docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • If needed, provide the values of the properties required for a disconnected environment.

See the OVF Properties of Deep Learning VMs section.
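
When you deploy through kubectl, the OVF properties are typically supplied from a ConfigMap referenced by the VirtualMachine resource. A hedged sketch, assuming the VM Service v1alpha1 metadata transport; all names and values here are placeholders:

    # Placeholder ConfigMap carrying the OVF properties; reference it from the
    # VirtualMachine resource with vmMetadata transport OvfEnv (field names
    # assume the vmoperator v1alpha1 schema).
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: dcgm-exporter-vm-metadata
      namespace: my-ai-namespace
    data:
      instance-id: dcgm-exporter-vm
      hostname: dcgm-exporter-vm
      user-data: <base64-encoded cloud-init script>
      vgpu-license: <NVIDIA client configuration token>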

Output
  • vGPU guest driver installation logs in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • cloud-init script logs in /var/log/dl.log.
  • DCGM Exporter that you can access at http://dl_vm_ip:9400.

Next, in the deep learning VM, you run a DL workload and visualize the data on another virtual machine by using Prometheus at http://visualization_vm_ip:9090 and Grafana at http://visualization_vm_ip:3000.

Run a DL Workload on the Deep Learning VM

Run the DL workload for which you want to collect vGPU metrics and export the data to another application for further monitoring and visualization.

  1. Log in to the deep learning VM as vmware over SSH.
  2. Run the container for the DL workload, pulling it from the NVIDIA NGC catalog or from a local container registry (a sketch for a private registry follows this list).

    For example, run the following command to launch the tensorflow-pb24h1:24.03.02-tf2-py3 image from NVIDIA NGC:

    docker run -d --gpus all -p 8888:8888 nvcr.io/nvidia/tensorflow-pb24h1:24.03.02-tf2-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
  3. Use the DL workload for AI development.
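
If you pull the image from a private container registry instead of NVIDIA NGC, log in first. A sketch with placeholder values (the registry address and user are examples; the run command is the same as in step 2, with the image path prefixed by the registry):

    sudo docker login my-registry.example.com -u my-user
    sudo docker run -d --gpus all -p 8888:8888 my-registry.example.com/nvidia/tensorflow-pb24h1:24.03.02-tf2-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace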

Install Prometheus and Grafana

You can visualize and monitor the vGPU metrics from the DCGM Exporter virtual machine on a virtual machine running Prometheus and Grafana.

  1. Create a visualization VM with Docker Community Engine installed.
  2. Log in to the VM over SSH and create a YAML file for Prometheus.
    $ cat > prometheus.yml << EOF
    global:
      scrape_interval: 15s
      external_labels:
        monitor: 'codelab-monitor'
    scrape_configs:
      - job_name: 'dcgm'
        scrape_interval: 5s
        metrics_path: /metrics
        static_configs:
          - targets: ['dl_vm_with_dcgm_exporter_ip:9400']
    EOF
    
  3. Create the data directories.
    $ mkdir grafana_data prometheus_data && chmod 777 grafana_data prometheus_data
    
  4. Create a Docker Compose file to install Prometheus and Grafana.
    $ cat > compose.yaml << EOF
    services:
      prometheus:
        image: prom/prometheus:v2.47.2
        container_name: "prometheus0"
        restart: always
        ports:
          - "9090:9090"
        volumes:
          - "./prometheus.yml:/etc/prometheus/prometheus.yml"
          - "./prometheus_data:/prometheus"
      grafana:
        image: grafana/grafana:10.2.0-ubuntu
        container_name: "grafana0"
        ports:
          - "3000:3000"
        restart: always
        volumes:
          - "./grafana_data:/var/lib/grafana"
    EOF
    
  5. Start the Prometheus and Grafana containers.
    $ sudo docker compose up -d        
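
    To confirm that both services came up, you can probe their standard health endpoints (/-/healthy for Prometheus, /api/health for Grafana):

    $ sudo docker compose ps
    $ curl -s http://localhost:9090/-/healthy
    $ curl -s http://localhost:3000/api/health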
    

View vGPU Metrics in Prometheus

You can access Prometheus at http://visualization-vm-ip:9090. You can view the following vGPU information in the Prometheus UI:

Information UI section
Raw vGPU metrics from the deep learning VM Status > Targets

To view the raw vGPU metrics from the deep learning VM, click the endpoint entry.

Graph expressions
  1. In the main navigation bar, click the Graph tab.
  2. Enter an expression and click Execute (an example query follows this table).
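
For example, assuming the exporter publishes the default DCGM field set, which typically includes DCGM_FI_DEV_GPU_UTIL, the following query plots average utilization per GPU:

    avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)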

For more information about using Prometheus, see the Prometheus documentation.

Visualize Metrics in Grafana

Set Prometheus as a data source for Grafana and visualize the vGPU metrics from the deep learning VM in a dashboard.

  1. Access Grafana at http://visualization-vm-ip:3000 by using the default user name admin and password admin.
  2. Add Prometheus as the first data source, connecting to visualization-vm-ip on port 9090 (a provisioning sketch follows this list).
  3. Create a dashboard with the vGPU metrics.
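
Instead of adding the data source in the UI, you can provision it from a file mounted into the Grafana container. A sketch of the standard Grafana data source provisioning format; the file path and names are examples:

    # Save as ./provisioning/prometheus-ds.yml and mount it at
    # /etc/grafana/provisioning/datasources/ in the grafana service.
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://visualization-vm-ip:9090
        isDefault: true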

For more information about configuring a dashboard by using a Prometheus data source, see the Grafana documentation.

Triton Inference Server

You can use a deep learning VM with a Triton Inference Server to load a model repository and serve inference requests.

See the Triton Inference Server page.

Table 5. Triton Inference Server Container Image
Component Description
Container image
nvcr.io/nvidia/tritonserver-pb24h1:ngc_image_tag

For example:

nvcr.io/nvidia/tritonserver-pb24h1:24.03.02-py3

For more information about the Triton Inference Server container images supported for deep learning VMs, see the VMware Deep Learning VM Release Notes.
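
After the workload starts, Triton serves the KServe-style v2 HTTP endpoints on port 8000; a quick readiness check from inside the deep learning VM:

    # An HTTP 200 response indicates that the server is ready to serve inference requests.
    curl -v http://localhost:8000/v2/health/ready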

Required inputs To deploy a Triton Inference Server workload, you must set the OVF properties of the deep learning virtual machine in the following way:
  • Use one of the following properties that are specific to the Triton Inference Server image.
    • cloud-init script. Encode it in base64.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          echo "Info: running the Triton Inference Server container"
          TRITON_IMAGE="$REGISTRY_URI_PATH/nvidia/tritonserver-pb24h1"
          TRITON_VERSION="24.03.02-py3"
          docker run -d --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/vmware/model_repository:/models $TRITON_IMAGE:$TRITON_VERSION tritonserver --model-repository=/models --model-control-mode=poll
          
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

      For example, for tritonserver-pb24h1:24.03.02-py3, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCgogICAgREVGQVVMVF9SRUdfVVJJPSJudmNyLmlvIgogICAgUkVHSVNUUllfVVJJX1BBVEg9JChncmVwIHJlZ2lzdHJ5LXVyaSAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCgogICAgaWYgW1sgLXogIiRSRUdJU1RSWV9VUklfUEFUSCIgXV07IHRoZW4KICAgICAgIyBJZiBSRUdJU1RSWV9VUklfUEFUSCBpcyBudWxsIG9yIGVtcHR5LCB1c2UgdGhlIGRlZmF1bHQgdmFsdWUKICAgICAgUkVHSVNUUllfVVJJX1BBVEg9JERFRkFVTFRfUkVHX1VSSQogICAgICBlY2hvICJSRUdJU1RSWV9VUklfUEFUSCB3YXMgZW1wdHkuIFVzaW5nIGRlZmF1bHQ6ICRSRUdJU1RSWV9VUklfUEFUSCIKICAgIGZpCiAgICAKICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggY29udGFpbnMgJy8nLCBleHRyYWN0IHRoZSBVUkkgcGFydAogICAgaWYgW1sgJFJFR0lTVFJZX1VSSV9QQVRIID09ICoiLyIqIF1dOyB0aGVuCiAgICAgIFJFR0lTVFJZX1VSST0kKGVjaG8gIiRSRUdJU1RSWV9VUklfUEFUSCIgfCBjdXQgLWQnLycgLWYxKQogICAgZWxzZQogICAgICBSRUdJU1RSWV9VUkk9JFJFR0lTVFJZX1VSSV9QQVRICiAgICBmaQogIAogICAgUkVHSVNUUllfVVNFUk5BTUU9JChncmVwIHJlZ2lzdHJ5LXVzZXIgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgUkVHSVNUUllfUEFTU1dPUkQ9JChncmVwIHJlZ2lzdHJ5LXBhc3N3ZCAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICBpZiBbWyAtbiAiJFJFR0lTVFJZX1VTRVJOQU1FIiAmJiAtbiAiJFJFR0lTVFJZX1BBU1NXT1JEIiBdXTsgdGhlbgogICAgICBkb2NrZXIgbG9naW4gLXUgJFJFR0lTVFJZX1VTRVJOQU1FIC1wICRSRUdJU1RSWV9QQVNTV09SRCAkUkVHSVNUUllfVVJJCiAgICBlbHNlCiAgICAgIGVjaG8gIldhcm5pbmc6IHRoZSByZWdpc3RyeSdzIHVzZXJuYW1lIGFuZCBwYXNzd29yZCBhcmUgaW52YWxpZCwgU2tpcHBpbmcgRG9ja2VyIGxvZ2luLiIKICAgIGZpCgogICAgZGVwbG95X2RjZ21fZXhwb3J0ZXIKCiAgICBlY2hvICJJbmZvOiBydW5uaW5nIHRoZSBUcml0b24gSW5mZXJlbmNlIFNlcnZlciBjb250YWluZXIiCiAgICBUUklUT05fSU1BR0U9IiRSRUdJU1RSWV9VUklfUEFUSC9udmlkaWEvdHJpdG9uc2VydmVyLXBiMjRoMSIKICAgIFRSSVRPTl9WRVJTSU9OPSIyNC4wMy4wMi1weTMiCiAgICBkb2NrZXIgcnVuIC1kIC0tZ3B1cyBhbGwgLXAgODAwMDo4MDAwIC1wIDgwMDE6ODAwMSAtcCA4MDAyOjgwMDIgLXYgL2hvbWUvdm13YXJlL21vZGVsX3JlcG9zaXRvcnk6L21vZGVscyAkVFJJVE9OX0lNQUdFOiRUUklUT05fVkVSU0lPTiB0cml0b25zZXJ2ZXIgLS1tb2RlbC1yZXBvc2l0b3J5PS9tb2RlbHMgLS1tb2RlbC1jb250cm9sLW1vZGU9cG9sbAogICAgCi0gcGF0aDogL29wdC9kbHZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1c
HBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgcHJvdG9jb2xzCiAgICBzZXRfcHJveHkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLXogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQoKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUF9QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUFNfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCgogICAgICBpZiAhIGdyZXAgLXEgJ2h0dHBfcHJveHknIC9ldGMvZW52aXJvbm1lbnQ7IHRoZW4KICAgICAgICBzdWRvIGJhc2ggLWMgJ2VjaG8gImV4cG9ydCBodHRwX3Byb3h5PSR7SFRUUF9QUk9YWV9VUkx9CiAgICAgICAgZXhwb3J0IGh0dHBzX3Byb3h5PSR7SFRUUFNfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQX1BST1hZPSR7SFRUUF9QUk9YWV9VUkx9CiAgICAgICAgZXhwb3J0IEhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBub19wcm94eT1sb2NhbGhvc3QsMTI3LjAuMC4xIiA+PiAvZXRjL2Vudmlyb25tZW50JwogICAgICAgIHNvdXJjZSAvZXRjL2Vudmlyb25tZW50CiAgICAgIGZpCiAgICAgIAogICAgICAjIENvbmZpZ3VyZSBEb2NrZXIgdG8gdXNlIGEgcHJveHkKICAgICAgc3VkbyBta2RpciAtcCAvZXRjL3N5c3RlbWQvc3lzdGVtL2RvY2tlci5zZXJ2aWNlLmQKICAgICAgc3VkbyBiYXNoIC1jICdlY2hvICJbU2VydmljZV0KICAgICAgRW52aXJvbm1lbnQ9XCJIVFRQX1BST1hZPSR7SFRUUF9QUk9YWV9VUkx9XCIKICAgICAgRW52aXJvbm1lbnQ9XCJIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIk5PX1BST1hZPWxvY2FsaG9zdCwxMjcuMC4wLjFcIiIgPiAvZXRjL3N5c3RlbWQvc3lzdGVtL2RvY2tlci5zZXJ2aWNlLmQvcHJveHkuY29uZicKICAgICAgc3VkbyBzeXN0ZW1jdGwgZGFlbW9uLXJlbG9hZAogICAgICBzdWRvIHN5c3RlbWN0bCByZXN0YXJ0IGRvY2tlcgoKICAgICAgZWNobyAiSW5mbzogZG9ja2VyIGFuZCBzeXN0ZW0gZW52aXJvbm1lbnQgYXJlIG5vdyBjb25maWd1cmVkIHRvIHVzZSB0aGUgcHJveHkgc2V0dGluZ3MiCiAgICB9CgogICAgZGVwbG95X2RjZ21fZXhwb3J0ZXIoKSB7CiAgICAgIENPTkZJR19KU09OX0JBU0U2ND0kKGdyZXAgJ2NvbmZpZy1qc29uJyAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICAgIENPTkZJR19KU09OPSQoZWNobyAke0NPTkZJR19KU09OX0JBU0U2NH0gfCBiYXNlNjQgLS1kZWNvZGUpCiAgICAgIERDR01fRVhQT1JUX1BVQkxJQz0kKGVjaG8gIiR7Q09ORklHX0pTT059IiB8IGpxIC1yICcuZXhwb3J0X2RjZ21fdG9fcHVibGljIC8vIGVtcHR5JykKCiAgICAgIERDR01fRVhQT1JURVJfSU1BR0U9IiRSRUdJU1RSWV9VUklfUEFUSC9udmlkaWEvazhzL2RjZ20tZXhwb3J0ZXIiCiAgICAgIERDR01fRVhQT1JURVJfVkVSU0lPTj0iMy4yLjUtMy4xLjgtdWJ1bnR1MjIuMDQiCiAgICAgIGlmIFsgLXogIiR7RENHTV9FWFBPUlRfUFVCTElDfSIgXSB8fCBbICIke0RDR01fRVhQT1JUX1BVQkxJQ30iICE9ICJ0cnVlIiBdOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogbGF1bmNoaW5nIERDR00gRXhwb3J0ZXIgdG8gY29sbGVjdCB2R1BVIG1ldHJpY3MsIGxpc3RlbmluZyBvbmx5IG9uIGxvY2FsaG9zdCAoMTI3LjAuMC4xOjk0MDApIgogICAgICAgIGRvY2tlciBydW4gLWQgLS1ncHVzIGFsbCAtLWNhcC1hZGQgU1lTX0FETUlOIC1wIDEyNy4wLjAuMTo5NDAwOjk0MDAgJERDR01fRVhQT1JURVJfSU1BR0U6JERDR01fRVhQT1JURVJfVkVSU0lPTgogICAgICBlbHNlCiAgICAgICAgZWNobyAiSW5mbzogbGF1bmNoaW5nIERDR00gRXhwb3J0ZXIgdG8gY29sbGVjdCB2R1BVIG1ldHJpY3MsIGV4cG9zZWQgb24gYWxsIG5ldHdvcmsgaW50ZXJmYWNlcyAoMC4wLjAuMDo5NDAwKSIKICAgICAgICBkb2NrZXIgcnVuIC1kIC0tZ3B1cyBhbGwg
LS1jYXAtYWRkIFNZU19BRE1JTiAtcCA5NDAwOjk0MDAgJERDR01fRVhQT1JURVJfSU1BR0U6JERDR01fRVhQT1JURVJfVkVSU0lPTgogICAgICBmaQogICAgfQ==

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          echo "Info: running the Triton Inference Server container"
          TRITON_IMAGE="$REGISTRY_URI_PATH/nvidia/tritonserver-pb24h1"
          TRITON_VERSION="24.03.02-py3"
          docker run -d --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/vmware/model_repository:/models $TRITON_IMAGE:$TRITON_VERSION tritonserver --model-repository=/models --model-control-mode=poll
          
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
    • One-line image command, encoded in base64 format
      docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver-pb24h1:ngc_image_tag tritonserver --model-repository=/models --model-control-mode=poll

      For example, for tritonserver:24.03.02-py3, provide the following script in base64 format (an encoding sketch follows this list):

      ZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tcm0gLXA4MDAwOjgwMDAgLXA4MDAxOjgwMDEgLXA4MDAyOjgwMDIgLXYgL2hvbWUvdm13YXJlL21vZGVsX3JlcG9zaXRvcnk6L21vZGVscyBudmNyLmlvL252aWRpYS90cml0b25zZXJ2ZXItcGIyNGgxOjI0LjAzLjAyLXB5MyB0cml0b25zZXJ2ZXIgLS1tb2RlbC1yZXBvc2l0b3J5PS9tb2RlbHMgLS1tb2RlbC1jb250cm9sLW1vZGU9cG9sbA==

      which corresponds to the following script in plain-text format:

      docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver-pb24h1:24.03.02-py3 tritonserver --model-repository=/models --model-control-mode=poll
  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • If necessary, provide the values of the properties required for a disconnected environment.
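
A minimal sketch of producing such a base64 value on a Linux workstation, assuming the plain-text script or one-liner is saved in a file named dl_script.sh (the file name is an assumption for illustration):

  # -w 0 disables line wrapping; the OVF property expects a single base64 string
  base64 -w 0 dl_script.sh

  # Optional sanity check: decode the value and compare it with the original file
  base64 -w 0 dl_script.sh | base64 -d | diff - dl_script.sh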

See the section OVF Properties of Deep Learning VMs.

Output
  • vGPU guest driver installation logs in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • cloud-init script logs in /var/log/dl.log.
  • Triton Inference Server container.

    To verify that the Triton Inference Server container is running, run the docker ps -a and docker logs container_id commands.

The Triton Inference Server model repository is in /home/vmware/model_repository. Initially, the model repository is empty, and the initial log of the Triton Inference Server instance reports that no models are loaded.

Create a Model Repository

To load your model for inference, follow these steps:

  1. Create your model repository.

    See the NVIDIA Triton Inference Server model repository documentation. A minimal layout sketch follows this procedure.

  2. Copy the model repository to /home/vmware/model_repository so that the Triton Inference Server can load it.
    cp -r path_to_your_created_model_repository/* /home/vmware/model_repository/
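
    For illustration, a minimal repository layout for a single model, following the structure described in the NVIDIA documentation (the model name simple_sequence and the backend file name are assumptions for this sketch):

    model_repository/
    └── simple_sequence/
        ├── config.pbtxt        # model configuration: name, backend, input and output tensors
        └── 1/                  # numeric version directory
            └── model.graphdef  # model file; the expected file name depends on the backend

    Because the server runs with --model-control-mode=poll, Triton periodically rescans /models and loads newly copied models without a container restart.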
    

Send Model Inference Requests

  1. Verify that the Triton Inference Server is healthy and that the models are ready by running this command in the deep learning VM console.
    curl -v localhost:8000/v2/health/ready
  2. Send a request to the model by running this command on the deep learning VM. An inference call sketch follows this list.
    curl -v localhost:8000/v2/models/simple_sequence
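
    The command in step 2 returns the model metadata over Triton's KServe v2 HTTP API. To run an actual inference, send a POST request to the model's infer endpoint. A minimal sketch, assuming a model that takes a single INT32 input tensor named INPUT (the tensor name, shape, and data are hypothetical):

    curl -X POST localhost:8000/v2/models/simple_sequence/infer \
      -H "Content-Type: application/json" \
      -d '{"inputs": [{"name": "INPUT", "shape": [1, 1], "datatype": "INT32", "data": [42]}]}'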

For more information about using the Triton Inference Server, see the NVIDIA Triton Inference Server model repository documentation.

NVIDIA RAG

You can use a deep learning VM to build retrieval-augmented generation (RAG) solutions with a Llama2 model.

See the NVIDIA RAG Applications Docker Compose documentation (specific account permissions are required).

Table 6. NVIDIA RAG Container Image
Component Description
Container images and models docker-compose-nim-ms.yaml and rag-app-multiturn-chatbot/docker-compose.yaml in the NVIDIA RAG sample pipeline.

For information about the NVIDIA RAG container applications supported for deep learning VMs, see the VMware Deep Learning VM Release Notes.

Required Inputs

To deploy an NVIDIA RAG workload, you must set the OVF properties of the deep learning virtual machine in the following way:
  • Enter a cloud-init script, encoded in base64 format.

    For example, for NVIDIA RAG version 24.08, provide the following script in plain-text format and encode it in base64 before deployment. The format of the optional DSM PGVector value read from config-json is sketched after this list.

    #cloud-config
write_files:
- path: /opt/dlvm/dl_app.sh
  permissions: '0755'
  content: |
    #!/bin/bash
    set -eu
    source /opt/dlvm/utils.sh
    trap 'error_exit "Unexpected error occurs at dl workload"' ERR
    set_proxy "http" "https"
    
    sudo mkdir -p /opt/data/
    sudo chown vmware:vmware /opt/data
    sudo chmod -R 775 /opt/data
    cd /opt/data/

    cat <<EOF > /opt/data/config.json
    {
      "_comment_1": "This provides default support for RAG v24.08: llama3-8b-instruct model",
      "_comment_2": "Update llm_ms_gpu_id: specifies the GPU device ID to make available to the inference server when using multiple GPU",
      "_comment_3": "Update embedding_ms_gpu_id: specifies the GPU ID used for embedding model processing when using multiple GPU",
      "rag": {
        "org_name": "nvidia",
        "org_team_name": "aiworkflows",
        "rag_name": "ai-chatbot-docker-workflow",
        "rag_version": "24.08",
        "rag_app": "rag-app-multiturn-chatbot",
        "nim_model_profile": "auto",
        "llm_ms_gpu_id": "0",
        "embedding_ms_gpu_id": "0",
        "model_directory": "model-cache",
        "ngc_cli_version": "3.41.2"
      }
    }
    EOF

    CONFIG_JSON=$(cat "/opt/data/config.json")
    required_vars=("ORG_NAME" "ORG_TEAM_NAME" "RAG_NAME" "RAG_VERSION" "RAG_APP" "NIM_MODEL_PROFILE" "LLM_MS_GPU_ID" "EMBEDDING_MS_GPU_ID" "MODEL_DIRECTORY" "NGC_CLI_VERSION")

    # Extract rag values from /opt/data/config.json
    for index in "${!required_vars[@]}"; do
      key="${required_vars[$index]}"
      jq_query=".rag.${key,,} | select (.!=null)"
      value=$(echo "${CONFIG_JSON}" | jq -r "${jq_query}")
      if [[ -z "${value}" ]]; then 
        error_exit "${key} is required but not set."
      else
        eval ${key}=\""${value}"\"
      fi
    done

    # Read parameters from config-json to connect DSM PGVector on RAG
    CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
    CONFIG_JSON_PGVECTOR=$(echo "${CONFIG_JSON_BASE64}" | base64 -d)
    PGVECTOR_VALUE=$(echo ${CONFIG_JSON_PGVECTOR} | jq -r '.rag.pgvector')
    if [[ -n "${PGVECTOR_VALUE}" && "${PGVECTOR_VALUE}" != "null" ]]; then
      echo "Info: extract DSM PGVector parameters from config-json in XML"
      POSTGRES_USER=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $4}')
      POSTGRES_PASSWORD=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $5}')
      POSTGRES_HOST_IP=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $6}')
      POSTGRES_PORT_NUMBER=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $7}')
      POSTGRES_DB=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $8}')

      for var in POSTGRES_USER POSTGRES_PASSWORD POSTGRES_HOST_IP POSTGRES_PORT_NUMBER POSTGRES_DB; do
        if [ -z "${!var}" ]; then
          error_exit "${var} is not set."
        fi
      done
    fi

    gpu_info=$(nvidia-smi -L)
    echo "Info: the detected GPU info, $gpu_info"
    if [[ ${NIM_MODEL_PROFILE} == "auto" ]]; then 
      case "${gpu_info}" in
        *A100*)
          NIM_MODEL_PROFILE="751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c"
          echo "Info: GPU type A100 detected. Setting tensorrt_llm-A100-fp16-tp1-throughput as the default NIM model profile."
          ;;
        *H100*)
          NIM_MODEL_PROFILE="cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1"
          echo "Info: GPU type H100 detected. Setting tensorrt_llm-H100-fp16-tp1-throughput as the default NIM model profile."
          ;;
        *L40S*)
          NIM_MODEL_PROFILE="d8dd8af82e0035d7ca50b994d85a3740dbd84ddb4ed330e30c509e041ba79f80"
          echo "Info: GPU type L40S detected. Setting tensorrt_llm-L40S-fp16-tp1-throughput as the default NIM model profile."
          ;;
        *)
          NIM_MODEL_PROFILE="8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d"
          echo "Info: No supported GPU type detected (A100, H100, L40S). Setting vllm as the default NIM model profile."
          ;;
      esac
    else
      echo "Info: using the NIM model profile provided by the user, $NIM_MODEL_PROFILE"
    fi

    RAG_URI="${ORG_NAME}/${ORG_TEAM_NAME}/${RAG_NAME}:${RAG_VERSION}"
    RAG_FOLDER="${RAG_NAME}_v${RAG_VERSION}"
    NGC_CLI_URL="https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/${NGC_CLI_VERSION}/files/ngccli_linux.zip"

    if [ ! -f .initialize ]; then
      # clean up
      rm -rf compose.env ngc* ${RAG_NAME}* ${MODEL_DIRECTORY}* .initialize

      # install ngc-cli
      wget --content-disposition ${NGC_CLI_URL} -O ngccli_linux.zip && unzip -q ngccli_linux.zip
      export PATH=`pwd`/ngc-cli:${PATH}

      APIKEY=""
      DEFAULT_REG_URI="nvcr.io"

      REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      if [[ -z "${REGISTRY_URI_PATH}" ]]; then
        REGISTRY_URI_PATH=${DEFAULT_REG_URI}
        echo "Info: registry uri was empty. Using default: ${REGISTRY_URI_PATH}"
      fi

      if [[ "$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')" == *"${DEFAULT_REG_URI}"* ]]; then
        APIKEY=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      fi

      if [ -z "${APIKEY}" ]; then
          error_exit "No APIKEY found"
      fi

      # config ngc-cli
      mkdir -p ~/.ngc

      cat << EOF > ~/.ngc/config
      [CURRENT]
      apikey = ${APIKEY}
      format_type = ascii
      org = ${ORG_NAME}
      team = ${ORG_TEAM_NAME}
      ace = no-ace
    EOF
      
      # Extract registry URI if path contains '/'
      if [[ ${REGISTRY_URI_PATH} == *"/"* ]]; then
        REGISTRY_URI=$(echo "${REGISTRY_URI_PATH}" | cut -d'/' -f1)
      else
        REGISTRY_URI=${REGISTRY_URI_PATH}
      fi

      REGISTRY_USER=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')

      # Docker login if credentials are provided
      if [[ -n "${REGISTRY_USER}" && -n "${APIKEY}" ]]; then
        docker login -u ${REGISTRY_USER} -p ${APIKEY} ${REGISTRY_URI}
      else
        echo "Warning: the ${REGISTRY_URI} registry's username and password are invalid, Skipping Docker login."
      fi

      # DockerHub login for general components
      DOCKERHUB_URI=$(grep registry-2-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      DOCKERHUB_USERNAME=$(grep registry-2-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      DOCKERHUB_PASSWORD=$(grep registry-2-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')

      DOCKERHUB_URI=${DOCKERHUB_URI:-docker.io}
      if [[ -n "${DOCKERHUB_USERNAME}" && -n "${DOCKERHUB_PASSWORD}" ]]; then
        docker login -u ${DOCKERHUB_USERNAME} -p ${DOCKERHUB_PASSWORD} ${DOCKERHUB_URI}
      else
        echo "Warning: ${DOCKERHUB_URI} not logged in"
      fi

      # Download RAG files
      ngc registry resource download-version ${RAG_URI}

      mkdir -p /opt/data/${MODEL_DIRECTORY}

      # Update the docker-compose YAML files to correct the issue with GPU free/non-free status reporting
      /usr/bin/python3 -c "import yaml, json, sys; print(json.dumps(yaml.safe_load(sys.stdin.read())))" < "${RAG_FOLDER}/docker-compose-nim-ms.yaml"> docker-compose-nim-ms.json
      jq --arg profile "${NIM_MODEL_PROFILE}" \
         '.services."nemollm-inference".environment.NIM_MANIFEST_ALLOW_UNSAFE = "1" |
          .services."nemollm-inference".environment.NIM_MODEL_PROFILE = $profile |
          .services."nemollm-inference".deploy.resources.reservations.devices[0].device_ids = ["${LLM_MS_GPU_ID:-0}"] |
          del(.services."nemollm-inference".deploy.resources.reservations.devices[0].count)' docker-compose-nim-ms.json > temp.json && mv temp.json docker-compose-nim-ms.json
      /usr/bin/python3 -c "import yaml, json, sys; print(yaml.safe_dump(json.load(sys.stdin), default_flow_style=False, sort_keys=False))" < docker-compose-nim-ms.json > "${RAG_FOLDER}/docker-compose-nim-ms.yaml"
      rm -rf docker-compose-nim-ms.json

      # Update the docker-compose YAML files to configure PGVector as the default database
      /usr/bin/python3 -c "import yaml, json, sys; print(json.dumps(yaml.safe_load(sys.stdin.read())))" < "${RAG_FOLDER}/${RAG_APP}/docker-compose.yaml"> rag-app-multiturn-chatbot.json
      jq '.services."chain-server".environment.APP_VECTORSTORE_NAME = "pgvector" |
         .services."chain-server".environment.APP_VECTORSTORE_URL = "${POSTGRES_HOST_IP:-pgvector}:${POSTGRES_PORT_NUMBER:-5432}" |
         .services."chain-server".environment.POSTGRES_PASSWORD = "${POSTGRES_PASSWORD:-password}" |
         .services."chain-server".environment.POSTGRES_USER = "${POSTGRES_USER:-postgres}" |
         .services."chain-server".environment.POSTGRES_DB = "${POSTGRES_DB:-api}"' rag-app-multiturn-chatbot.json > temp.json && mv temp.json rag-app-multiturn-chatbot.json
      /usr/bin/python3 -c "import yaml, json, sys; print(yaml.safe_dump(json.load(sys.stdin), default_flow_style=False, sort_keys=False))" < rag-app-multiturn-chatbot.json > "${RAG_FOLDER}/${RAG_APP}/docker-compose.yaml"
      rm -rf rag-app-multiturn-chatbot.json

      # config compose.env
      cat << EOF > compose.env
      export MODEL_DIRECTORY="/opt/data/${MODEL_DIRECTORY}"
      export NGC_API_KEY=${APIKEY}
      export USERID=$(id -u)
      export LLM_MS_GPU_ID=${LLM_MS_GPU_ID}
      export EMBEDDING_MS_GPU_ID=${EMBEDDING_MS_GPU_ID}
    EOF

      if [[ -n "${PGVECTOR_VALUE}" && "${PGVECTOR_VALUE}" != "null" ]]; then 
        cat << EOF >> compose.env
        export POSTGRES_HOST_IP="${POSTGRES_HOST_IP}"
        export POSTGRES_PORT_NUMBER="${POSTGRES_PORT_NUMBER}"
        export POSTGRES_PASSWORD="${POSTGRES_PASSWORD}"
        export POSTGRES_USER="${POSTGRES_USER}"
        export POSTGRES_DB="${POSTGRES_DB}"
    EOF
      fi
    
      touch .initialize

      deploy_dcgm_exporter
    fi

    # start NGC RAG
    echo "Info: running the RAG application"
    source compose.env
    if [ -z "${PGVECTOR_VALUE}" ] || [ "${PGVECTOR_VALUE}" = "null" ]; then 
      echo "Info: running the pgvector container as the Vector Database"
      docker compose -f ${RAG_FOLDER}/${RAG_APP}/docker-compose.yaml --profile local-nim --profile pgvector up -d
    else
      echo "Info: using the provided DSM PGVector as the Vector Database"
      docker compose -f ${RAG_FOLDER}/${RAG_APP}/docker-compose.yaml --profile local-nim up -d
    fi
    
- path: /opt/dlvm/utils.sh
  permissions: '0755'
  content: |
    #!/bin/bash
    error_exit() {
      echo "Error: $1" >&2
      vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
      exit 1
    }

    check_protocol() {
      local proxy_url=$1
      shift
      local supported_protocols=("$@")
      if [[ -n "${proxy_url}" ]]; then
        local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
        if [ -z "$protocol" ]; then
          echo "No specific protocol provided. Skipping protocol check."
          return 0
        fi
        local protocol_included=false
        for var in "${supported_protocols[@]}"; do
          if [[ "${protocol}" == "${var}" ]]; then
            protocol_included=true
            break
          fi
        done
        if [[ "${protocol_included}" == false ]]; then
          error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
        fi
      fi
    }

    # $@: list of supported protocols
    set_proxy() {
      local supported_protocols=("$@")

      CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)

      HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
      HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
      if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
        echo "Info: The config-json was parsed, but no proxy settings were found."
        return 0
      fi

      check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
      check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"

      if ! grep -q 'http_proxy' /etc/environment; then
        sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
        export https_proxy=${HTTPS_PROXY_URL}
        export HTTP_PROXY=${HTTP_PROXY_URL}
        export HTTPS_PROXY=${HTTPS_PROXY_URL}
        export no_proxy=localhost,127.0.0.1" >> /etc/environment'
        source /etc/environment
      fi
      
      # Configure Docker to use a proxy
      sudo mkdir -p /etc/systemd/system/docker.service.d
      sudo bash -c 'echo "[Service]
      Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
      Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
      Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
      sudo systemctl daemon-reload
      sudo systemctl restart docker

      echo "Info: docker and system environment are now configured to use the proxy settings"
    }

    deploy_dcgm_exporter() {
      CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')

      DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
      DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
      if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
        echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
        docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
      else
        echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
        docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
      fi
    }

  • Enter the vGPU guest driver installation properties, such as vgpu-license and nvidia-portal-api-key.
  • If necessary, provide the values of the properties required for a disconnected environment.

See the section OVF Properties of Deep Learning VMs.
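
The optional .rag.pgvector value that the script reads from the config-json OVF property is expected to be a PostgreSQL connection URL. The awk -F[:@/] calls in the script split that URL on the :, @, and / characters, so fields 4 through 8 yield the user, password, host, port, and database name. A minimal sketch of the config-json content, with hypothetical values:

  {
    "rag": {
      "pgvector": "postgres://pgadmin:my_password@10.0.0.5:5432/ragdb"
    }
  }

Encode this JSON in base64 and set it as the value of the config-json OVF property.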

Output
  • vGPU guest driver installation logs in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • cloud-init script logs in /var/log/dl.log.

    To track the progress of the deployment, run tail -f /var/log/dl.log.

  • Sample chatbot web application, accessible at http://dl_vm_ip:3001. A verification sketch follows this list.

    You can upload your own knowledge base.
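
To spot-check the deployed services from the VM console, you can run commands such as the following (the DCGM Exporter endpoint assumes the default localhost binding configured by deploy_dcgm_exporter):

  # List the running RAG and DCGM Exporter containers
  docker ps

  # Confirm that the DCGM Exporter is serving vGPU metrics
  curl -s localhost:9400/metrics | head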

Assign a Static IP Address to a Deep Learning VM in VMware Private AI Foundation with NVIDIA

By default, deep learning VM images are configured with DHCP address assignment. To deploy a deep learning VM with a static IP address directly on a vSphere cluster, you must add extra code to the cloud-init section.

On vSphere with Tanzu, IP address assignment is determined by the network configuration of the Supervisor in NSX.

Procedure

  1. Create a cloud-init script in plain-text format for the DL workload that you plan to use.
  2. Add the following additional code to the cloud-init script.
    #cloud-config
    <instructions_for_your_DL_workload>
    
    manage_etc_hosts: true
     
    write_files:
      - path: /etc/netplan/50-cloud-init.yaml
        permissions: '0600'
        content: |
          network:
            version: 2
            renderer: networkd
            ethernets:
              ens33:
                dhcp4: false # disable DHCP4
                addresses: [x.x.x.x/x]  # Set the static IP address and mask
                routes:
                    - to: default
                      via: x.x.x.x # Configure gateway
                nameservers:
                  addresses: [x.x.x.x, x.x.x.x] # Provide the DNS server addresses. Separate multiple DNS server addresses with commas.
     
    runcmd:
      - netplan apply
  3. Encode the resulting cloud-init script in base64 format. A command sketch follows this procedure.
  4. Set the resulting base64-encoded cloud-init script as the value of the user-data OVF parameter of the deep learning VM image.
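
A minimal sketch of step 3 from a Linux shell, assuming the combined script from steps 1 and 2 is saved as static-ip-cloud-init.yaml (the file name is an assumption):

  # -w 0 produces a single line without wrapping, as expected by the OVF property
  base64 -w 0 static-ip-cloud-init.yaml

The single-line output is the value to set on the user-data OVF parameter in step 4.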

Example: Assigning a Static IP Address to a CUDA Sample Workload

For an example deep learning VM with a CUDA Sample DL workload that uses the following values:

Deep Learning VM Element Example Value
DL workload image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
IP address 10.199.118.245
Subnet prefix /25
Gateway 10.199.118.253
DNS servers
  • 10.142.7.1
  • 10.132.7.1

provide the following cloud-init code:

I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBkb2NrZXIgcnVuIC1kIG52Y3IuaW8vbnZpZGlhL2s4cy9jdWRhLXNhbXBsZTp2ZWN0b3JhZGQtY3VkYTExLjcuMS11Ymk4CgptYW5hZ2VfZXRjX2hvc3RzOiB0cnVlCiAKd3JpdGVfZmlsZXM6CiAgLSBwYXRoOiAvZXRjL25ldHBsYW4vNTAtY2xvdWQtaW5pdC55YW1sCiAgICBwZXJtaXNzaW9uczogJzA2MDAnCiAgICBjb250ZW50OiB8CiAgICAgIG5ldHdvcms6CiAgICAgICAgdmVyc2lvbjogMgogICAgICAgIHJlbmRlcmVyOiBuZXR3b3JrZAogICAgICAgIGV0aGVybmV0czoKICAgICAgICAgIGVuczMzOgogICAgICAgICAgICBkaGNwNDogZmFsc2UgIyBkaXNhYmxlIERIQ1A0CiAgICAgICAgICAgIGFkZHJlc3NlczogWzEwLjE5OS4xMTguMjQ1LzI1XSAgIyBTZXQgdGhlIHN0YXRpYyBJUCBhZGRyZXNzIGFuZCBtYXNrCiAgICAgICAgICAgIHJvdXRlczoKICAgICAgICAgICAgICAgIC0gdG86IGRlZmF1bHQKICAgICAgICAgICAgICAgICAgdmlhOiAxMC4xOTkuMTE4LjI1MyAjIENvbmZpZ3VyZSBnYXRld2F5CiAgICAgICAgICAgIG5hbWVzZXJ2ZXJzOgogICAgICAgICAgICAgIGFkZHJlc3NlczogWzEwLjE0Mi43LjEsIDEwLjEzMi43LjFdICMgUHJvdmlkZSB0aGUgRE5TIHNlcnZlciBhZGRyZXNzLiBTZXBhcmF0ZSBtdWxpdHBsZSBETlMgc2VydmVyIGFkZHJlc3NlcyB3aXRoIGNvbW1hcy4KIApydW5jbWQ6CiAgLSBuZXRwbGFuIGFwcGx5

which corresponds to the following script in plain-text format:

#cloud-config
write_files:
- path: /opt/dlvm/dl_app.sh
  permissions: '0755'
  content: |
    #!/bin/bash
    docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8

manage_etc_hosts: true
 
write_files:
  - path: /etc/netplan/50-cloud-init.yaml
    permissions: '0600'
    content: |
      network:
        version: 2
        renderer: networkd
        ethernets:
          ens33:
            dhcp4: false # disable DHCP4
            addresses: [10.199.118.245/25]  # Set the static IP address and mask
            routes:
                - to: default
                  via: 10.199.118.253 # Configure gateway
            nameservers:
              addresses: [10.142.7.1, 10.132.7.1] # Provide the DNS server addresses. Separate multiple DNS server addresses with commas.
 
runcmd:
  - netplan apply
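
After the VM boots, you can confirm that the static configuration took effect from the VM console (the interface name ens33 matches the netplan template above):

  ip addr show ens33      # expect the configured address, for example 10.199.118.245/25
  ip route show default   # expect the configured gateway, for example via 10.199.118.253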

Configure a Deep Learning VM with a Proxy Server

To connect your deep learning VM to the internet in a disconnected environment where internet access goes through a proxy server, you must provide the proxy server details in the config.json file of the virtual machine.

Procedure

  1. Create a JSON file with the proxy server properties.
    Proxy server that does not require authentication
    {  
      "http_proxy": "protocol://ip-address-or-fqdn:port",
      "https_proxy": "protocol://ip-address-or-fqdn:port"
    }
    Proxy server that requires authentication
    {  
      "http_proxy": "protocol://username:password@ip-address-or-fqdn:port",
      "https_proxy": "protocol://username:password@ip-address-or-fqdn:port"
    }

    Where:

    • protocol is the communication protocol used by the proxy server, such as http or https.
    • username and password are the credentials for authenticating with the proxy server. If the proxy server does not require authentication, omit these parameters.
    • ip-address-or-fqdn is the IP address or host name of the proxy server.
    • port is the port number on which the proxy server listens for incoming requests.
  2. Encode the resulting JSON in base64 format. A command sketch follows this procedure.
  3. When you deploy the deep learning VM image, add the encoded value to the config-json OVF property.
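
A minimal sketch of step 2 from a Linux shell, with hypothetical proxy values:

  # Create the proxy configuration (the host name and port are placeholders)
  cat > proxy.json <<'EOF'
  {
    "http_proxy": "http://proxy.example.com:3128",
    "https_proxy": "http://proxy.example.com:3128"
  }
  EOF

  # Produce the single-line base64 value for the config-json OVF property
  base64 -w 0 proxy.json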