When you deploy a deep learning VM in the vSphere IaaS control plane by using kubectl, or directly on a vSphere cluster, you must enter custom VM properties.

For information about deep learning VM images in VMware Private AI Foundation with NVIDIA, see About Deep Learning VM Images in VMware Private AI Foundation with NVIDIA.

OVF Properties of Deep Learning VMs

When you deploy a deep learning VM, you must fill in custom VM properties to automate the configuration of the Linux operating system, the deployment of the vGPU guest driver, and the deployment and configuration of NGC containers for the DL workloads.

The latest deep learning VM image has the following OVF properties:

Each entry below lists the parameter, its label in the vSphere Client (in parentheses), and its description.

Base OS properties
  • instance-id (Instance ID). Required. A unique instance ID for the VM instance.

    An instance ID uniquely identifies an instance. When an instance ID changes, cloud-init treats the instance as a new instance and runs the cloud-init process again.

  • hostname (Hostname). Required. The host name of the appliance.
  • seedfrom (URL to seed instance data from). Optional. A URL from which to pull the value for the user-data parameter and the metadata.
  • public-keys (SSH public key). If provided, the instance populates the SSH authorized_keys of the default user with this value.
  • user-data (Encoded user-data). A set of scripts or other metadata that is inserted into the VM at provisioning time.

    This property holds the actual content of the cloud-init script. This value must be base64-encoded.

  • password (Default user password). Required. The password for the default vmware user account.

vGPU driver installation
  • vgpu-license (vGPU license). Required. The configuration token of the NVIDIA vGPU client. The token is saved in the /etc/nvidia/ClientConfigToken/client_configuration_token.tok file.
  • nvidia-portal-api-key (NVIDIA portal API key). Required in a connected environment. The API key that you downloaded from the NVIDIA Licensing Portal. The key is required for the installation of the vGPU guest driver.
  • vgpu-host-driver-version (vGPU host driver version). Install this version of the vGPU guest driver directly.
  • vgpu-url (URL for air-gapped vGPU downloads). Required in a disconnected environment. The URL from which to download the vGPU guest driver. For information about the required configuration of the local web server, see Preparing VMware Cloud Foundation for Private AI Workload Deployment.

DL workload automation
  • registry-uri (Registry URI). Required in a disconnected environment, or if you want to use a private container registry to avoid downloading images from the Internet. The URI of a private container registry that contains the deep learning workload container images.

    Required if you reference a private registry in user-data or image-oneliner.

  • registry-user (Registry user name). Required if you use a private container registry that requires basic authentication.
  • registry-passwd (Registry password). Required if you use a private container registry that requires basic authentication.
  • registry-2-uri (Secondary registry URI). Required if you use a second private container registry that is based on Docker and requires basic authentication.

    For example, when you deploy a deep learning VM with the NVIDIA RAG DL workload preinstalled, a pgvector image is downloaded from Docker Hub. You can use the registry-2- parameters to work around a pull rate limit on docker.io.

  • registry-2-user (Secondary registry user name). Required if you use a second private container registry.
  • registry-2-passwd (Secondary registry password). Required if you use a second private container registry.
  • image-oneliner (Encoded one-line command). A one-line bash command that is run at VM provisioning. This value must be base64-encoded.

    You can use this property to specify the DL workload container that you want to deploy, for example, PyTorch or TensorFlow. For more information, see Deep Learning Workloads in VMware Private AI Foundation with NVIDIA.

    Caution: Avoid using user-data and image-oneliner at the same time.

  • docker-compose-uri (Encoded docker compose file). Required if you need a docker compose file to start the DL workload container. The content of the docker-compose.yaml file that is inserted into the virtual machine at provisioning, after the virtual machine starts with the GPU enabled. This value must be base64-encoded.
  • config-json (Encoded config.json). The content of a configuration file for adding details such as the proxy settings (http_proxy and https_proxy) and the export_dcgm_to_public and enable_jupyter_auth flags that the cloud-init scripts in this documentation read from it. For an illustrative sketch, see after this list.

    This value must be base64-encoded.

  • conda-environment-install (Conda environment installation). A comma-separated list of the Conda environments that are automatically installed after the VM deployment completes, for example, pytorch2.3_py3.12,tf2.16.1_py3.12.

    Available environments: pytorch2.3_py3.12, pytorch1.13.1_py3.10, tf2.16.1_py3.12, and tf1.15.5_py3.7.
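
As an illustration of the config-json workflow, the following is a minimal sketch, assuming a Linux host with GNU coreutils. The key names are the ones read by the cloud-init scripts shown later in this documentation; the proxy host is a placeholder.

  # Write a sample config.json; proxy.example.com is a placeholder host
  cat > config.json <<'EOF'
  {
    "http_proxy": "http://proxy.example.com:3128",
    "https_proxy": "http://proxy.example.com:3128",
    "export_dcgm_to_public": "true",
    "enable_jupyter_auth": true
  }
  EOF
  # Encode without line wrapping and paste the output into the
  # config-json OVF property.
  base64 -w 0 config.json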

Deep Learning Workloads in VMware Private AI Foundation with NVIDIA

You can deploy a deep learning VM with a supported deep learning (DL) workload in addition to its embedded components. The DL workloads are downloaded from the NVIDIA NGC catalog and are GPU-optimized and validated by NVIDIA and VMware by Broadcom.

For an overview of the deep learning VM images, see About Deep Learning VM Images in VMware Private AI Foundation with NVIDIA.

CUDA Sample

You can use a deep learning VM with running CUDA samples to explore vector addition, gravitational n-body simulation, or other samples on a VM. See the CUDA Samples page.

After the deep learning VM starts, it runs a CUDA sample workload to test the vGPU guest driver. You can examine the test output in the /var/log/dl.log file.
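
As a quick, informal check, you can inspect the log and the container state after startup. The exact log content depends on the image version; the vectorAdd sample typically reports success with a "Test PASSED" line.

  sudo tail -n 100 /var/log/dl.log
  sudo docker ps -a    # the cuda-sample container appears in the list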

Table 1. CUDA Sample Container Image
  • Container image: nvcr.io/nvidia/k8s/cuda-sample:ngc_image_tag
    For example: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8

For information about the CUDA sample container images supported for deep learning VMs, see the VMware Deep Learning VM Release Notes.

Required Inputs

To deploy a CUDA sample workload, set the OVF properties for the deep learning VM as follows:
  • Use one of the following properties, which are specific to the CUDA sample image.
    • Cloud-init script. Encode it in base64 format (see the encoding sketch after this list).
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          set_proxy "http" "https" "socks5"
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
          
          deploy_dcgm_exporter
      
          echo "Info: running the vectoradd CUDA container"
          CUDA_SAMPLE_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/cuda-sample"
          CUDA_SAMPLE_VERSION="ngc_image_tag"
          docker run -d $CUDA_SAMPLE_IMAGE:$CUDA_SAMPLE_VERSION
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
        
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

      For example, for vectoradd-cuda11.7.1-ubi8, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICBzZXRfcHJveHkgImh0dHAiICJodHRwcyIgInNvY2tzNSIKICAgIHRyYXAgJ2Vycm9yX2V4aXQgIlVuZXhwZWN0ZWQgZXJyb3Igb2NjdXJzIGF0IGRsIHdvcmtsb2FkIicgRVJSCiAgICBERUZBVUxUX1JFR19VUkk9Im52Y3IuaW8iCiAgICBSRUdJU1RSWV9VUklfUEFUSD0kKGdyZXAgcmVnaXN0cnktdXJpIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKCiAgICBpZiBbWyAteiAiJFJFR0lTVFJZX1VSSV9QQVRIIiBdXTsgdGhlbgogICAgICAjIElmIFJFR0lTVFJZX1VSSV9QQVRIIGlzIG51bGwgb3IgZW1wdHksIHVzZSB0aGUgZGVmYXVsdCB2YWx1ZQogICAgICBSRUdJU1RSWV9VUklfUEFUSD0kREVGQVVMVF9SRUdfVVJJCiAgICAgIGVjaG8gIlJFR0lTVFJZX1VSSV9QQVRIIHdhcyBlbXB0eS4gVXNpbmcgZGVmYXVsdDogJFJFR0lTVFJZX1VSSV9QQVRIIgogICAgZmkKICAgIAogICAgIyBJZiBSRUdJU1RSWV9VUklfUEFUSCBjb250YWlucyAnLycsIGV4dHJhY3QgdGhlIFVSSSBwYXJ0CiAgICBpZiBbWyAkUkVHSVNUUllfVVJJX1BBVEggPT0gKiIvIiogXV07IHRoZW4KICAgICAgUkVHSVNUUllfVVJJPSQoZWNobyAiJFJFR0lTVFJZX1VSSV9QQVRIIiB8IGN1dCAtZCcvJyAtZjEpCiAgICBlbHNlCiAgICAgIFJFR0lTVFJZX1VSST0kUkVHSVNUUllfVVJJX1BBVEgKICAgIGZpCiAgCiAgICBSRUdJU1RSWV9VU0VSTkFNRT0kKGdyZXAgcmVnaXN0cnktdXNlciAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICBSRUdJU1RSWV9QQVNTV09SRD0kKGdyZXAgcmVnaXN0cnktcGFzc3dkIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgIGlmIFtbIC1uICIkUkVHSVNUUllfVVNFUk5BTUUiICYmIC1uICIkUkVHSVNUUllfUEFTU1dPUkQiIF1dOyB0aGVuCiAgICAgIGRvY2tlciBsb2dpbiAtdSAkUkVHSVNUUllfVVNFUk5BTUUgLXAgJFJFR0lTVFJZX1BBU1NXT1JEICRSRUdJU1RSWV9VUkkKICAgIGVsc2UKICAgICAgZWNobyAiV2FybmluZzogdGhlIHJlZ2lzdHJ5J3MgdXNlcm5hbWUgYW5kIHBhc3N3b3JkIGFyZSBpbnZhbGlkLCBTa2lwcGluZyBEb2NrZXIgbG9naW4uIgogICAgZmkKICAgIAogICAgZGVwbG95X2RjZ21fZXhwb3J0ZXIKCiAgICBlY2hvICJJbmZvOiBydW5uaW5nIHRoZSB2ZWN0b3JhZGQgQ1VEQSBjb250YWluZXIiCiAgICBDVURBX1NBTVBMRV9JTUFHRT0iJFJFR0lTVFJZX1VSSV9QQVRIL252aWRpYS9rOHMvY3VkYS1zYW1wbGUiCiAgICBDVURBX1NBTVBMRV9WRVJTSU9OPSJ2ZWN0b3JhZGQtY3VkYTExLjcuMS11Ymk4IgogICAgZG9ja2VyIHJ1biAtZCAkQ1VEQV9TQU1QTEVfSU1BR0U6JENVREFfU0FNUExFX1ZFUlNJT04KCi0gcGF0aDogL29wdC9kbHZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1cHBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgc
HJvdG9jb2xzCiAgICBzZXRfcHJveHkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLXogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQogIAogICAgICBjaGVja19wcm90b2NvbCAiJHtIVFRQX1BST1hZX1VSTH0iICIke3N1cHBvcnRlZF9wcm90b2NvbHNbQF19IgogICAgICBjaGVja19wcm90b2NvbCAiJHtIVFRQU19QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKCiAgICAgIGlmICEgZ3JlcCAtcSAnaHR0cF9wcm94eScgL2V0Yy9lbnZpcm9ubWVudDsgdGhlbgogICAgICAgIHN1ZG8gYmFzaCAtYyAnZWNobyAiZXhwb3J0IGh0dHBfcHJveHk9JHtIVFRQX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgaHR0cHNfcHJveHk9JHtIVFRQU19QUk9YWV9VUkx9CiAgICAgICAgZXhwb3J0IEhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUFNfUFJPWFk9JHtIVFRQU19QUk9YWV9VUkx9CiAgICAgICAgZXhwb3J0IG5vX3Byb3h5PWxvY2FsaG9zdCwxMjcuMC4wLjEiID4+IC9ldGMvZW52aXJvbm1lbnQnCiAgICAgICAgc291cmNlIC9ldGMvZW52aXJvbm1lbnQKICAgICAgZmkKICAgICAgCiAgICAgICMgQ29uZmlndXJlIERvY2tlciB0byB1c2UgYSBwcm94eQogICAgICBzdWRvIG1rZGlyIC1wIC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZAogICAgICBzdWRvIGJhc2ggLWMgJ2VjaG8gIltTZXJ2aWNlXQogICAgICBFbnZpcm9ubWVudD1cIkhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIkhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiTk9fUFJPWFk9bG9jYWxob3N0LDEyNy4wLjAuMVwiIiA+IC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZC9wcm94eS5jb25mJwogICAgICBzdWRvIHN5c3RlbWN0bCBkYWVtb24tcmVsb2FkCiAgICAgIHN1ZG8gc3lzdGVtY3RsIHJlc3RhcnQgZG9ja2VyCgogICAgICBlY2hvICJJbmZvOiBkb2NrZXIgYW5kIHN5c3RlbSBlbnZpcm9ubWVudCBhcmUgbm93IGNvbmZpZ3VyZWQgdG8gdXNlIHRoZSBwcm94eSBzZXR0aW5ncyIKICAgIH0KCiAgICBkZXBsb3lfZGNnbV9leHBvcnRlcigpIHsKICAgICAgQ09ORklHX0pTT05fQkFTRTY0PSQoZ3JlcCAnY29uZmlnLWpzb24nIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgICAgQ09ORklHX0pTT049JChlY2hvICR7Q09ORklHX0pTT05fQkFTRTY0fSB8IGJhc2U2NCAtLWRlY29kZSkKICAgICAgRENHTV9FWFBPUlRfUFVCTElDPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5leHBvcnRfZGNnbV90b19wdWJsaWMgLy8gZW1wdHknKQoKICAgICAgRENHTV9FWFBPUlRFUl9JTUFHRT0iJFJFR0lTVFJZX1VSSV9QQVRIL252aWRpYS9rOHMvZGNnbS1leHBvcnRlciIKICAgICAgRENHTV9FWFBPUlRFUl9WRVJTSU9OPSIzLjIuNS0zLjEuOC11YnVudHUyMi4wNCIKICAgICAgaWYgWyAteiAiJHtEQ0dNX0VYUE9SVF9QVUJMSUN9IiBdIHx8IFsgIiR7RENHTV9FWFBPUlRfUFVCTElDfSIgIT0gInRydWUiIF07IHRoZW4KICAgICAgICBlY2hvICJJbmZvOiBsYXVuY2hpbmcgRENHTSBFeHBvcnRlciB0byBjb2xsZWN0IHZHUFUgbWV0cmljcywgbGlzdGVuaW5nIG9ubHkgb24gbG9jYWxob3N0ICgxMjcuMC4wLjE6OTQwMCkiCiAgICAgICAgZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tY2FwLWFkZCBTWVNfQURNSU4gLXAgMTI3LjAuMC4xOjk0MDA6OTQwMCAkRENHTV9FWFBPUlRFUl9JTUFHRTokRENHTV9FWFBPUlRFUl9WRVJTSU9OCiAgICAgIGVsc2UKICAgICAgICBlY2hvICJJbmZvOiBsYXVuY2hpbmcgRENHTSBFeHBvcnRlciB0byBjb2xsZWN0IHZHUFUgbWV0cmljcywgZXhwb3NlZCBvbiBhbGwgbmV0d29yayBpbnRlcmZhY2VzICgwLjAuMC4wOjk0MDApIgogICAgICAgIGRvY2tlciBydW4gLWQgLS1ncHVzIGFsbCAtLWNhcC1hZGQgU1lTX0FETUlOIC1wIDk0MDA6OTQwMCAkRENHTV9FWFBPUlRFUl9JTUFHRTokRENHTV9FWFBPUlRFUl9WRVJTSU9OCiAgICAgIGZpCiAgICB9

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          set_proxy "http" "https" "socks5"
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
          
          deploy_dcgm_exporter
      
          echo "Info: running the vectoradd CUDA container"
          CUDA_SAMPLE_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/cuda-sample"
          CUDA_SAMPLE_VERSION="vectoradd-cuda11.7.1-ubi8"
          docker run -d $CUDA_SAMPLE_IMAGE:$CUDA_SAMPLE_VERSION
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
        
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
    • One-line image command. Encode it in base64 format.
      docker run -d nvcr.io/nvidia/k8s/cuda-sample:ngc_image_tag

      For example, for vectoradd-cuda11.7.1-ubi8, provide the following command in base64 format:

      ZG9ja2VyIHJ1biAtZCBudmNyLmlvL252aWRpYS9rOHMvY3VkYS1zYW1wbGU6dmVjdG9yYWRkLWN1ZGExMS43LjEtdWJpOA==

      which corresponds to the following command in plain-text format:

      docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
  • Enter the installation properties for the vGPU guest driver, such as vgpu-license and nvidia-portal-api-key.
  • As needed, provide values for the properties required for a disconnected environment.

For more information, see OVF Properties of Deep Learning VMs.
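
The base64 values shown above can be produced with standard tools. A minimal sketch, assuming GNU coreutils and a hypothetical local file cloud-init-cuda.yaml that contains the cloud-init script:

  # Value for the user-data OVF property
  base64 -w 0 cloud-init-cuda.yaml

  # Value for the image-oneliner OVF property; printf '%s' avoids
  # encoding a trailing newline
  printf '%s' "docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8" | base64 -w 0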

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed and the license is allocated, run the following command:

    nvidia-smi -q | grep -i license
  • Cloud-init script logs in /var/log/dl.log.
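
The cloud-init script in this section also starts a DCGM Exporter container for vGPU metrics. As a minimal check from inside the VM, you can confirm that it is serving Prometheus metrics on port 9400 (use the VM IP address instead of 127.0.0.1 if you set export_dcgm_to_public to true):

  curl -s http://127.0.0.1:9400/metrics | head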

PyTorch

You can use a deep learning VM with a PyTorch library to explore conversational AI, NLP, and other types of AI models on a VM. See the PyTorch page.

After the deep learning VM starts, it starts a JupyterLab instance with the PyTorch packages installed and configured.

Table 2. PyTorch Container Image
  • Container image: nvcr.io/nvidia/pytorch-pb24h1:ngc_image_tag
    For example: nvcr.io/nvidia/pytorch-pb24h1:24.03.02-py3

For information about the PyTorch container images supported for deep learning VMs, see the VMware Deep Learning VM Release Notes.

Required Inputs

To deploy a PyTorch workload, set the OVF properties for the deep learning VM as follows:
  • Use one of the following properties, which are specific to the PyTorch image.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
          enableJupyterAuth=$(echo "${CONFIG_JSON}" | jq -r '.enable_jupyter_auth // empty')
      
          if [ -z "${enableJupyterAuth}" ] || [ "${enableJupyterAuth}" == true ]; then
            # Generate a random jupyter token
            TOKEN=$(python3 -c "import secrets; print(secrets.token_hex(32))")
            # Set the token to guestinfo
            vmtoolsd --cmd "info-set guestinfo.dlworkload.jupyterlab.token $TOKEN"
            echo "Info: JupyterLab notebook access token, $TOKEN"
          else
            TOKEN=""
          fi
      
          echo "Info: running the PyTorch container"
          PYTORCH_IMAGE="$REGISTRY_URI_PATH/nvidia/pytorch-pb24h1"
          PYTORCH_VERSION="ngc_image_tag"
          docker run -d --gpus all -p 8888:8888 $PYTORCH_IMAGE:$PYTORCH_VERSION /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

      For example, for pytorch-pb24h1:24.03.02-py3, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCgogICAgREVGQVVMVF9SRUdfVVJJPSJudmNyLmlvIgogICAgUkVHSVNUUllfVVJJX1BBVEg9JChncmVwIHJlZ2lzdHJ5LXVyaSAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCgogICAgaWYgW1sgLXogIiRSRUdJU1RSWV9VUklfUEFUSCIgXV07IHRoZW4KICAgICAgIyBJZiBSRUdJU1RSWV9VUklfUEFUSCBpcyBudWxsIG9yIGVtcHR5LCB1c2UgdGhlIGRlZmF1bHQgdmFsdWUKICAgICAgUkVHSVNUUllfVVJJX1BBVEg9JERFRkFVTFRfUkVHX1VSSQogICAgICBlY2hvICJSRUdJU1RSWV9VUklfUEFUSCB3YXMgZW1wdHkuIFVzaW5nIGRlZmF1bHQ6ICRSRUdJU1RSWV9VUklfUEFUSCIKICAgIGZpCiAgICAKICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggY29udGFpbnMgJy8nLCBleHRyYWN0IHRoZSBVUkkgcGFydAogICAgaWYgW1sgJFJFR0lTVFJZX1VSSV9QQVRIID09ICoiLyIqIF1dOyB0aGVuCiAgICAgIFJFR0lTVFJZX1VSST0kKGVjaG8gIiRSRUdJU1RSWV9VUklfUEFUSCIgfCBjdXQgLWQnLycgLWYxKQogICAgZWxzZQogICAgICBSRUdJU1RSWV9VUkk9JFJFR0lTVFJZX1VSSV9QQVRICiAgICBmaQogIAogICAgUkVHSVNUUllfVVNFUk5BTUU9JChncmVwIHJlZ2lzdHJ5LXVzZXIgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgUkVHSVNUUllfUEFTU1dPUkQ9JChncmVwIHJlZ2lzdHJ5LXBhc3N3ZCAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICBpZiBbWyAtbiAiJFJFR0lTVFJZX1VTRVJOQU1FIiAmJiAtbiAiJFJFR0lTVFJZX1BBU1NXT1JEIiBdXTsgdGhlbgogICAgICBkb2NrZXIgbG9naW4gLXUgJFJFR0lTVFJZX1VTRVJOQU1FIC1wICRSRUdJU1RSWV9QQVNTV09SRCAkUkVHSVNUUllfVVJJCiAgICBlbHNlCiAgICAgIGVjaG8gIldhcm5pbmc6IHRoZSByZWdpc3RyeSdzIHVzZXJuYW1lIGFuZCBwYXNzd29yZCBhcmUgaW52YWxpZCwgU2tpcHBpbmcgRG9ja2VyIGxvZ2luLiIKICAgIGZpCgogICAgZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC1wIDg4ODg6ODg4OCAkUkVHSVNUUllfVVJJX1BBVEgvbnZpZGlhL3B5dG9yY2g6MjMuMTAtcHkzIC91c3IvbG9jYWwvYmluL2p1cHl0ZXIgbGFiIC0tYWxsb3ctcm9vdCAtLWlwPSogLS1wb3J0PTg4ODggLS1uby1icm93c2VyIC0tTm90ZWJvb2tBcHAudG9rZW49JycgLS1Ob3RlYm9va0FwcC5hbGxvd19vcmlnaW49JyonIC0tbm90ZWJvb2stZGlyPS93b3Jrc3BhY2UKCi0gcGF0aDogL29wdC9kbHZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1cHBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgcHJvdG9jb2xzCiAgICBzZXRfcHJve
HkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLXogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQoKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUF9QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUFNfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCgogICAgICBpZiAhIGdyZXAgLXEgJ2h0dHBfcHJveHknIC9ldGMvZW52aXJvbm1lbnQ7IHRoZW4KICAgICAgICBlY2hvICJleHBvcnQgaHR0cF9wcm94eT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBodHRwc19wcm94eT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgbm9fcHJveHk9bG9jYWxob3N0LDEyNy4wLjAuMSIgPj4gL2V0Yy9lbnZpcm9ubWVudAogICAgICAgIHNvdXJjZSAvZXRjL2Vudmlyb25tZW50CiAgICAgIGZpCiAgICAgIAogICAgICAjIENvbmZpZ3VyZSBEb2NrZXIgdG8gdXNlIGEgcHJveHkKICAgICAgbWtkaXIgLXAgL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kCiAgICAgIGVjaG8gIltTZXJ2aWNlXQogICAgICBFbnZpcm9ubWVudD1cIkhUVFBfUFJPWFk9JHtIVFRQX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIkhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiTk9fUFJPWFk9bG9jYWxob3N0LDEyNy4wLjAuMVwiIiA+IC9ldGMvc3lzdGVtZC9zeXN0ZW0vZG9ja2VyLnNlcnZpY2UuZC9wcm94eS5jb25mCiAgICAgIHN5c3RlbWN0bCBkYWVtb24tcmVsb2FkCiAgICAgIHN5c3RlbWN0bCByZXN0YXJ0IGRvY2tlcgoKICAgICAgZWNobyAiSW5mbzogZG9ja2VyIGFuZCBzeXN0ZW0gZW52aXJvbm1lbnQgYXJlIG5vdyBjb25maWd1cmVkIHRvIHVzZSB0aGUgcHJveHkgc2V0dGluZ3MiCiAgICB9

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
          enableJupyterAuth=$(echo "${CONFIG_JSON}" | jq -r '.enable_jupyter_auth // empty')
      
          if [ -z "${enableJupyterAuth}" ] || [ "${enableJupyterAuth}" == true ]; then
            # Generate a random jupyter token
            TOKEN=$(python3 -c "import secrets; print(secrets.token_hex(32))")
            # Set the token to guestinfo
            vmtoolsd --cmd "info-set guestinfo.dlworkload.jupyterlab.token $TOKEN"
            echo "Info: JupyterLab notebook access token, $TOKEN"
          else
            TOKEN=""
          fi
      
          echo "Info: running the PyTorch container"
          PYTORCH_IMAGE="$REGISTRY_URI_PATH/nvidia/pytorch-pb24h1"
          PYTORCH_VERSION="24.03.02-py3"
          docker run -d --gpus all -p 8888:8888 $PYTORCH_IMAGE:$PYTORCH_VERSION /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
    • One-line image command. Encode it in base64 format (see the round-trip sketch after this list).
      docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch-pb24h1:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace

      For example, for pytorch-pb24h1:24.03.02-py3, provide the following command in base64 format:

      ZG9ja2VyIHJ1biAtZCAtcCA4ODg4Ojg4ODggbnZjci5pby9udmlkaWEvcHl0b3JjaC1wYjI0aDE6MjQuMDMuMDItcHkzIC91c3IvbG9jYWwvYmluL2p1cHl0ZXIgbGFiIC0tYWxsb3ctcm9vdCAtLWlwPSogLS1wb3J0PTg4ODggLS1uby1icm93c2VyIC0tTm90ZWJvb2tBcHAudG9rZW49JycgLS1Ob3RlYm9va0FwcC5hbGxvd19vcmlnaW49JyonIC0tbm90ZWJvb2stZGlyPS93b3Jrc3BhY2U=

      which corresponds to the following command in plain-text format:

      docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch-pb24h1:24.03.02-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
  • Enter the installation properties for the vGPU guest driver, such as vgpu-license and nvidia-portal-api-key.
  • As needed, provide values for the properties required for a disconnected environment.

For more information, see OVF Properties of Deep Learning VMs.
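
Before pasting an encoded value into an OVF property, you can verify that it round-trips correctly. A minimal sketch for the one-line command above, assuming GNU coreutils:

  # Encode without a trailing newline, then decode to confirm the result
  ENCODED=$(printf '%s' "docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch-pb24h1:24.03.02-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace" | base64 -w 0)
  printf '%s' "$ENCODED" | base64 --decode; echo   # prints the original command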

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.
  • PyTorch container.

    To verify that the PyTorch container is running, run the sudo docker ps -a and sudo docker logs container_id commands.

  • JupyterLab instance that you can access at http://dl_vm_ip:8888. If JupyterLab authentication is enabled, retrieve the access token as shown in the sketch after this list.

    In the JupyterLab terminal, verify that the following functionality is available in the notebook:

    • To verify that JupyterLab can access the vGPU resource, run nvidia-smi.
    • To verify that the PyTorch-related packages are installed, run pip show (for example, pip show torch).
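
When JupyterLab authentication is enabled (the default), the cloud-init script publishes the generated access token to guestinfo and logs it in /var/log/dl.log. A minimal sketch to read it back from inside the VM, using the info-get counterpart of the info-set call in the script:

  vmtoolsd --cmd "info-get guestinfo.dlworkload.jupyterlab.token"

Then open http://dl_vm_ip:8888/?token=<token_value> in a browser.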

TensorFlow

You can use a deep learning VM with a TensorFlow library to explore conversational AI, NLP, and other types of AI models on a VM. See the TensorFlow page.

After the deep learning VM starts, it starts a JupyterLab instance with the TensorFlow packages installed and configured.

Table 3. TensorFlow Container Image
  • Container image: nvcr.io/nvidia/tensorflow-pb24h1:ngc_image_tag
    For example: nvcr.io/nvidia/tensorflow-pb24h1:24.03.02-tf2-py3

For information about the TensorFlow container images supported for deep learning VMs, see the VMware Deep Learning VM Release Notes.

Required Inputs

To deploy a TensorFlow workload, set the OVF properties for the deep learning VM as follows:
  • Use one of the following properties, which are specific to the TensorFlow image.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
          enableJupyterAuth=$(echo "${CONFIG_JSON}" | jq -r '.enable_jupyter_auth // empty')
      
          if [ -z "${enableJupyterAuth}" ] || [ "${enableJupyterAuth}" == true ]; then
            # Generate a random jupyter token
            TOKEN=$(python3 -c "import secrets; print(secrets.token_hex(32))")
            # Set the token to guestinfo
            vmtoolsd --cmd "info-set guestinfo.dlworkload.jupyterlab.token $TOKEN"
            echo "Info: JupyterLab notebook access token, $TOKEN"
          else
            TOKEN=""
          fi
      
          echo "Info: running the Tensorflow container"    
          TENSORFLOW_IMAGE="$REGISTRY_URI_PATH/nvidia/tensorflow-pb24h1"
          TENSORFLOW_VERSION="ngc_image_tag"
          docker run -d --gpus all -p 8888:8888 $TENSORFLOW_IMAGE:$TENSORFLOW_VERSION /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
          
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

      For example, for "tensorflow-pb24h1:24.03.02-tf2-py3", provide the base64 encoding of the following plain-text script:


      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
          enableJupyterAuth=$(echo "${CONFIG_JSON}" | jq -r '.enable_jupyter_auth // empty')
      
          if [ -z "${enableJupyterAuth}" ] || [ "${enableJupyterAuth}" == true ]; then
            # Generate a random jupyter token
            TOKEN=$(python3 -c "import secrets; print(secrets.token_hex(32))")
            # Set the token to guestinfo
            vmtoolsd --cmd "info-set guestinfo.dlworkload.jupyterlab.token $TOKEN"
            echo "Info: JupyterLab notebook access token, $TOKEN"
          else
            TOKEN=""
          fi
      
          echo "Info: running the Tensorflow container"    
          TENSORFLOW_IMAGE="$REGISTRY_URI_PATH/nvidia/tensorflow-pb24h1"
          TENSORFLOW_VERSION="24.03.02-tf2-py3"
          docker run -d --gpus all -p 8888:8888 $TENSORFLOW_IMAGE:$TENSORFLOW_VERSION /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
          
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
    • One-line image. Encode it in base64 format (see the encoding sketch after this list).
      docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow-pb24h1:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace

      For example, for "tensorflow-pb24h1:24.03.02-tf2-py3", provide the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtcCA4ODg4Ojg4ODggbnZjci5pby9udmlkaWEvdGVuc29yZmxvdy1wYjI0aDE6MjQuMDMuMDItdGYyLXB5MyAvdXNyL2xvY2FsL2Jpbi9qdXB5dGVyIGxhYiAtLWFsbG93LXJvb3QgLS1pcD0qIC0tcG9ydD04ODg4IC0tbm8tYnJvd3NlciAtLU5vdGVib29rQXBwLnRva2VuPScnIC0tTm90ZWJvb2tBcHAuYWxsb3dfb3JpZ2luPScqJyAtLW5vdGVib29rLWRpcj0vd29ya3NwYWNl

      which corresponds to the following script in plain-text format:

      docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow-pb24h1:24.03.02-tf2-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
  • Enter the installation properties for the vGPU guest driver, such as vgpu-license and nvidia-portal-api-key.
  • Where needed, provide values for the properties that are required in a disconnected environment.
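
As a minimal sketch of the encoding step referenced above, assuming a Linux host with GNU coreutils and a plain-text script saved as user-data.txt (the file name is only an example), you can produce the base64 value for either property as follows:

    # Encode the plain-text cloud-init script or one-liner without line wrapping
    base64 -w0 user-data.txt
    # Optionally verify the result by decoding it back
    base64 -w0 user-data.txt | base64 -d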

For more information, see OVF Properties of Deep Learning VMs.

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.
  • TensorFlow container.

    To verify that the TensorFlow container is running, run the sudo docker ps -a and sudo docker logs container_id commands.

  • JupyterLab instance that you can access at http://dl_vm_ip:8888.

    In the JupyterLab terminal, verify that the following functionality is available in the notebook:

    • To verify that JupyterLab can access the vGPU resource, run nvidia-smi.
    • To verify that the packages related to TensorFlow are installed, run pip show.
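
    For example, a quick check in the JupyterLab terminal might look as follows; the package name tensorflow is an assumption, so adjust it to the package you want to inspect:

      # Confirm that the vGPU is visible from the notebook environment
      nvidia-smi
      # Show details of an installed package (the package name is an example)
      pip show tensorflow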

DCGM Exporter

You can use a deep learning VM with DCGM Exporter (Data Center GPU Manager) to monitor the health of the GPUs that a DL workload uses and to collect metrics from them, by using NVIDIA DCGM, Prometheus, and Grafana.

For more information, see the DCGM Exporter page.

In a deep learning VM, you run the DCGM Exporter container together with a DL workload that performs AI operations. After the deep learning VM starts, DCGM Exporter is ready to collect vGPU metrics and export the data to another application for further monitoring and visualization. You can run the monitored DL workload as part of the cloud-init process or from the command line after the virtual machine has started.
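
As a minimal sketch of such a check, assuming the default port mapping used in the scripts below, you can scrape the metrics endpoint from inside the deep learning VM:

  # Scrape the DCGM Exporter metrics endpoint (default port 9400)
  curl -s http://127.0.0.1:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL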

Table 4. DCGM Exporter Container Image
Component Description
Container image
nvcr.io/nvidia/k8s/dcgm-exporter:ngc_image_tag

Example:

nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04

For information about the DCGM Exporter container images that are supported for deep learning VMs, see the VMware Deep Learning VM Release Notes.

Required inputs To deploy a DCGM Exporter workload, you must set the OVF properties for the deep learning VM as follows:
  • Use one of the following properties, which are specific to the DCGM Exporter image.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          echo "Info: running the DCGM Export container"
          deploy_dcgm_exporter
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="ngc_image_tag"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

      For example, for a deep learning VM with the preinstalled DCGM Exporter "dcgm-exporter:3.2.5-3.1.8-ubuntu22.04", provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCiAgICAKICAgIERFRkFVTFRfUkVHX1VSST0ibnZjci5pbyIKICAgIFJFR0lTVFJZX1VSSV9QQVRIPSQoZ3JlcCByZWdpc3RyeS11cmkgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQoKICAgIGlmIFtbIC16ICIkUkVHSVNUUllfVVJJX1BBVEgiIF1dOyB0aGVuCiAgICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggaXMgbnVsbCBvciBlbXB0eSwgdXNlIHRoZSBkZWZhdWx0IHZhbHVlCiAgICAgIFJFR0lTVFJZX1VSSV9QQVRIPSRERUZBVUxUX1JFR19VUkkKICAgICAgZWNobyAiUkVHSVNUUllfVVJJX1BBVEggd2FzIGVtcHR5LiBVc2luZyBkZWZhdWx0OiAkUkVHSVNUUllfVVJJX1BBVEgiCiAgICBmaQogICAgCiAgICAjIElmIFJFR0lTVFJZX1VSSV9QQVRIIGNvbnRhaW5zICcvJywgZXh0cmFjdCB0aGUgVVJJIHBhcnQKICAgIGlmIFtbICRSRUdJU1RSWV9VUklfUEFUSCA9PSAqIi8iKiBdXTsgdGhlbgogICAgICBSRUdJU1RSWV9VUkk9JChlY2hvICIkUkVHSVNUUllfVVJJX1BBVEgiIHwgY3V0IC1kJy8nIC1mMSkKICAgIGVsc2UKICAgICAgUkVHSVNUUllfVVJJPSRSRUdJU1RSWV9VUklfUEFUSAogICAgZmkKICAKICAgIFJFR0lTVFJZX1VTRVJOQU1FPSQoZ3JlcCByZWdpc3RyeS11c2VyIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgIFJFR0lTVFJZX1BBU1NXT1JEPSQoZ3JlcCByZWdpc3RyeS1wYXNzd2QgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgaWYgW1sgLW4gIiRSRUdJU1RSWV9VU0VSTkFNRSIgJiYgLW4gIiRSRUdJU1RSWV9QQVNTV09SRCIgXV07IHRoZW4KICAgICAgZG9ja2VyIGxvZ2luIC11ICRSRUdJU1RSWV9VU0VSTkFNRSAtcCAkUkVHSVNUUllfUEFTU1dPUkQgJFJFR0lTVFJZX1VSSQogICAgZWxzZQogICAgICBlY2hvICJXYXJuaW5nOiB0aGUgcmVnaXN0cnkncyB1c2VybmFtZSBhbmQgcGFzc3dvcmQgYXJlIGludmFsaWQsIFNraXBwaW5nIERvY2tlciBsb2dpbi4iCiAgICBmaQoKICAgIGVjaG8gIkluZm86IHJ1bm5pbmcgdGhlIERDR00gRXhwb3J0IGNvbnRhaW5lciIKICAgIGRlcGxveV9kY2dtX2V4cG9ydGVyCgotIHBhdGg6IC9vcHQvZGx2bS91dGlscy5zaAogIHBlcm1pc3Npb25zOiAnMDc1NScKICBjb250ZW50OiB8CiAgICAjIS9iaW4vYmFzaAogICAgZXJyb3JfZXhpdCgpIHsKICAgICAgZWNobyAiRXJyb3I6ICQxIiA+JjIKICAgICAgdm10b29sc2QgLS1jbWQgImluZm8tc2V0IGd1ZXN0aW5mby52bXNlcnZpY2UuYm9vdHN0cmFwLmNvbmRpdGlvbiBmYWxzZSwgRExXb3JrbG9hZEZhaWx1cmUsICQxIgogICAgICBleGl0IDEKICAgIH0KCiAgICBjaGVja19wcm90b2NvbCgpIHsKICAgICAgbG9jYWwgcHJveHlfdXJsPSQxCiAgICAgIHNoaWZ0CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCiAgICAgIGlmIFtbIC1uICIke3Byb3h5X3VybH0iIF1dOyB0aGVuCiAgICAgICAgbG9jYWwgcHJvdG9jb2w9JChlY2hvICIke3Byb3h5X3VybH0iIHwgYXdrIC1GICc6Ly8nICd7aWYgKE5GID4gMSkgcHJpbnQgJDE7IGVsc2UgcHJpbnQgIiJ9JykKICAgICAgICBpZiBbIC16ICIkcHJvdG9jb2wiIF07IHRoZW4KICAgICAgICAgIGVjaG8gIk5vIHNwZWNpZmljIHByb3RvY29sIHByb3ZpZGVkLiBTa2lwcGluZyBwcm90b2NvbCBjaGVjay4iCiAgICAgICAgICByZXR1cm4gMAogICAgICAgIGZpCiAgICAgICAgbG9jYWwgcHJvdG9jb2xfaW5jbHVkZWQ9ZmFsc2UKICAgICAgICBmb3IgdmFyIGluICIke3N1cHBvcnRlZF9wcm90b2NvbHNbQF19IjsgZG8KICAgICAgICAgIGlmIFtbICIke3Byb3RvY29sfSIgPT0gIiR7dmFyfSIgXV07IHRoZW4KICAgICAgICAgICAgcHJvdG9jb2xfaW5jbHVkZWQ9dHJ1ZQogICAgICAgICAgICBicmVhawogICAgICAgICAgZmkKICAgICAgICBkb25lCiAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2xfaW5jbHVkZWR9IiA9PSBmYWxzZSBdXTsgdGhlbgogICAgICAgICAgZXJyb3JfZXhpdCAiVW5zdXBwb3J0ZWQgcHJvdG9jb2w6ICR7cHJvdG9jb2x9LiBTdXBwb3J0ZWQgcHJvdG9jb2xzIGFyZTogJHtzdXBwb3J0ZWRfcHJvdG9jb2xzWypdfSIKICAgICAgICBmaQogICAgICBmaQogICAgfQoKICAgICMgJEA6IGxpc3Qgb2Ygc3VwcG9ydGVkIHByb3RvY29scwogICAgc2V0X3Byb3h5KCkgewogICAgICBsb2NhbCBzdXBwb3J0ZWRfcHJvdG9jb2xzPSgiJEAiKQoKICAgICAgQ09ORklHX0pTT05fQkFTRTY0PSQoZ3JlcCAnY29uZmlnLWpzb24nIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKI
CAgICAgQ09ORklHX0pTT049JChlY2hvICR7Q09ORklHX0pTT05fQkFTRTY0fSB8IGJhc2U2NCAtLWRlY29kZSkKCiAgICAgIEhUVFBfUFJPWFlfVVJMPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5odHRwX3Byb3h5IC8vIGVtcHR5JykKICAgICAgSFRUUFNfUFJPWFlfVVJMPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5odHRwc19wcm94eSAvLyBlbXB0eScpCiAgICAgIGlmIFtbICQ/IC1uZSAwIHx8ICgteiAiJHtIVFRQX1BST1hZX1VSTH0iICYmIC16ICIke0hUVFBTX1BST1hZX1VSTH0iKSBdXTsgdGhlbgogICAgICAgIGVjaG8gIkluZm86IFRoZSBjb25maWctanNvbiB3YXMgcGFyc2VkLCBidXQgbm8gcHJveHkgc2V0dGluZ3Mgd2VyZSBmb3VuZC4iCiAgICAgICAgcmV0dXJuIDAKICAgICAgZmkKCiAgICAgIGNoZWNrX3Byb3RvY29sICIke0hUVFBfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCiAgICAgIGNoZWNrX3Byb3RvY29sICIke0hUVFBTX1BST1hZX1VSTH0iICIke3N1cHBvcnRlZF9wcm90b2NvbHNbQF19IgoKICAgICAgaWYgISBncmVwIC1xICdodHRwX3Byb3h5JyAvZXRjL2Vudmlyb25tZW50OyB0aGVuCiAgICAgICAgc3VkbyBiYXNoIC1jICdlY2hvICJleHBvcnQgaHR0cF9wcm94eT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBodHRwc19wcm94eT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH0KICAgICAgICBleHBvcnQgbm9fcHJveHk9bG9jYWxob3N0LDEyNy4wLjAuMSIgPj4gL2V0Yy9lbnZpcm9ubWVudCcKICAgICAgICBzb3VyY2UgL2V0Yy9lbnZpcm9ubWVudAogICAgICBmaQogICAgICAKICAgICAgIyBDb25maWd1cmUgRG9ja2VyIHRvIHVzZSBhIHByb3h5CiAgICAgIHN1ZG8gbWtkaXIgLXAgL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kCiAgICAgIHN1ZG8gYmFzaCAtYyAnZWNobyAiW1NlcnZpY2VdCiAgICAgIEVudmlyb25tZW50PVwiSFRUUF9QUk9YWT0ke0hUVFBfUFJPWFlfVVJMfVwiCiAgICAgIEVudmlyb25tZW50PVwiSFRUUFNfUFJPWFk9JHtIVFRQU19QUk9YWV9VUkx9XCIKICAgICAgRW52aXJvbm1lbnQ9XCJOT19QUk9YWT1sb2NhbGhvc3QsMTI3LjAuMC4xXCIiID4gL2V0Yy9zeXN0ZW1kL3N5c3RlbS9kb2NrZXIuc2VydmljZS5kL3Byb3h5LmNvbmYnCiAgICAgIHN1ZG8gc3lzdGVtY3RsIGRhZW1vbi1yZWxvYWQKICAgICAgc3VkbyBzeXN0ZW1jdGwgcmVzdGFydCBkb2NrZXIKCgogICAgICBlY2hvICJJbmZvOiBkb2NrZXIgYW5kIHN5c3RlbSBlbnZpcm9ubWVudCBhcmUgbm93IGNvbmZpZ3VyZWQgdG8gdXNlIHRoZSBwcm94eSBzZXR0aW5ncyIKICAgIH0KCiAgICBkZXBsb3lfZGNnbV9leHBvcnRlcigpIHsKICAgICAgQ09ORklHX0pTT05fQkFTRTY0PSQoZ3JlcCAnY29uZmlnLWpzb24nIC9vcHQvZGx2bS9vdmYtZW52LnhtbCB8IHNlZCAtbiAncy8uKm9lOnZhbHVlPSJcKFteIl0qXCkuKi9cMS9wJykKICAgICAgQ09ORklHX0pTT049JChlY2hvICR7Q09ORklHX0pTT05fQkFTRTY0fSB8IGJhc2U2NCAtLWRlY29kZSkKICAgICAgRENHTV9FWFBPUlRfUFVCTElDPSQoZWNobyAiJHtDT05GSUdfSlNPTn0iIHwganEgLXIgJy5leHBvcnRfZGNnbV90b19wdWJsaWMgLy8gZW1wdHknKQoKICAgICAgRENHTV9FWFBPUlRFUl9JTUFHRT0iJFJFR0lTVFJZX1VSSV9QQVRIL252aWRpYS9rOHMvZGNnbS1leHBvcnRlciIKICAgICAgRENHTV9FWFBPUlRFUl9WRVJTSU9OPSIzLjIuNS0zLjEuOC11YnVudHUyMi4wNCIKICAgICAgaWYgWyAteiAiJHtEQ0dNX0VYUE9SVF9QVUJMSUN9IiBdIHx8IFsgIiR7RENHTV9FWFBPUlRfUFVCTElDfSIgIT0gInRydWUiIF07IHRoZW4KICAgICAgICBlY2hvICJJbmZvOiBsYXVuY2hpbmcgRENHTSBFeHBvcnRlciB0byBjb2xsZWN0IHZHUFUgbWV0cmljcywgbGlzdGVuaW5nIG9ubHkgb24gbG9jYWxob3N0ICgxMjcuMC4wLjE6OTQwMCkiCiAgICAgICAgZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tY2FwLWFkZCBTWVNfQURNSU4gLXAgMTI3LjAuMC4xOjk0MDA6OTQwMCAkRENHTV9FWFBPUlRFUl9JTUFHRTokRENHTV9FWFBPUlRFUl9WRVJTSU9OCiAgICAgIGVsc2UKICAgICAgICBlY2hvICJJbmZvOiBsYXVuY2hpbmcgRENHTSBFeHBvcnRlciB0byBjb2xsZWN0IHZHUFUgbWV0cmljcywgZXhwb3NlZCBvbiBhbGwgbmV0d29yayBpbnRlcmZhY2VzICgwLjAuMC4wOjk0MDApIgogICAgICAgIGRvY2tlciBydW4gLWQgLS1ncHVzIGFsbCAtLWNhcC1hZGQgU1lTX0FETUlOIC1wIDk0MDA6OTQwMCAkRENHTV9FWFBPUlRFUl9JTUFHRTokRENHTV9FWFBPUlRFUl9WRVJTSU9OCiAgICAgIGZpCiAgICB9
      which corresponds to the following script in plain-text format:
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
          
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          echo "Info: running the DCGM Export container"
          deploy_dcgm_exporter
      
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
      Note: You can also add the instructions for running the DL workload whose GPU performance you want to measure with DCGM Exporter to the cloud-init script, as shown in the sketch after this list.
    • One-line image. Encode it in base64 format.
      docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:ngc_image_tag-ubuntu22.04

      For example, for "dcgm-exporter:3.2.5-3.1.8-ubuntu22.04", provide the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tY2FwLWFkZCBTWVNfQURNSU4gLS1ybSAtcCA5NDAwOjk0MDAgbnZjci5pby9udmlkaWEvazhzL2RjZ20tZXhwb3J0ZXI6My4yLjUtMy4xLjgtdWJ1bnR1MjIuMDQ=

      which corresponds to the following script in plain-text format:

      docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04
  • Enter the installation properties for the vGPU guest driver, such as vgpu-license and nvidia-portal-api-key.
  • Where needed, provide values for the properties that are required in a disconnected environment.
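
As a sketch of the note above, assuming you also want to launch your own workload container next to DCGM Exporter (the image name my-registry/my-dl-workload:latest is a placeholder, not a supported NGC image), the end of the dl_app.sh section in the cloud-init script could look like this:

    echo "Info: running the DCGM Export container"
    deploy_dcgm_exporter

    # Launch the DL workload whose GPU usage DCGM Exporter will measure;
    # the image name below is a placeholder
    echo "Info: running the monitored DL workload container"
    docker run -d --gpus all my-registry/my-dl-workload:latest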

For more information, see OVF Properties of Deep Learning VMs.

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.
  • DCGM Exporter, which you can access at http://dl_vm_ip:9400.

Next, in the deep learning VM, you run a DL workload and visualize the data on another virtual machine by using Prometheus at http://visualization_vm_ip:9090 and Grafana at http://visualization_vm_ip:3000.

Run a DL Workload on the Deep Learning VM

Run the DL workload for which you want to collect vGPU metrics, and export the data to another application for further monitoring and visualization.

  1. Log in to the deep learning VM as vmware over SSH.
  2. Run the container for the DL workload, pulling it from the NVIDIA NGC catalog or from a local container registry.

    For example, run the following command to run the tensorflow-pb24h1:24.03.02-tf2-py3 image from NVIDIA NGC:

    docker run -d --gpus all -p 8888:8888 nvcr.io/nvidia/tensorflow-pb24h1:24.03.02-tf2-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token="$TOKEN" --NotebookApp.allow_origin="*" --notebook-dir=/workspace
  3. Start using the DL workload for AI development.
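
    The command above references $TOKEN. A minimal sketch of setting it in your SSH session before running the container, mirroring the token generation in the cloud-init script, is:

    # Generate a random JupyterLab access token, as the cloud-init script does
    TOKEN=$(python3 -c "import secrets; print(secrets.token_hex(32))")
    echo "JupyterLab token: $TOKEN"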

Install Prometheus and Grafana

You can visualize and monitor the vGPU metrics from the deep learning VM with DCGM Exporter on a virtual machine that runs Prometheus and Grafana.

  1. Create a visualization VM with Docker Community Engine installed.
  2. Connect to the VM over SSH and create a YAML configuration file for Prometheus.
    $ cat > prometheus.yml << EOF
    global:
      scrape_interval: 15s
      external_labels:
        monitor: 'codelab-monitor'
    scrape_configs:
      - job_name: 'dcgm'
        scrape_interval: 5s
        metrics_path: /metrics
        static_configs:
          - targets: ['dl_vm_with_dcgm_exporter_ip:9400']
    EOF
    
  3. Create the data directories.
    $ mkdir grafana_data prometheus_data && chmod 777 grafana_data prometheus_data
    
  4. Create a Docker Compose file to install Prometheus and Grafana.
    $ cat > compose.yaml << EOF
    services:
      prometheus:
        image: prom/prometheus:v2.47.2
        container_name: "prometheus0"
        restart: always
        ports:
          - "9090:9090"
        volumes:
          - "./prometheus.yml:/etc/prometheus/prometheus.yml"
          - "./prometheus_data:/prometheus"
      grafana:
        image: grafana/grafana:10.2.0-ubuntu
        container_name: "grafana0"
        ports:
          - "3000:3000"
        restart: always
        volumes:
          - "./grafana_data:/var/lib/grafana"
    EOF
    
  5. Start the Prometheus and Grafana containers.
    $ sudo docker compose up -d        
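
    To confirm that both containers started, you can check their status and probe Prometheus's built-in readiness endpoint, for example:

    $ sudo docker compose ps
    $ curl -s http://localhost:9090/-/ready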
    

View vGPU Metrics in Prometheus

You can access Prometheus at http://visualization-vm-ip:9090. You can view the following vGPU information in the Prometheus user interface:

Information Section of the user interface
Raw vGPU metrics from the deep learning VM Status > Targets

To view the raw vGPU metrics from the deep learning VM, click the endpoint entry.

Graph expressions
  1. On the main navigation bar, click the Graph tab.
  2. Enter an expression and click Execute.
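
For example, assuming the default DCGM Exporter counters, you can query GPU utilization either in the Graph tab or through the Prometheus HTTP API:

  # Query the DCGM GPU utilization metric through the Prometheus HTTP API
  curl -s 'http://visualization-vm-ip:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'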

For more information about using Prometheus, see the Prometheus documentation.

Visualize Metrics in Grafana

Set Prometheus as a data source for Grafana, and visualize the vGPU metrics from the deep learning VM in a dashboard.

  1. Access Grafana at http://visualization-vm-ip:3000 by using the default user name admin and password admin.
  2. Add Prometheus as the first data source, connecting to visualization-vm-ip on port 9090.
  3. Create a dashboard with the vGPU metrics.
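
If you prefer to script step 2, a sketch that registers the data source through Grafana's HTTP API, using the default admin credentials from step 1, might look like this:

  # Register Prometheus as a Grafana data source through the HTTP API
  curl -s -X POST http://admin:admin@visualization-vm-ip:3000/api/datasources \
    -H 'Content-Type: application/json' \
    -d '{"name":"Prometheus","type":"prometheus","url":"http://visualization-vm-ip:9090","access":"proxy"}'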

For more information about configuring a dashboard by using a Prometheus data source, see the Grafana documentation.

Triton Inference Server

You can use a deep learning VM with a Triton Inference Server to load a model repository and serve inference requests.

For more information, see the Triton Inference Server page.
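
After the server is deployed as described below, a sketch of a readiness check against Triton's standard HTTP endpoint (port 8000 in the scripts that follow) is:

  # Check that Triton Inference Server is ready to serve inference requests
  curl -v http://dl_vm_ip:8000/v2/health/ready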

Table 5. Triton Inference Server Container Image
Component Description
Container image
nvcr.io/nvidia/tritonserver-pb24h1:ngc_image_tag

Example:

nvcr.io/nvidia/tritonserver-pb24h1:24.03.02-py3

For information about the Triton Inference Server container images that are supported for deep learning VMs, see the VMware Deep Learning VM Release Notes.

Required inputs To deploy a Triton Inference Server workload, you must set the OVF properties for the deep learning VM as follows:
  • Use one of the following properties, which are specific to the Triton Inference Server image.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          echo "Info: running the Triton Inference Server container"
          TRITON_IMAGE="$REGISTRY_URI_PATH/nvidia/tritonserver-pb24h1"
          TRITON_VERSION="24.03.02-py3"
          docker run -d --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/vmware/model_repository:/models $TRITON_IMAGE:$TRITON_VERSION tritonserver --model-repository=/models --model-control-mode=poll
          
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }

      For example, for "tritonserver-pb24h1:24.03.02-py3", provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBzZXQgLWV1CiAgICBzb3VyY2UgL29wdC9kbHZtL3V0aWxzLnNoCiAgICB0cmFwICdlcnJvcl9leGl0ICJVbmV4cGVjdGVkIGVycm9yIG9jY3VycyBhdCBkbCB3b3JrbG9hZCInIEVSUgogICAgc2V0X3Byb3h5ICJodHRwIiAiaHR0cHMiICJzb2NrczUiCgogICAgREVGQVVMVF9SRUdfVVJJPSJudmNyLmlvIgogICAgUkVHSVNUUllfVVJJX1BBVEg9JChncmVwIHJlZ2lzdHJ5LXVyaSAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCgogICAgaWYgW1sgLXogIiRSRUdJU1RSWV9VUklfUEFUSCIgXV07IHRoZW4KICAgICAgIyBJZiBSRUdJU1RSWV9VUklfUEFUSCBpcyBudWxsIG9yIGVtcHR5LCB1c2UgdGhlIGRlZmF1bHQgdmFsdWUKICAgICAgUkVHSVNUUllfVVJJX1BBVEg9JERFRkFVTFRfUkVHX1VSSQogICAgICBlY2hvICJSRUdJU1RSWV9VUklfUEFUSCB3YXMgZW1wdHkuIFVzaW5nIGRlZmF1bHQ6ICRSRUdJU1RSWV9VUklfUEFUSCIKICAgIGZpCiAgICAKICAgICMgSWYgUkVHSVNUUllfVVJJX1BBVEggY29udGFpbnMgJy8nLCBleHRyYWN0IHRoZSBVUkkgcGFydAogICAgaWYgW1sgJFJFR0lTVFJZX1VSSV9QQVRIID09ICoiLyIqIF1dOyB0aGVuCiAgICAgIFJFR0lTVFJZX1VSST0kKGVjaG8gIiRSRUdJU1RSWV9VUklfUEFUSCIgfCBjdXQgLWQnLycgLWYxKQogICAgZWxzZQogICAgICBSRUdJU1RSWV9VUkk9JFJFR0lTVFJZX1VSSV9QQVRICiAgICBmaQogIAogICAgUkVHSVNUUllfVVNFUk5BTUU9JChncmVwIHJlZ2lzdHJ5LXVzZXIgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgUkVHSVNUUllfUEFTU1dPUkQ9JChncmVwIHJlZ2lzdHJ5LXBhc3N3ZCAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICBpZiBbWyAtbiAiJFJFR0lTVFJZX1VTRVJOQU1FIiAmJiAtbiAiJFJFR0lTVFJZX1BBU1NXT1JEIiBdXTsgdGhlbgogICAgICBkb2NrZXIgbG9naW4gLXUgJFJFR0lTVFJZX1VTRVJOQU1FIC1wICRSRUdJU1RSWV9QQVNTV09SRCAkUkVHSVNUUllfVVJJCiAgICBlbHNlCiAgICAgIGVjaG8gIldhcm5pbmc6IHRoZSByZWdpc3RyeSdzIHVzZXJuYW1lIGFuZCBwYXNzd29yZCBhcmUgaW52YWxpZCwgU2tpcHBpbmcgRG9ja2VyIGxvZ2luLiIKICAgIGZpCgogICAgZGVwbG95X2RjZ21fZXhwb3J0ZXIKCiAgICBlY2hvICJJbmZvOiBydW5uaW5nIHRoZSBUcml0b24gSW5mZXJlbmNlIFNlcnZlciBjb250YWluZXIiCiAgICBUUklUT05fSU1BR0U9IiRSRUdJU1RSWV9VUklfUEFUSC9udmlkaWEvdHJpdG9uc2VydmVyLXBiMjRoMSIKICAgIFRSSVRPTl9WRVJTSU9OPSIyNC4wMy4wMi1weTMiCiAgICBkb2NrZXIgcnVuIC1kIC0tZ3B1cyBhbGwgLXAgODAwMDo4MDAwIC1wIDgwMDE6ODAwMSAtcCA4MDAyOjgwMDIgLXYgL2hvbWUvdm13YXJlL21vZGVsX3JlcG9zaXRvcnk6L21vZGVscyAkVFJJVE9OX0lNQUdFOiRUUklUT05fVkVSU0lPTiB0cml0b25zZXJ2ZXIgLS1tb2RlbC1yZXBvc2l0b3J5PS9tb2RlbHMgLS1tb2RlbC1jb250cm9sLW1vZGU9cG9sbAogICAgCi0gcGF0aDogL29wdC9kbHZtL3V0aWxzLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBlcnJvcl9leGl0KCkgewogICAgICBlY2hvICJFcnJvcjogJDEiID4mMgogICAgICB2bXRvb2xzZCAtLWNtZCAiaW5mby1zZXQgZ3Vlc3RpbmZvLnZtc2VydmljZS5ib290c3RyYXAuY29uZGl0aW9uIGZhbHNlLCBETFdvcmtsb2FkRmFpbHVyZSwgJDEiCiAgICAgIGV4aXQgMQogICAgfQoKICAgIGNoZWNrX3Byb3RvY29sKCkgewogICAgICBsb2NhbCBwcm94eV91cmw9JDEKICAgICAgc2hpZnQKICAgICAgbG9jYWwgc3VwcG9ydGVkX3Byb3RvY29scz0oIiRAIikKICAgICAgaWYgW1sgLW4gIiR7cHJveHlfdXJsfSIgXV07IHRoZW4KICAgICAgICBsb2NhbCBwcm90b2NvbD0kKGVjaG8gIiR7cHJveHlfdXJsfSIgfCBhd2sgLUYgJzovLycgJ3tpZiAoTkYgPiAxKSBwcmludCAkMTsgZWxzZSBwcmludCAiIn0nKQogICAgICAgIGlmIFsgLXogIiRwcm90b2NvbCIgXTsgdGhlbgogICAgICAgICAgZWNobyAiTm8gc3BlY2lmaWMgcHJvdG9jb2wgcHJvdmlkZWQuIFNraXBwaW5nIHByb3RvY29sIGNoZWNrLiIKICAgICAgICAgIHJldHVybiAwCiAgICAgICAgZmkKICAgICAgICBsb2NhbCBwcm90b2NvbF9pbmNsdWRlZD1mYWxzZQogICAgICAgIGZvciB2YXIgaW4gIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iOyBkbwogICAgICAgICAgaWYgW1sgIiR7cHJvdG9jb2x9IiA9PSAiJHt2YXJ9IiBdXTsgdGhlbgogICAgICAgICAgICBwcm90b2NvbF9pbmNsdWRlZD10cnVlCiAgICAgICAgICAgIGJyZWFrCiAgICAgICAgICBmaQogICAgICAgIGRvbmUKICAgICAgICBpZiBbWyAiJHtwcm90b2NvbF9pbmNsdWRlZH0iID09IGZhbHNlIF1dOyB0aGVuCiAgICAgICAgICBlcnJvcl9leGl0ICJVbnN1c
HBvcnRlZCBwcm90b2NvbDogJHtwcm90b2NvbH0uIFN1cHBvcnRlZCBwcm90b2NvbHMgYXJlOiAke3N1cHBvcnRlZF9wcm90b2NvbHNbKl19IgogICAgICAgIGZpCiAgICAgIGZpCiAgICB9CgogICAgIyAkQDogbGlzdCBvZiBzdXBwb3J0ZWQgcHJvdG9jb2xzCiAgICBzZXRfcHJveHkoKSB7CiAgICAgIGxvY2FsIHN1cHBvcnRlZF9wcm90b2NvbHM9KCIkQCIpCgogICAgICBDT05GSUdfSlNPTl9CQVNFNjQ9JChncmVwICdjb25maWctanNvbicgL29wdC9kbHZtL292Zi1lbnYueG1sIHwgc2VkIC1uICdzLy4qb2U6dmFsdWU9IlwoW14iXSpcKS4qL1wxL3AnKQogICAgICBDT05GSUdfSlNPTj0kKGVjaG8gJHtDT05GSUdfSlNPTl9CQVNFNjR9IHwgYmFzZTY0IC0tZGVjb2RlKQoKICAgICAgSFRUUF9QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBfcHJveHkgLy8gZW1wdHknKQogICAgICBIVFRQU19QUk9YWV9VUkw9JChlY2hvICIke0NPTkZJR19KU09OfSIgfCBqcSAtciAnLmh0dHBzX3Byb3h5IC8vIGVtcHR5JykKICAgICAgaWYgW1sgJD8gLW5lIDAgfHwgKC16ICIke0hUVFBfUFJPWFlfVVJMfSIgJiYgLXogIiR7SFRUUFNfUFJPWFlfVVJMfSIpIF1dOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogVGhlIGNvbmZpZy1qc29uIHdhcyBwYXJzZWQsIGJ1dCBubyBwcm94eSBzZXR0aW5ncyB3ZXJlIGZvdW5kLiIKICAgICAgICByZXR1cm4gMAogICAgICBmaQoKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUF9QUk9YWV9VUkx9IiAiJHtzdXBwb3J0ZWRfcHJvdG9jb2xzW0BdfSIKICAgICAgY2hlY2tfcHJvdG9jb2wgIiR7SFRUUFNfUFJPWFlfVVJMfSIgIiR7c3VwcG9ydGVkX3Byb3RvY29sc1tAXX0iCgogICAgICBpZiAhIGdyZXAgLXEgJ2h0dHBfcHJveHknIC9ldGMvZW52aXJvbm1lbnQ7IHRoZW4KICAgICAgICBzdWRvIGJhc2ggLWMgJ2VjaG8gImV4cG9ydCBodHRwX3Byb3h5PSR7SFRUUF9QUk9YWV9VUkx9CiAgICAgICAgZXhwb3J0IGh0dHBzX3Byb3h5PSR7SFRUUFNfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBIVFRQX1BST1hZPSR7SFRUUF9QUk9YWV9VUkx9CiAgICAgICAgZXhwb3J0IEhUVFBTX1BST1hZPSR7SFRUUFNfUFJPWFlfVVJMfQogICAgICAgIGV4cG9ydCBub19wcm94eT1sb2NhbGhvc3QsMTI3LjAuMC4xIiA+PiAvZXRjL2Vudmlyb25tZW50JwogICAgICAgIHNvdXJjZSAvZXRjL2Vudmlyb25tZW50CiAgICAgIGZpCiAgICAgIAogICAgICAjIENvbmZpZ3VyZSBEb2NrZXIgdG8gdXNlIGEgcHJveHkKICAgICAgc3VkbyBta2RpciAtcCAvZXRjL3N5c3RlbWQvc3lzdGVtL2RvY2tlci5zZXJ2aWNlLmQKICAgICAgc3VkbyBiYXNoIC1jICdlY2hvICJbU2VydmljZV0KICAgICAgRW52aXJvbm1lbnQ9XCJIVFRQX1BST1hZPSR7SFRUUF9QUk9YWV9VUkx9XCIKICAgICAgRW52aXJvbm1lbnQ9XCJIVFRQU19QUk9YWT0ke0hUVFBTX1BST1hZX1VSTH1cIgogICAgICBFbnZpcm9ubWVudD1cIk5PX1BST1hZPWxvY2FsaG9zdCwxMjcuMC4wLjFcIiIgPiAvZXRjL3N5c3RlbWQvc3lzdGVtL2RvY2tlci5zZXJ2aWNlLmQvcHJveHkuY29uZicKICAgICAgc3VkbyBzeXN0ZW1jdGwgZGFlbW9uLXJlbG9hZAogICAgICBzdWRvIHN5c3RlbWN0bCByZXN0YXJ0IGRvY2tlcgoKICAgICAgZWNobyAiSW5mbzogZG9ja2VyIGFuZCBzeXN0ZW0gZW52aXJvbm1lbnQgYXJlIG5vdyBjb25maWd1cmVkIHRvIHVzZSB0aGUgcHJveHkgc2V0dGluZ3MiCiAgICB9CgogICAgZGVwbG95X2RjZ21fZXhwb3J0ZXIoKSB7CiAgICAgIENPTkZJR19KU09OX0JBU0U2ND0kKGdyZXAgJ2NvbmZpZy1qc29uJyAvb3B0L2Rsdm0vb3ZmLWVudi54bWwgfCBzZWQgLW4gJ3MvLipvZTp2YWx1ZT0iXChbXiJdKlwpLiovXDEvcCcpCiAgICAgIENPTkZJR19KU09OPSQoZWNobyAke0NPTkZJR19KU09OX0JBU0U2NH0gfCBiYXNlNjQgLS1kZWNvZGUpCiAgICAgIERDR01fRVhQT1JUX1BVQkxJQz0kKGVjaG8gIiR7Q09ORklHX0pTT059IiB8IGpxIC1yICcuZXhwb3J0X2RjZ21fdG9fcHVibGljIC8vIGVtcHR5JykKCiAgICAgIERDR01fRVhQT1JURVJfSU1BR0U9IiRSRUdJU1RSWV9VUklfUEFUSC9udmlkaWEvazhzL2RjZ20tZXhwb3J0ZXIiCiAgICAgIERDR01fRVhQT1JURVJfVkVSU0lPTj0iMy4yLjUtMy4xLjgtdWJ1bnR1MjIuMDQiCiAgICAgIGlmIFsgLXogIiR7RENHTV9FWFBPUlRfUFVCTElDfSIgXSB8fCBbICIke0RDR01fRVhQT1JUX1BVQkxJQ30iICE9ICJ0cnVlIiBdOyB0aGVuCiAgICAgICAgZWNobyAiSW5mbzogbGF1bmNoaW5nIERDR00gRXhwb3J0ZXIgdG8gY29sbGVjdCB2R1BVIG1ldHJpY3MsIGxpc3RlbmluZyBvbmx5IG9uIGxvY2FsaG9zdCAoMTI3LjAuMC4xOjk0MDApIgogICAgICAgIGRvY2tlciBydW4gLWQgLS1ncHVzIGFsbCAtLWNhcC1hZGQgU1lTX0FETUlOIC1wIDEyNy4wLjAuMTo5NDAwOjk0MDAgJERDR01fRVhQT1JURVJfSU1BR0U6JERDR01fRVhQT1JURVJfVkVSU0lPTgogICAgICBlbHNlCiAgICAgICAgZWNobyAiSW5mbzogbGF1bmNoaW5nIERDR00gRXhwb3J0ZXIgdG8gY29sbGVjdCB2R1BVIG1ldHJpY3MsIGV4cG9zZWQgb24gYWxsIG5ldHdvcmsgaW50ZXJmYWNlcyAoMC4wLjAuMDo5NDAwKSIKICAgICAgICBkb2NrZXIgcnVuIC1kIC0tZ3B1cyBhbGwg
LS1jYXAtYWRkIFNZU19BRE1JTiAtcCA5NDAwOjk0MDAgJERDR01fRVhQT1JURVJfSU1BR0U6JERDR01fRVhQT1JURVJfVkVSU0lPTgogICAgICBmaQogICAgfQ==

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          set -eu
          source /opt/dlvm/utils.sh
          trap 'error_exit "Unexpected error occurs at dl workload"' ERR
          set_proxy "http" "https" "socks5"
      
          DEFAULT_REG_URI="nvcr.io"
          REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
          if [[ -z "$REGISTRY_URI_PATH" ]]; then
            # If REGISTRY_URI_PATH is null or empty, use the default value
            REGISTRY_URI_PATH=$DEFAULT_REG_URI
            echo "REGISTRY_URI_PATH was empty. Using default: $REGISTRY_URI_PATH"
          fi
          
          # If REGISTRY_URI_PATH contains '/', extract the URI part
          if [[ $REGISTRY_URI_PATH == *"/"* ]]; then
            REGISTRY_URI=$(echo "$REGISTRY_URI_PATH" | cut -d'/' -f1)
          else
            REGISTRY_URI=$REGISTRY_URI_PATH
          fi
        
          REGISTRY_USERNAME=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          REGISTRY_PASSWORD=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          if [[ -n "$REGISTRY_USERNAME" && -n "$REGISTRY_PASSWORD" ]]; then
            docker login -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD $REGISTRY_URI
          else
            echo "Warning: the registry's username and password are invalid, Skipping Docker login."
          fi
      
          deploy_dcgm_exporter
      
          echo "Info: running the Triton Inference Server container"
          TRITON_IMAGE="$REGISTRY_URI_PATH/nvidia/tritonserver-pb24h1"
          TRITON_VERSION="24.03.02-py3"
          docker run -d --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/vmware/model_repository:/models $TRITON_IMAGE:$TRITON_VERSION tritonserver --model-repository=/models --model-control-mode=poll
          
      - path: /opt/dlvm/utils.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
            exit 1
          }
      
          check_protocol() {
            local proxy_url=$1
            shift
            local supported_protocols=("$@")
            if [[ -n "${proxy_url}" ]]; then
              local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
              if [ -z "$protocol" ]; then
                echo "No specific protocol provided. Skipping protocol check."
                return 0
              fi
              local protocol_included=false
              for var in "${supported_protocols[@]}"; do
                if [[ "${protocol}" == "${var}" ]]; then
                  protocol_included=true
                  break
                fi
              done
              if [[ "${protocol_included}" == false ]]; then
                error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
              fi
            fi
          }
      
          # $@: list of supported protocols
          set_proxy() {
            local supported_protocols=("$@")
      
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      
            HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
            HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
            if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
              echo "Info: The config-json was parsed, but no proxy settings were found."
              return 0
            fi
      
            check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
            check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"
      
            if ! grep -q 'http_proxy' /etc/environment; then
              sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
              export https_proxy=${HTTPS_PROXY_URL}
              export HTTP_PROXY=${HTTP_PROXY_URL}
              export HTTPS_PROXY=${HTTPS_PROXY_URL}
              export no_proxy=localhost,127.0.0.1" >> /etc/environment'
              source /etc/environment
            fi
            
            # Configure Docker to use a proxy
            sudo mkdir -p /etc/systemd/system/docker.service.d
            sudo bash -c 'echo "[Service]
            Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
            Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
            Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
            sudo systemctl daemon-reload
            sudo systemctl restart docker
      
            echo "Info: docker and system environment are now configured to use the proxy settings"
          }
      
          deploy_dcgm_exporter() {
            CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
            DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')
      
            DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
            DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
            if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
              echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            else
              echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
              docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
            fi
          }
    • One-line image command, encoded in base64 format (see the encoding sketch after this list)
      docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver-pb24h1:ngc_image_tag tritonserver --model-repository=/models --model-control-mode=poll

      For example, for tritonserver:24.03.02-py3, provide the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tcm0gLXA4MDAwOjgwMDAgLXA4MDAxOjgwMDEgLXA4MDAyOjgwMDIgLXYgL2hvbWUvdm13YXJlL21vZGVsX3JlcG9zaXRvcnk6L21vZGVscyBudmNyLmlvL252aWRpYS90cml0b25zZXJ2ZXItcGIyNGgxOjI0LjAzLjAyLXB5MyB0cml0b25zZXJ2ZXIgLS1tb2RlbC1yZXBvc2l0b3J5PS9tb2RlbHMgLS1tb2RlbC1jb250cm9sLW1vZGU9cG9sbA==

      which corresponds to the following script in plain-text format:

      docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver-pb24h1:24.03.02-py3 tritonserver --model-repository=/models --model-control-mode=poll
  • Enter the installation properties for the vGPU guest driver, such as vgpu-license and nvidia-portal-api-key.
  • As needed, provide values for the properties that are required in a disconnected environment.
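
To produce such a base64 value yourself, you can encode the plain-text command or script on any Linux host. A minimal sketch, assuming a hypothetical file oneliner.txt that contains your one-line Docker command or cloud-init script:

    # Encode without line wrapping so the value can be pasted into the OVF property
    base64 -w0 oneliner.txt

    # Optionally verify the round trip before pasting
    base64 -w0 oneliner.txt | base64 --decode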

For more information, see OVF Properties of Deep Learning VMs.

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.
  • The Triton Inference Server container.

    To confirm that the Triton Inference Server container is running, run the commands docker ps -a and docker logs container_id.

The model repository for the Triton Inference Server is in /home/vmware/model_repository. Initially, the model repository is empty, and the initial log of the Triton Inference Server instance shows that no model is loaded.

Creating a Model Repository

To load your model for model inference, complete the following steps:

  1. Create the model repository for your model.

    For more information, see the NVIDIA Triton Inference Server model repository documentation.

  2. Copy the model repository to /home/vmware/model_repository so that the Triton Inference Server can load it. A sketch of the expected repository layout follows this procedure.
    cp -r path_to_your_created_model_repository/* /home/vmware/model_repository/
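
For orientation, a minimal sketch of the layout that the Triton Inference Server expects inside /home/vmware/model_repository, assuming a hypothetical ONNX model named simple_sequence; see the NVIDIA model repository documentation for the authoritative format:

    model_repository/
    └── simple_sequence/
        ├── config.pbtxt        # model configuration: inputs, outputs, backend
        └── 1/                  # numeric version directory
            └── model.onnx      # the model file itself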
    

Sending Model Inference Requests

  1. Verify that the Triton Inference Server is healthy and the models are ready by running this command in the Deep Learning VM console.
    curl -v localhost:8000/v2/health/ready
  2. Send a request to the model by running this command on the Deep Learning VM.
    curl -v localhost:8000/v2/models/simple_sequence

For more information about using the Triton Inference Server, see the NVIDIA Triton Inference Server model repository documentation.
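
Beyond the metadata query above, you can also run an actual inference call through the KServe v2 HTTP API that Triton exposes on port 8000. A hedged sketch, assuming a hypothetical model simple_sequence with a single INT32 input tensor named INPUT0 of shape [1, 16]; the tensor name, datatype, and shape must match your model's config.pbtxt:

    curl -X POST localhost:8000/v2/models/simple_sequence/infer \
      -H 'Content-Type: application/json' \
      -d '{
            "inputs": [
              {"name": "INPUT0", "shape": [1, 16], "datatype": "INT32",
               "data": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}
            ]
          }'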

NVIDIA RAG

You can use a Deep Learning VM to build Retrieval-Augmented Generation (RAG) solutions with a Llama2 model (the version 24.08 example below uses the llama3-8b-instruct model).

For more information, see the documentation for NVIDIA RAG Applications with Docker Compose (requires specific account permissions).

Table 6. NVIDIA RAG Container Image

Component Description
Container images and models docker-compose-nim-ms.yaml and rag-app-multiturn-chatbot/docker-compose.yaml in the NVIDIA sample RAG pipeline

For information about the NVIDIA RAG container applications supported for Deep Learning VMs, see the VMware Deep Learning VM Release Notes.

Required Inputs

To deploy an NVIDIA RAG workload, you must set the OVF properties for the Deep Learning VM as follows:
  • Enter a cloud-init script. Encode it in base64 format.

    For example, for version 24.08 of NVIDIA RAG, provide the following script in plain-text format and encode it in base64:

#cloud-config
write_files:
- path: /opt/dlvm/dl_app.sh
  permissions: '0755'
  content: |
    #!/bin/bash
    set -eu
    source /opt/dlvm/utils.sh
    trap 'error_exit "Unexpected error occurs at dl workload"' ERR
    set_proxy "http" "https"
    
    sudo mkdir -p /opt/data/
    sudo chown vmware:vmware /opt/data
    sudo chmod -R 775 /opt/data
    cd /opt/data/

    cat <<EOF > /opt/data/config.json
    {
      "_comment_1": "This provides default support for RAG v24.08: llama3-8b-instruct model",
      "_comment_2": "Update llm_ms_gpu_id: specifies the GPU device ID to make available to the inference server when using multiple GPU",
      "_comment_3": "Update embedding_ms_gpu_id: specifies the GPU ID used for embedding model processing when using multiple GPU",
      "rag": {
        "org_name": "nvidia",
        "org_team_name": "aiworkflows",
        "rag_name": "ai-chatbot-docker-workflow",
        "rag_version": "24.08",
        "rag_app": "rag-app-multiturn-chatbot",
        "nim_model_profile": "auto",
        "llm_ms_gpu_id": "0",
        "embedding_ms_gpu_id": "0",
        "model_directory": "model-cache",
        "ngc_cli_version": "3.41.2"
      }
    }
    EOF

    CONFIG_JSON=$(cat "/opt/data/config.json")
    required_vars=("ORG_NAME" "ORG_TEAM_NAME" "RAG_NAME" "RAG_VERSION" "RAG_APP" "NIM_MODEL_PROFILE" "LLM_MS_GPU_ID" "EMBEDDING_MS_GPU_ID" "MODEL_DIRECTORY" "NGC_CLI_VERSION")

    # Extract rag values from /opt/data/config.json
    for index in "${!required_vars[@]}"; do
      key="${required_vars[$index]}"
      jq_query=".rag.${key,,} | select (.!=null)"
      value=$(echo "${CONFIG_JSON}" | jq -r "${jq_query}")
      if [[ -z "${value}" ]]; then 
        error_exit "${key} is required but not set."
      else
        eval ${key}=\""${value}"\"
      fi
    done

    # Read parameters from config-json to connect DSM PGVector on RAG
    CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
    CONFIG_JSON_PGVECTOR=$(echo "${CONFIG_JSON_BASE64}" | base64 -d)
    PGVECTOR_VALUE=$(echo ${CONFIG_JSON_PGVECTOR} | jq -r '.rag.pgvector')
    if [[ -n "${PGVECTOR_VALUE}" && "${PGVECTOR_VALUE}" != "null" ]]; then
      echo "Info: extract DSM PGVector parameters from config-json in XML"
      POSTGRES_USER=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $4}')
      POSTGRES_PASSWORD=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $5}')
      POSTGRES_HOST_IP=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $6}')
      POSTGRES_PORT_NUMBER=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $7}')
      POSTGRES_DB=$(echo ${PGVECTOR_VALUE} | awk -F[:@/] '{print $8}')

      for var in POSTGRES_USER POSTGRES_PASSWORD POSTGRES_HOST_IP POSTGRES_PORT_NUMBER POSTGRES_DB; do
        if [ -z "${!var}" ]; then
          error_exit "${var} is not set."
        fi
      done
    fi

    gpu_info=$(nvidia-smi -L)
    echo "Info: the detected GPU info, $gpu_info"
    if [[ ${NIM_MODEL_PROFILE} == "auto" ]]; then 
      case "${gpu_info}" in
        *A100*)
          NIM_MODEL_PROFILE="751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c"
          echo "Info: GPU type A100 detected. Setting tensorrt_llm-A100-fp16-tp1-throughput as the default NIM model profile."
          ;;
        *H100*)
          NIM_MODEL_PROFILE="cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1"
          echo "Info: GPU type H100 detected. Setting tensorrt_llm-H100-fp16-tp1-throughput as the default NIM model profile."
          ;;
        *L40S*)
          NIM_MODEL_PROFILE="d8dd8af82e0035d7ca50b994d85a3740dbd84ddb4ed330e30c509e041ba79f80"
          echo "Info: GPU type L40S detected. Setting tensorrt_llm-L40S-fp16-tp1-throughput as the default NIM model profile."
          ;;
        *)
          NIM_MODEL_PROFILE="8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d"
          echo "Info: No supported GPU type detected (A100, H100, L40S). Setting vllm as the default NIM model profile."
          ;;
      esac
    else
      echo "Info: using the NIM model profile provided by the user, $NIM_MODEL_PROFILE"
    fi

    RAG_URI="${ORG_NAME}/${ORG_TEAM_NAME}/${RAG_NAME}:${RAG_VERSION}"
    RAG_FOLDER="${RAG_NAME}_v${RAG_VERSION}"
    NGC_CLI_URL="https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/${NGC_CLI_VERSION}/files/ngccli_linux.zip"

    if [ ! -f .initialize ]; then
      # clean up
      rm -rf compose.env ngc* ${RAG_NAME}* ${MODEL_DIRECTORY}* .initialize

      # install ngc-cli
      wget --content-disposition ${NGC_CLI_URL} -O ngccli_linux.zip && unzip -q ngccli_linux.zip
      export PATH=`pwd`/ngc-cli:${PATH}

      APIKEY=""
      DEFAULT_REG_URI="nvcr.io"

      REGISTRY_URI_PATH=$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      if [[ -z "${REGISTRY_URI_PATH}" ]]; then
        REGISTRY_URI_PATH=${DEFAULT_REG_URI}
        echo "Info: registry uri was empty. Using default: ${REGISTRY_URI_PATH}"
      fi

      if [[ "$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')" == *"${DEFAULT_REG_URI}"* ]]; then
        APIKEY=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      fi

      if [ -z "${APIKEY}" ]; then
          error_exit "No APIKEY found"
      fi

      # config ngc-cli
      mkdir -p ~/.ngc

      cat << EOF > ~/.ngc/config
      [CURRENT]
      apikey = ${APIKEY}
      format_type = ascii
      org = ${ORG_NAME}
      team = ${ORG_TEAM_NAME}
      ace = no-ace
    EOF
      
      # Extract registry URI if path contains '/'
      if [[ ${REGISTRY_URI_PATH} == *"/"* ]]; then
        REGISTRY_URI=$(echo "${REGISTRY_URI_PATH}" | cut -d'/' -f1)
      else
        REGISTRY_URI=${REGISTRY_URI_PATH}
      fi

      REGISTRY_USER=$(grep registry-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')

      # Docker login if credentials are provided
      if [[ -n "${REGISTRY_USER}" && -n "${APIKEY}" ]]; then
        docker login -u ${REGISTRY_USER} -p ${APIKEY} ${REGISTRY_URI}
      else
        echo "Warning: the ${REGISTRY_URI} registry's username and password are invalid, Skipping Docker login."
      fi

      # DockerHub login for general components
      DOCKERHUB_URI=$(grep registry-2-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      DOCKERHUB_USERNAME=$(grep registry-2-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      DOCKERHUB_PASSWORD=$(grep registry-2-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')

      DOCKERHUB_URI=${DOCKERHUB_URI:-docker.io}
      if [[ -n "${DOCKERHUB_USERNAME}" && -n "${DOCKERHUB_PASSWORD}" ]]; then
        docker login -u ${DOCKERHUB_USERNAME} -p ${DOCKERHUB_PASSWORD} ${DOCKERHUB_URI}
      else
        echo "Warning: ${DOCKERHUB_URI} not logged in"
      fi

      # Download RAG files
      ngc registry resource download-version ${RAG_URI}

      mkdir -p /opt/data/${MODEL_DIRECTORY}

      # Update the docker-compose YAML files to correct the issue with GPU free/non-free status reporting
      /usr/bin/python3 -c "import yaml, json, sys; print(json.dumps(yaml.safe_load(sys.stdin.read())))" < "${RAG_FOLDER}/docker-compose-nim-ms.yaml"> docker-compose-nim-ms.json
      jq --arg profile "${NIM_MODEL_PROFILE}" \
         '.services."nemollm-inference".environment.NIM_MANIFEST_ALLOW_UNSAFE = "1" |
          .services."nemollm-inference".environment.NIM_MODEL_PROFILE = $profile |
          .services."nemollm-inference".deploy.resources.reservations.devices[0].device_ids = ["${LLM_MS_GPU_ID:-0}"] |
          del(.services."nemollm-inference".deploy.resources.reservations.devices[0].count)' docker-compose-nim-ms.json > temp.json && mv temp.json docker-compose-nim-ms.json
      /usr/bin/python3 -c "import yaml, json, sys; print(yaml.safe_dump(json.load(sys.stdin), default_flow_style=False, sort_keys=False))" < docker-compose-nim-ms.json > "${RAG_FOLDER}/docker-compose-nim-ms.yaml"
      rm -rf docker-compose-nim-ms.json

      # Update docker-compose YAML files to config PGVector as the default databse
      /usr/bin/python3 -c "import yaml, json, sys; print(json.dumps(yaml.safe_load(sys.stdin.read())))" < "${RAG_FOLDER}/${RAG_APP}/docker-compose.yaml"> rag-app-multiturn-chatbot.json
      jq '.services."chain-server".environment.APP_VECTORSTORE_NAME = "pgvector" |
         .services."chain-server".environment.APP_VECTORSTORE_URL = "${POSTGRES_HOST_IP:-pgvector}:${POSTGRES_PORT_NUMBER:-5432}" |
         .services."chain-server".environment.POSTGRES_PASSWORD = "${POSTGRES_PASSWORD:-password}" |
         .services."chain-server".environment.POSTGRES_USER = "${POSTGRES_USER:-postgres}" |
         .services."chain-server".environment.POSTGRES_DB = "${POSTGRES_DB:-api}"' rag-app-multiturn-chatbot.json > temp.json && mv temp.json rag-app-multiturn-chatbot.json
      /usr/bin/python3 -c "import yaml, json, sys; print(yaml.safe_dump(json.load(sys.stdin), default_flow_style=False, sort_keys=False))" < rag-app-multiturn-chatbot.json > "${RAG_FOLDER}/${RAG_APP}/docker-compose.yaml"
      rm -rf rag-app-multiturn-chatbot.json

      # config compose.env
      cat << EOF > compose.env
      export MODEL_DIRECTORY="/opt/data/${MODEL_DIRECTORY}"
      export NGC_API_KEY=${APIKEY}
      export USERID=$(id -u)
      export LLM_MS_GPU_ID=${LLM_MS_GPU_ID}
      export EMBEDDING_MS_GPU_ID=${EMBEDDING_MS_GPU_ID}
    EOF

      if [[ -n "${PGVECTOR_VALUE}" && "${PGVECTOR_VALUE}" != "null" ]]; then 
        cat << EOF >> compose.env
        export POSTGRES_HOST_IP="${POSTGRES_HOST_IP}"
        export POSTGRES_PORT_NUMBER="${POSTGRES_PORT_NUMBER}"
        export POSTGRES_PASSWORD="${POSTGRES_PASSWORD}"
        export POSTGRES_USER="${POSTGRES_USER}"
        export POSTGRES_DB="${POSTGRES_DB}"
    EOF
      fi
    
      touch .initialize

      deploy_dcgm_exporter
    fi

    # start NGC RAG
    echo "Info: running the RAG application"
    source compose.env
    if [ -z "${PGVECTOR_VALUE}" ] || [ "${PGVECTOR_VALUE}" = "null" ]; then 
      echo "Info: running the pgvector container as the Vector Database"
      docker compose -f ${RAG_FOLDER}/${RAG_APP}/docker-compose.yaml --profile local-nim --profile pgvector up -d
    else
      echo "Info: using the provided DSM PGVector as the Vector Database"
      docker compose -f ${RAG_FOLDER}/${RAG_APP}/docker-compose.yaml --profile local-nim up -d
    fi
    
- path: /opt/dlvm/utils.sh
  permissions: '0755'
  content: |
    #!/bin/bash
    error_exit() {
      echo "Error: $1" >&2
      vmtoolsd --cmd "info-set guestinfo.vmservice.bootstrap.condition false, DLWorkloadFailure, $1"
      exit 1
    }

    check_protocol() {
      local proxy_url=$1
      shift
      local supported_protocols=("$@")
      if [[ -n "${proxy_url}" ]]; then
        local protocol=$(echo "${proxy_url}" | awk -F '://' '{if (NF > 1) print $1; else print ""}')
        if [ -z "$protocol" ]; then
          echo "No specific protocol provided. Skipping protocol check."
          return 0
        fi
        local protocol_included=false
        for var in "${supported_protocols[@]}"; do
          if [[ "${protocol}" == "${var}" ]]; then
            protocol_included=true
            break
          fi
        done
        if [[ "${protocol_included}" == false ]]; then
          error_exit "Unsupported protocol: ${protocol}. Supported protocols are: ${supported_protocols[*]}"
        fi
      fi
    }

    # $@: list of supported protocols
    set_proxy() {
      local supported_protocols=("$@")

      CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)

      HTTP_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.http_proxy // empty')
      HTTPS_PROXY_URL=$(echo "${CONFIG_JSON}" | jq -r '.https_proxy // empty')
      if [[ $? -ne 0 || (-z "${HTTP_PROXY_URL}" && -z "${HTTPS_PROXY_URL}") ]]; then
        echo "Info: The config-json was parsed, but no proxy settings were found."
        return 0
      fi

      check_protocol "${HTTP_PROXY_URL}" "${supported_protocols[@]}"
      check_protocol "${HTTPS_PROXY_URL}" "${supported_protocols[@]}"

      if ! grep -q 'http_proxy' /etc/environment; then
        sudo bash -c 'echo "export http_proxy=${HTTP_PROXY_URL}
        export https_proxy=${HTTPS_PROXY_URL}
        export HTTP_PROXY=${HTTP_PROXY_URL}
        export HTTPS_PROXY=${HTTPS_PROXY_URL}
        export no_proxy=localhost,127.0.0.1" >> /etc/environment'
        source /etc/environment
      fi
      
      # Configure Docker to use a proxy
      sudo mkdir -p /etc/systemd/system/docker.service.d
      sudo bash -c 'echo "[Service]
      Environment=\"HTTP_PROXY=${HTTP_PROXY_URL}\"
      Environment=\"HTTPS_PROXY=${HTTPS_PROXY_URL}\"
      Environment=\"NO_PROXY=localhost,127.0.0.1\"" > /etc/systemd/system/docker.service.d/proxy.conf'
      sudo systemctl daemon-reload
      sudo systemctl restart docker

      echo "Info: docker and system environment are now configured to use the proxy settings"
    }

    deploy_dcgm_exporter() {
      CONFIG_JSON_BASE64=$(grep 'config-json' /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      CONFIG_JSON=$(echo ${CONFIG_JSON_BASE64} | base64 --decode)
      DCGM_EXPORT_PUBLIC=$(echo "${CONFIG_JSON}" | jq -r '.export_dcgm_to_public // empty')

      DCGM_EXPORTER_IMAGE="$REGISTRY_URI_PATH/nvidia/k8s/dcgm-exporter"
      DCGM_EXPORTER_VERSION="3.2.5-3.1.8-ubuntu22.04"
      if [ -z "${DCGM_EXPORT_PUBLIC}" ] || [ "${DCGM_EXPORT_PUBLIC}" != "true" ]; then
        echo "Info: launching DCGM Exporter to collect vGPU metrics, listening only on localhost (127.0.0.1:9400)"
        docker run -d --gpus all --cap-add SYS_ADMIN -p 127.0.0.1:9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
      else
        echo "Info: launching DCGM Exporter to collect vGPU metrics, exposed on all network interfaces (0.0.0.0:9400)"
        docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 $DCGM_EXPORTER_IMAGE:$DCGM_EXPORTER_VERSION
      fi
    }

  • Enter the installation properties for the vGPU guest driver, such as vgpu-license and nvidia-portal-api-key.
  • As needed, provide values for the properties that are required in a disconnected environment.

For more information, see OVF Properties of Deep Learning VMs.
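
If you want the RAG chain server to use an external DSM PGVector database instead of the local pgvector container, the cloud-init script above reads a PostgreSQL connection string from the rag.pgvector key of the config-json OVF property. A hedged sketch of such a value before base64 encoding; all connection details are hypothetical:

    {
      "rag": {
        "pgvector": "postgresql://pgadmin:my_password@10.0.0.5:5432/ragdb"
      }
    }

The script splits this string on ':', '@', and '/' to derive POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_HOST_IP, POSTGRES_PORT_NUMBER, and POSTGRES_DB, so the value must follow exactly the postgresql://user:password@host:port/database shape.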

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.

    To track the deployment progress, run tail -f /var/log/dl.log.

  • A sample chatbot web application that you can access at http://dl_vm_ip:3001.

    You can upload your own knowledge base.
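
A few hedged checks that you can run inside the VM to verify the deployment; container names and startup times vary with the RAG release:

    # List the containers started by Docker Compose
    docker ps

    # DCGM Exporter publishes vGPU metrics in Prometheus format on port 9400
    curl -s localhost:9400/metrics | head

    # The chatbot web application answers on port 3001 once all services are up
    curl -s -o /dev/null -w '%{http_code}\n' localhost:3001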

Assigning a Static IP Address to a Deep Learning VM in VMware Private AI Foundation with NVIDIA

By default, the Deep Learning VM images are configured for address assignment by using DHCP. If you want to deploy a Deep Learning VM with a static IP address directly on a vSphere cluster, you must add extra code to the cloud-init section.

On vSphere with Tanzu, IP address assignment is determined by the network configuration for the Supervisor in NSX.

Procedure

  1. Create a cloud-init script in plain-text format for the DL workload that you want to use.
  2. Add the following extra code to the cloud-init script.
    #cloud-config
    <instructions_for_your_DL_workload>
    
    manage_etc_hosts: true
     
    write_files:
      - path: /etc/netplan/50-cloud-init.yaml
        permissions: '0600'
        content: |
          network:
            version: 2
            renderer: networkd
            ethernets:
              ens33:
                dhcp4: false # disable DHCP4
                addresses: [x.x.x.x/x]  # Set the static IP address and mask
                routes:
                    - to: default
                      via: x.x.x.x # Configure gateway
                nameservers:
              addresses: [x.x.x.x, x.x.x.x] # Provide the DNS server address. Separate multiple DNS server addresses with commas.
     
    runcmd:
      - netplan apply
  3. Encode the resulting cloud-init script in base64 format. A sketch follows this procedure.
  4. Set the resulting base64-encoded cloud-init script as the value of the user-data OVF parameter of the Deep Learning VM image.
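
For steps 3 and 4, a minimal sketch using standard Linux tooling; static-ip-cloud-init.txt is a hypothetical file name for the script from step 2:

    # Encode the script without line wrapping
    base64 -w0 static-ip-cloud-init.txt > user-data.b64

    # Verify that the encoded value decodes back to the original script
    base64 --decode user-data.b64 | diff - static-ip-cloud-init.txt

Paste the contents of user-data.b64 into the user-data OVF property.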

Example: Assigning a Static IP Address to a CUDA Sample Workload

For an example Deep Learning VM with a CUDA sample DL workload:

Deep Learning VM Element Example Value
DL workload image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
IP address 10.199.118.245
Subnet prefix /25
Gateway 10.199.118.253
DNS servers
  • 10.142.7.1
  • 10.132.7.1

Provide the following cloud-init code:

I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBkb2NrZXIgcnVuIC1kIG52Y3IuaW8vbnZpZGlhL2s4cy9jdWRhLXNhbXBsZTp2ZWN0b3JhZGQtY3VkYTExLjcuMS11Ymk4CgptYW5hZ2VfZXRjX2hvc3RzOiB0cnVlCiAKd3JpdGVfZmlsZXM6CiAgLSBwYXRoOiAvZXRjL25ldHBsYW4vNTAtY2xvdWQtaW5pdC55YW1sCiAgICBwZXJtaXNzaW9uczogJzA2MDAnCiAgICBjb250ZW50OiB8CiAgICAgIG5ldHdvcms6CiAgICAgICAgdmVyc2lvbjogMgogICAgICAgIHJlbmRlcmVyOiBuZXR3b3JrZAogICAgICAgIGV0aGVybmV0czoKICAgICAgICAgIGVuczMzOgogICAgICAgICAgICBkaGNwNDogZmFsc2UgIyBkaXNhYmxlIERIQ1A0CiAgICAgICAgICAgIGFkZHJlc3NlczogWzEwLjE5OS4xMTguMjQ1LzI1XSAgIyBTZXQgdGhlIHN0YXRpYyBJUCBhZGRyZXNzIGFuZCBtYXNrCiAgICAgICAgICAgIHJvdXRlczoKICAgICAgICAgICAgICAgIC0gdG86IGRlZmF1bHQKICAgICAgICAgICAgICAgICAgdmlhOiAxMC4xOTkuMTE4LjI1MyAjIENvbmZpZ3VyZSBnYXRld2F5CiAgICAgICAgICAgIG5hbWVzZXJ2ZXJzOgogICAgICAgICAgICAgIGFkZHJlc3NlczogWzEwLjE0Mi43LjEsIDEwLjEzMi43LjFdICMgUHJvdmlkZSB0aGUgRE5TIHNlcnZlciBhZGRyZXNzLiBTZXBhcmF0ZSBtdWxpdHBsZSBETlMgc2VydmVyIGFkZHJlc3NlcyB3aXRoIGNvbW1hcy4KIApydW5jbWQ6CiAgLSBuZXRwbGFuIGFwcGx5

which corresponds to the following script in plain-text format:

#cloud-config
write_files:
- path: /opt/dlvm/dl_app.sh
  permissions: '0755'
  content: |
    #!/bin/bash
    docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8

manage_etc_hosts: true
 
write_files:
  - path: /etc/netplan/50-cloud-init.yaml
    permissions: '0600'
    content: |
      network:
        version: 2
        renderer: networkd
        ethernets:
          ens33:
            dhcp4: false # disable DHCP4
            addresses: [10.199.118.245/25]  # Set the static IP address and mask
            routes:
                - to: default
                  via: 10.199.118.253 # Configure gateway
            nameservers:
              addresses: [10.142.7.1, 10.132.7.1] # Provide the DNS server address. Separate mulitple DNS server addresses with commas.
 
runcmd:
  - netplan apply
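
After the VM powers on, you can confirm that the static configuration was applied by logging in over SSH and inspecting the interface; ens33 is the interface name used in this example:

    ip -4 addr show ens33     # should list 10.199.118.245/25
    ip route show default     # should point to 10.199.118.253
    resolvectl status ens33   # shows the DNS servers on systemd-resolved guests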

Deploying a Deep Learning VM with a Proxy Server

To connect the Deep Learning VM to the Internet in a disconnected environment where Internet access goes through a proxy server, you must provide the proxy server details in the config.json file on the virtual machine.

Procedure

  1. Create a JSON file with the properties for the proxy server.
    Proxy server that does not require authentication
    {  
      "http_proxy": "protocol://ip-address-or-fqdn:port",
      "https_proxy": "protocol://ip-address-or-fqdn:port"
    }
    Proxy server that requires authentication
    {  
      "http_proxy": "protocol://username:password@ip-address-or-fqdn:port",
      "https_proxy": "protocol://username:password@ip-address-or-fqdn:port"
    }

    where:

    • protocol is the communication protocol that the proxy server uses, for example, http or https.
    • username and password are the credentials for authenticating with the proxy server. If the proxy server does not require authentication, skip these parameters.
    • ip-address-or-fqdn is the IP address or host name of the proxy server.
    • port is the port number on which the proxy server listens for incoming requests.
  2. Encode the resulting JSON in base64 format. A sketch of steps 2 and 3 follows this procedure.
  3. When you deploy the Deep Learning VM image, add the encoded value to the config-json OVF property.
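
Putting the three steps together, a hedged end-to-end sketch for an unauthenticated HTTP proxy; the address and port are hypothetical:

    cat > config.json <<'EOF'
    {
      "http_proxy": "http://10.0.0.1:3128",
      "https_proxy": "http://10.0.0.1:3128"
    }
    EOF

    # Encode the file and paste the output into the config-json OVF property
    base64 -w0 config.json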