You can provision a deep learning virtual machine with a supported deep learning (DL) workload in addition to its embedded components. The DL workloads are downloaded from the NVIDIA NGC catalog and are GPU-optimized and validated by NVIDIA and VMware by Broadcom.

For an overview of the deep learning VM images, see About Deep Learning VM Images in VMware Private AI Foundation with NVIDIA.

CUDA Sample

You can use a deep learning VM to run CUDA samples that explore vector addition, gravitational n-body simulation, and other examples. See the CUDA Samples page.

After the deep learning VM is launched, it runs a CUDA sample workload to test the vGPU guest driver. You can examine the test output in the /var/log/dl.log file.
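
For example, after logging in to the VM over SSH, you can inspect the test run with commands like the following, where container_id stands for the ID that docker ps shows for the CUDA sample container:

    # Follow the cloud-init workload log described above
    tail -f /var/log/dl.log

    # Find the CUDA sample container and view its output
    sudo docker ps -a
    sudo docker logs container_id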

Table 1. CUDA Sample Container Image

Component          Description
Container image    nvcr.io/nvidia/k8s/cuda-sample:ngc_image_tag
                   For example: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8

For information on the CUDA Sample container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs
To deploy a CUDA Sample workload, you must set the OVF properties for the deep learning virtual machine in the following way:
  • Use one of the following properties, which are specific to the CUDA Sample image. For one way to produce the base64 value, see the encoding example after this list.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          docker run -d nvcr.io/nvidia/k8s/cuda-sample:ngc_image_tag

      For example, for vectoradd-cuda11.7.1-ubi8, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBkb2NrZXIgcnVuIC1kIG52Y3IuaW8vbnZpZGlhL2s4cy9jdWRhLXNhbXBsZTp2ZWN0b3JhZGQtY3VkYTExLjcuMS11Ymk4

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
      
    • Image one-liner. Encode it in base64 format.
      docker run -d nvcr.io/nvidia/k8s/cuda-sample:ngc_image_tag

      For example, for vectoradd-cuda11.7.1-ubi8, provide the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCBudmNyLmlvL252aWRpYS9rOHMvY3VkYS1zYW1wbGU6dmVjdG9yYWRkLWN1ZGExMS43LjEtdWJpOA==

      which corresponds to the following script in plain-text format:

      docker run -d nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
  • Enter the vGPU guest driver installation properties.
  • Provide values for the properties required for a disconnected environment as needed.

See OVF Properties of Deep Learning VMs.
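
If you prepare the cloud-init script on a Linux workstation, the following minimal sketch shows one way to produce and check the base64 value before you paste it into the OVF property. The file name cloud-init-cuda.yaml is a hypothetical name for the plain-text script shown above.

    # Encode the plain-text cloud-init script as a single base64 line (GNU coreutils)
    base64 -w 0 cloud-init-cuda.yaml > cloud-init-cuda.b64

    # Decode the value again to confirm the round trip
    base64 -d cloud-init-cuda.b64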

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed and the license is allocated, run the following command:

    nvidia-smi -q | grep -i license
  • Cloud-init script logs in /var/log/dl.log.

PyTorch

You can use a deep learning VM with a PyTorch library to explore conversational AI, NLP, and other types of AI models. See the PyTorch page.

After the deep learning VM is launched, it starts a JupyterLab instance with PyTorch packages installed and configured.

Table 2. PyTorch Container Image

Component          Description
Container image    nvcr.io/nvidia/pytorch:ngc_image_tag
                   For example: nvcr.io/nvidia/pytorch:23.10-py3

For information on the PyTorch container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs
To deploy a PyTorch workload, you must set the OVF properties for the deep learning virtual machine in the following way:
  • Use one of the following properties that are specific to the PyTorch image.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
      

      For example, for pytorch:23.10-py3, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBkb2NrZXIgcnVuIC1kIC1wIDg4ODg6ODg4OCBudmNyLmlvL252aWRpYS9weXRvcmNoOjIzLjEwLXB5MyAvdXNyL2xvY2FsL2Jpbi9qdXB5dGVyIGxhYiAtLWFsbG93LXJvb3QgLS1pcD0qIC0tcG9ydD04ODg4IC0tbm8tYnJvd3NlciAtLU5vdGVib29rQXBwLnRva2VuPScnIC0tTm90ZWJvb2tBcHAuYWxsb3dfb3JpZ2luPScqJyAtLW5vdGVib29rLWRpcj0vd29ya3NwYWNl

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch:23.10-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
      
    • Image one-liner. Encode it in base64 format.
      docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace

      For example, for pytorch:23.10-py3, provide the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtcCA4ODg4Ojg4ODggbnZjci5pby9udmlkaWEvcHl0b3JjaDoyMy4xMC1weTMgL3Vzci9sb2NhbC9iaW4vanVweXRlciBsYWIgLS1hbGxvdy1yb290IC0taXA9KiAtLXBvcnQ9ODg4OCAtLW5vLWJyb3dzZXIgLS1Ob3RlYm9va0FwcC50b2tlbj0nJyAtLU5vdGVib29rQXBwLmFsbG93X29yaWdpbj0nKicgLS1ub3RlYm9vay1kaXI9L3dvcmtzcGFjZQ==

      which corresponds to the following script in plain-text format:

      docker run -d -p 8888:8888 nvcr.io/nvidia/pytorch:23.10-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
  • Enter the vGPU guest driver installation properties.
  • Provide values for the properties required for a disconnected environment as needed.

See OVF Properties of Deep Learning VMs.

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.
  • PyTorch container.

    To verify that the PyTorch container is running, run the sudo docker ps -a and sudo docker logs container_id commands.

  • JupyterLab instance that you can access at http://dl_vm_ip:8888.

    In the terminal of JupyterLab, verify that the following functionality is available in the notebook:

    • To verify that JupyterLab can access the vGPU resource, run nvidia-smi.
    • To verify that the PyTorch-related packages are installed, run pip show.
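
    For example, the following commands in the JupyterLab terminal cover both checks. The package name torch is the core PyTorch package; adjust the name for the packages you care about.

      nvidia-smi          # the vGPU must be listed in the output
      pip show torch      # prints the metadata of the installed PyTorch package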

TensorFlow

You can use a deep learning VM with a TensorFlow library to explore conversational AI, NLP, and other types of AI models. See the TensorFlow page.

After the deep learning VM is launched, it starts a JupyterLab instance with TensorFlow packages installed and configured.

Table 3. TensorFlow Container Image

Component          Description
Container image    nvcr.io/nvidia/tensorflow:ngc_image_tag
                   For example: nvcr.io/nvidia/tensorflow:23.10-tf2-py3

For information on the TensorFlow container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs
To deploy a TensorFlow workload, you must set the OVF properties for the deep learning virtual machine in the following way:
  • Use one of the following properties that are specific to the TensorFlow image.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace

      For example, for tensorflow:23.10-tf2-py3, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBkb2NrZXIgcnVuIC1kIC1wIDg4ODg6ODg4OCBudmNyLmlvL252aWRpYS90ZW5zb3JmbG93OjIzLjEwLXRmMi1weTMgL3Vzci9sb2NhbC9iaW4vanVweXRlciBsYWIgLS1hbGxvdy1yb290IC0taXA9KiAtLXBvcnQ9ODg4OCAtLW5vLWJyb3dzZXIgLS1Ob3RlYm9va0FwcC50b2tlbj0nJyAtLU5vdGVib29rQXBwLmFsbG93X29yaWdpbj0nKicgLS1ub3RlYm9vay1kaXI9L3dvcmtzcGFjZQ==

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow:23.10-tf2-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
    • Image one-liner. Encode it in base64 format.
      docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow:ngc_image_tag /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace

      For example, for tensorflow:23.10-tf2-py3, provide the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtcCA4ODg4Ojg4ODggbnZjci5pby9udmlkaWEvdGVuc29yZmxvdzoyMy4xMC10ZjItcHkzIC91c3IvbG9jYWwvYmluL2p1cHl0ZXIgbGFiIC0tYWxsb3ctcm9vdCAtLWlwPSogLS1wb3J0PTg4ODggLS1uby1icm93c2VyIC0tTm90ZWJvb2tBcHAudG9rZW49JycgLS1Ob3RlYm9va0FwcC5hbGxvd19vcmlnaW49JyonIC0tbm90ZWJvb2stZGlyPS93b3Jrc3BhY2U=

      which corresponds to the following script in plain-text format:

      docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow:23.10-tf2-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
  • Enter the vGPU guest driver installation properties.
  • Provide values for the properties required for a disconnected environment as needed.

See OVF Properties of Deep Learning VMs.

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.
  • TensorFlow container.

    To verify that the TensorFlow container is running, run the sudo docker ps -a and sudo docker logs container_id commands.

  • JupyterLab instance that you can access at http://dl_vm_ip:8888.

    In the terminal of JupyterLab, verify that the following functionality is available in the notebook:

    • To verify that JupyterLab can access the vGPU resource, run nvidia-smi.
    • To verify that the TensorFlow-related packages are installed, run pip show.
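
    For example, the following commands in the JupyterLab terminal cover both checks. The package name tensorflow is the core TensorFlow package; adjust the name for the packages you care about.

      nvidia-smi              # the vGPU must be listed in the output
      pip show tensorflow     # prints the metadata of the installed TensorFlow package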

DCGM Exporter

You can use a deep learning VM with a Data Center GPU Manager (DCGM) Exporter to monitor the health of, and collect metrics from, the GPUs used by a DL workload, by using NVIDIA DCGM, Prometheus, and Grafana.

See the DCGM Exporter page.

In a deep learning VM, you run the DCGM Exporter container together with a DL workload that performs AI operations. After the deep learning VM is started, DCGM Exporter is ready to collect vGPU metrics and export the data to another application for further monitoring and visualization. You can run the monitored DL workload as a part of the cloud-init process or from the command line after the virtual machine is started.

Table 4. DCGM Exporter Container Image

Component          Description
Container image    nvcr.io/nvidia/k8s/dcgm-exporter:ngc_image_tag
                   For example: nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04

For information on the DCGM Exporter container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs
To deploy a DCGM Exporter workload, you must set the OVF properties for the deep learning virtual machine in the following way:
  • Use one of the following properties that are specific to the DCGM Exporter image.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:ngc_image_tag-ubuntu22.04
          

      For example, for a deep learning VM with a pre-installed dcgm-exporter:3.2.5-3.1.8-ubuntu22.04 DCGM Exporter instance, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBkb2NrZXIgcnVuIC1kIC0tZ3B1cyBhbGwgLS1jYXAtYWRkIFNZU19BRE1JTiAtLXJtIC1wIDk0MDA6OTQwMCBudmNyLmlvL252aWRpYS9rOHMvZGNnbS1leHBvcnRlcjozLjIuNS0zLjEuOC11YnVudHUyMi4wNA==
      which corresponds to the following script in plain-text format:
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04
      
      Note: You can also add to the cloud-init script the instructions for running the DL workload whose GPU performance you want to measure with DCGM Exporter.
    • Image one-liner. Encode it in base64 format.
      docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:ngc_image_tag-ubuntu22.04

      For example, for dcgm-exporter:3.2.5-3.1.8-ubuntu22.04, provide the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tY2FwLWFkZCBTWVNfQURNSU4gLS1ybSAtcCA5NDAwOjk0MDAgbnZjci5pby9udmlkaWEvazhzL2RjZ20tZXhwb3J0ZXI6My4yLjUtMy4xLjgtdWJ1bnR1MjIuMDQ=

      which corresponds to the following script in plain-text format:

      docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04
  • Enter the vGPU guest driver installation properties.
  • Provide values for the properties required for a disconnected environment as needed.

See OVF Properties of Deep Learning VMs.

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.
  • DCGM Exporter that you can access at http://dl_vm_ip:9400.
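
    For example, you can confirm that metrics are being exported by querying the endpoint listed above. DCGM_FI is the prefix that DCGM Exporter uses for its metric names; the exact set of metrics depends on the DCGM Exporter version.

    curl -s http://dl_vm_ip:9400/metrics | grep -m 10 'DCGM_FI'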

Next, in the deep learning VM, you run a DL workload, and visualize the data on another virtual machine by using Prometheus at http://visualization_vm_ip:9090 and Grafana at http://visualization_vm_ip:3000.

Run a DL Workload on the Deep Learning VM

Run the DL workload you want to collect vGPU metrics for and export the data to another application for further monitoring and visualization.

  1. Log in to the deep learning VM as vmware over SSH.
  2. Add the vmware user account to the docker group by running the following command, and then log out and log back in so that the group change takes effect.
    sudo usermod -aG docker ${USER}
  3. Run the container for the DL workload, pulling it from the NVIDIA NGC catalog or from a local container registry.

    For example, run the following command to run the tensorflow:23.10-tf2-py3 image from NVIDIA NGC:

    docker run -d -p 8888:8888 nvcr.io/nvidia/tensorflow:23.10-tf2-py3 /usr/local/bin/jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/workspace
  4. Start using the DL workload for AI development.

Install Prometheus and Grafana

You can visualize and monitor the vGPU metrics from the DCGM Exporter virtual machine on a virtual machine running Prometheus and Grafana.

  1. Create a visualization VM with Docker Community Engine installed.
  2. Connect to the VM over SSH and create a YAML file for Prometheus.
    $ cat > prometheus.yml << EOF
    global:
      scrape_interval: 15s
      external_labels:
        monitor: 'codelab-monitor'
    scrape_configs:
      - job_name: 'dcgm'
        scrape_interval: 5s
        metrics_path: /metrics
        static_configs:
          - targets: ['dl_vm_with_dcgm_exporter_ip:9400']
    EOF
    
  3. Create a data path.
    $ mkdir grafana_data prometheus_data && chmod 777 grafana_data prometheus_data
    
  4. Create a Docker compose file to install Prometheus and Grafana.
    $ cat > compose.yaml << EOF
    services:
      prometheus:
        image: prom/prometheus:v2.47.2
        container_name: "prometheus0"
        restart: always
        ports:
          - "9090:9090"
        volumes:
          - "./prometheus.yml:/etc/prometheus/prometheus.yml"
          - "./prometheus_data:/prometheus"
      grafana:
        image: grafana/grafana:10.2.0-ubuntu
        container_name: "grafana0"
        ports:
          - "3000:3000"
        restart: always
        volumes:
          - "./grafana_data:/var/lib/grafana"
    EOF
    
  5. Start the Prometheus and Grafana containers.
    $ sudo docker compose up -d        
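    Optionally, you can validate the Prometheus configuration and, after the containers are up, confirm that both services respond. The following is a minimal sketch that assumes prometheus.yml is in the current directory; promtool ships inside the prom/prometheus image, and /-/ready and /api/health are the standard Prometheus and Grafana health endpoints.
    $ sudo docker run --rm --entrypoint promtool \
        -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
        prom/prometheus:v2.47.2 check config /etc/prometheus/prometheus.yml
    $ curl -s http://localhost:9090/-/ready
    $ curl -s http://localhost:3000/api/health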
    

View vGPU Metrics in Prometheus

You can access Prometheus at http://visualization-vm-ip:9090. You can view the following vGPU information in the Prometheus UI:

Information                                   UI Section
Raw vGPU metrics from the deep learning VM    Status > Targets

  To view the raw vGPU metrics from the deep learning VM, click the endpoint entry.

Graph expressions                             Graph
  1. On the main navigation bar, click the Graph tab.
  2. Enter an expression and click Execute.

For more information on using Prometheus, see the Prometheus documentation.
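
For example, you can also query the Prometheus HTTP API directly from the visualization VM to confirm that the DCGM Exporter metrics are being scraped. DCGM_FI_DEV_GPU_UTIL is a commonly exported GPU utilization metric; verify the exact metric names against your DCGM Exporter version.

    curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'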

Visualize Metrics in Grafana

Set Prometheus as a data source for Grafana and visualize the vGPU metrics from the deep learning VM in a dashboard.

  1. Access Grafana at http://visualization-vm-ip:3000 by using the default user name admin and password admin.
  2. Add Prometheus as the first data source, connecting to visualization-vm-ip on port 9090.
  3. Create a dashboard with the vGPU metrics.

For more information on configuring a dashboard using a Prometheus data source, see the Grafana documentation.

Triton Inference Server

You can use a deep learning VM with a Triton Inference Server to load a model repository and receive inference requests.

See the Triton Inference Server page.

Table 5. Triton Inference Server Container Image

Component          Description
Container image    nvcr.io/nvidia/tritonserver:ngc_image_tag
                   For example: nvcr.io/nvidia/tritonserver:23.10-py3

For information on the Triton Inference Server container images that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs
To deploy a Triton Inference Server workload, you must set the OVF properties for the deep learning virtual machine in the following way:
  • Use one of the following properties that are specific to the Triton Inference Server image.
    • Cloud-init script. Encode it in base64 format.
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver:ngc_image_tag tritonserver --model-repository=/models --model-control-mode=poll
      

      For example, for tritonserver:23.10-py3, provide the following script in base64 format:

      I2Nsb3VkLWNvbmZpZwp3cml0ZV9maWxlczoKLSBwYXRoOiAvb3B0L2Rsdm0vZGxfYXBwLnNoCiAgcGVybWlzc2lvbnM6ICcwNzU1JwogIGNvbnRlbnQ6IHwKICAgICMhL2Jpbi9iYXNoCiAgICBkb2NrZXIgcnVuIC1kIC0tZ3B1cyBhbGwgLS1ybSAtcDgwMDA6ODAwMCAtcDgwMDE6ODAwMSAtcDgwMDI6ODAwMiAtdiAvaG9tZS92bXdhcmUvbW9kZWxfcmVwb3NpdG9yeTovbW9kZWxzIG52Y3IuaW8vbnZpZGlhL3RyaXRvbnNlcnZlcjpuZ2NfaW1hZ2VfdGFnIHRyaXRvbnNlcnZlciAtLW1vZGVsLXJlcG9zaXRvcnk9L21vZGVscyAtLW1vZGVsLWNvbnRyb2wtbW9kZT1wb2xs

      which corresponds to the following script in plain-text format:

      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver:23.10-py3 tritonserver --model-repository=/models --model-control-mode=poll
      
    • Image one-liner. Encode it in base64 format.
      docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver:ngc_image_tag tritonserver --model-repository=/models --model-control-mode=poll

      For example, for tritonserver:23.10-py3, provide the following script in base64 format:

      ZG9ja2VyIHJ1biAtZCAtLWdwdXMgYWxsIC0tcm0gLXA4MDAwOjgwMDAgLXA4MDAxOjgwMDEgLXA4MDAyOjgwMDIgLXYgL2hvbWUvdm13YXJlL21vZGVsX3JlcG9zaXRvcnk6L21vZGVscyBudmNyLmlvL252aWRpYS90cml0b25zZXJ2ZXI6MjMuMTAtcHkzIHRyaXRvbnNlcnZlciAtLW1vZGVsLXJlcG9zaXRvcnk9L21vZGVscyAtLW1vZGVsLWNvbnRyb2wtbW9kZT1wb2xs

      which corresponds to the following script in plain-text format:

      docker run -d --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/vmware/model_repository:/models nvcr.io/nvidia/tritonserver:23.10-py3 tritonserver --model-repository=/models --model-control-mode=poll
  • Enter the vGPU guest driver installation properties.
  • Provide values for the properties required for a disconnected environment as needed.

See OVF Properties of Deep Learning VMs.

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.
  • Triton Inference Server container.

    To verify that the Triton Inference Server container is running, run the sudo docker ps -a and sudo docker logs container_id commands.

The model repository for the Triton Inference Server is in /home/vmware/model_repository. Initially, the model repository is empty and the initial log of the Triton Inference Server instance shows that no model is loaded.

Create a Model Repository

To load your model for model inference, perform these steps:

  1. Create the model repository for your model.

    See the NVIDIA Triton Inference Server Model Repository documentation.

  2. Copy the model repository to /home/vmware/model_repository so that the Triton Inference Server can load it.
    sudo cp -r path_to_your_created_model_repository/* /home/vmware/model_repository/
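
    For reference, the repository that you create in step 1 might use the following minimal, hypothetical layout for the simple_sequence model referenced below. The config.pbtxt contents and the model file depend on the backend, for example ONNX, TensorRT, or TensorFlow.

    mkdir -p path_to_your_created_model_repository/simple_sequence/1
    cp your_config.pbtxt path_to_your_created_model_repository/simple_sequence/config.pbtxt
    cp your_model.onnx path_to_your_created_model_repository/simple_sequence/1/model.onnx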
    

Send Model Inference Requests

  1. Verify that the Triton Inference Server is healthy and models are ready by running this command in the deep learning VM console.
    curl -v localhost:8000/v2/simple_sequence
  2. Send a request to the model by running this command on the deep learning VM.
     curl -v localhost:8000/v2/models/simple_sequence
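
In addition, you can check the generic server readiness endpoint that the Triton HTTP/REST API exposes:

    curl -v localhost:8000/v2/health/ready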

For more information on using the Triton Inference Server, see NVIDIA Triton Inference Server Model Repository documentation.

NVIDIA RAG

You can use a deep learning VM to build Retrieval Augmented Generation (RAG) solutions with a Llama2 model.

See the AI Chatbot with Retrieval Augmented Generation documentation.

Table 6. NVIDIA RAG Container Image

Component                      Description
Container images and models   rag-app-text-chatbot.yaml in the NVIDIA sample RAG pipeline

For information on the NVIDIA RAG container applications that are supported for deep learning VMs, see VMware Deep Learning VM Release Notes.

Required inputs
To deploy an NVIDIA RAG workload, you must set the OVF properties for the deep learning virtual machine in the following way:
  • Enter a cloud-init script. Encode it in base64 format.

    For example, for version 24.03 of NVIDIA RAG, provide the following script, encoded in base64 format:

    #cloud-config
    write_files:
    - path: /opt/dlvm/dl_app.sh
      permissions: '0755'
      content: |
        #!/bin/bash
        error_exit() {
          echo "Error: $1" >&2
          exit 1
        }
    
        cat <<EOF > /opt/dlvm/config.json
        {
          "_comment": "This provides default support for RAG: TensorRT inference, llama2-13b model, and H100x2 GPU",
          "rag": {
            "org_name": "cocfwga8jq2c",
            "org_team_name": "no-team",
            "rag_repo_name": "nvidia/paif",
            "llm_repo_name": "nvidia/nim",
            "embed_repo_name": "nvidia/nemo-retriever",
            "rag_name": "rag-docker-compose",
            "rag_version": "24.03",
            "embed_name": "nv-embed-qa",
            "embed_type": "NV-Embed-QA",
            "embed_version": "4",
            "inference_type": "trt",
            "llm_name": "llama2-13b-chat",
            "llm_version": "h100x2_fp16_24.02",
            "num_gpu": "2",
            "hf_token": "huggingface token to pull llm model, update when using vllm inference",
            "hf_repo": "huggingface llm model repository, update when using vllm inference"
          }
        }
        EOF
        CONFIG_JSON=$(cat "/opt/dlvm/config.json")
        INFERENCE_TYPE=$(echo "${CONFIG_JSON}" | jq -r '.rag.inference_type')
        if [ "${INFERENCE_TYPE}" = "trt" ]; then
          required_vars=("ORG_NAME" "ORG_TEAM_NAME" "RAG_REPO_NAME" "LLM_REPO_NAME" "EMBED_REPO_NAME" "RAG_NAME" "RAG_VERSION" "EMBED_NAME" "EMBED_TYPE" "EMBED_VERSION" "LLM_NAME" "LLM_VERSION" "NUM_GPU")
        elif [ "${INFERENCE_TYPE}" = "vllm" ]; then
          required_vars=("ORG_NAME" "ORG_TEAM_NAME" "RAG_REPO_NAME" "LLM_REPO_NAME" "EMBED_REPO_NAME" "RAG_NAME" "RAG_VERSION" "EMBED_NAME" "EMBED_TYPE" "EMBED_VERSION" "LLM_NAME" "NUM_GPU" "HF_TOKEN" "HF_REPO")
        else
          error_exit "Inference type '${INFERENCE_TYPE}' is not recognized. No action will be taken."
        fi
        for index in "${!required_vars[@]}"; do
          key="${required_vars[$index]}"
          jq_query=".rag.${key,,} | select (.!=null)"
          value=$(echo "${CONFIG_JSON}" | jq -r "${jq_query}")
          if [[ -z "${value}" ]]; then 
            error_exit "${key} is required but not set."
          else
            eval ${key}=\""${value}"\"
          fi
        done
    
        RAG_URI="${RAG_REPO_NAME}/${RAG_NAME}:${RAG_VERSION}"
        LLM_MODEL_URI="${LLM_REPO_NAME}/${LLM_NAME}:${LLM_VERSION}"
        EMBED_MODEL_URI="${EMBED_REPO_NAME}/${EMBED_NAME}:${EMBED_VERSION}"
    
        NGC_CLI_VERSION="3.41.2"
        NGC_CLI_URL="https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/${NGC_CLI_VERSION}/files/ngccli_linux.zip"
    
        mkdir -p /opt/data
        cd /opt/data
    
        if [ ! -f .file_downloaded ]; then
          # clean up
          rm -rf compose.env ${RAG_NAME}* ${LLM_NAME}* ngc* ${EMBED_NAME}* *.json .file_downloaded
    
          # install ngc-cli
          wget --content-disposition ${NGC_CLI_URL} -O ngccli_linux.zip && unzip ngccli_linux.zip
          export PATH=`pwd`/ngc-cli:${PATH}
    
          APIKEY=""
          REG_URI="nvcr.io"
    
          if [[ "$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')" == *"${REG_URI}"* ]]; then
            APIKEY=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          fi
    
          if [ -z "${APIKEY}" ]; then
              error_exit "No APIKEY found"
          fi
    
          # config ngc-cli
          mkdir -p ~/.ngc
    
          cat << EOF > ~/.ngc/config
          [CURRENT]
          apikey = ${APIKEY}
          format_type = ascii
          org = ${ORG_NAME}
          team = ${ORG_TEAM_NAME}
          ace = no-ace
        EOF
    
          # ngc docker login
          docker login nvcr.io -u \$oauthtoken -p ${APIKEY}
    
          # dockerhub login for general components, e.g. minio
          DOCKERHUB_URI=$(grep registry-2-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          DOCKERHUB_USERNAME=$(grep registry-2-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
          DOCKERHUB_PASSWORD=$(grep registry-2-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
    
          if [[ -n "${DOCKERHUB_USERNAME}" && -n "${DOCKERHUB_PASSWORD}" ]]; then
            docker login -u ${DOCKERHUB_USERNAME} -p ${DOCKERHUB_PASSWORD}
          else
            echo "Warning: DockerHub not login"
          fi
    
          # get RAG files
          ngc registry resource download-version ${RAG_URI}
    
          # get llm model
          if [ "${INFERENCE_TYPE}" = "trt" ]; then
            ngc registry model download-version ${LLM_MODEL_URI}
            chmod -R o+rX ${LLM_NAME}_v${LLM_VERSION}
            LLM_MODEL_FOLDER="/opt/data/${LLM_NAME}_v${LLM_VERSION}"
          elif [ "${INFERENCE_TYPE}" = "vllm" ]; then
            pip install huggingface_hub
            huggingface-cli login --token ${HF_TOKEN}
            huggingface-cli download --resume-download ${HF_REPO}/${LLM_NAME} --local-dir ${LLM_NAME} --local-dir-use-symlinks False
            LLM_MODEL_FOLDER="/opt/data/${LLM_NAME}"
            cat << EOF > ${LLM_MODEL_FOLDER}/model_config.yaml 
            engine:
              model: /model-store
              enforce_eager: false
              max_context_len_to_capture: 8192
              max_num_seqs: 256
              dtype: float16
              tensor_parallel_size: ${NUM_GPU}
              gpu_memory_utilization: 0.8
        EOF
            chmod -R o+rX ${LLM_MODEL_FOLDER}
            python3 -c "import yaml, json, sys; print(json.dumps(yaml.safe_load(sys.stdin.read())))" < "${RAG_NAME}_v${RAG_VERSION}/rag-app-text-chatbot.yaml"> rag-app-text-chatbot.json
            jq '.services."nemollm-inference".image = "nvcr.io/nvidia/nim/nim_llm:24.02-day0" |
                .services."nemollm-inference".command = "nim_vllm --model_name ${MODEL_NAME} --model_config /model-store/model_config.yaml" |
                .services."nemollm-inference".ports += ["8000:8000"] |
                .services."nemollm-inference".expose += ["8000"]' rag-app-text-chatbot.json > temp.json && mv temp.json rag-app-text-chatbot.json
            python3 -c "import yaml, json, sys; print(yaml.safe_dump(json.load(sys.stdin), default_flow_style=False, sort_keys=False))" < rag-app-text-chatbot.json > "${RAG_NAME}_v${RAG_VERSION}/rag-app-text-chatbot.yaml"
          fi
    
          # get embedding models
          ngc registry model download-version ${EMBED_MODEL_URI}
          chmod -R o+rX ${EMBED_NAME}_v${EMBED_VERSION}
    
          # config compose.env
          cat << EOF > compose.env
          export MODEL_DIRECTORY="${LLM_MODEL_FOLDER}"
          export MODEL_NAME=${LLM_NAME}
          export NUM_GPU=${NUM_GPU}
          export APP_CONFIG_FILE=/dev/null
          export EMBEDDING_MODEL_DIRECTORY="/opt/data/${EMBED_NAME}_v${EMBED_VERSION}"
          export EMBEDDING_MODEL_NAME=${EMBED_TYPE}
          export EMBEDDING_MODEL_CKPT_NAME="${EMBED_TYPE}-${EMBED_VERSION}.nemo"
        EOF
    
          touch .file_downloaded
        fi
    
        # start NGC RAG
        docker compose -f ${RAG_NAME}_v${RAG_VERSION}/docker-compose-vectordb.yaml up -d pgvector
        source compose.env; docker compose -f ${RAG_NAME}_v${RAG_VERSION}/rag-app-text-chatbot.yaml up -d
  • Enter the vGPU guest driver installation properties.
  • Provide values for the properties required for a disconnected environment as needed.

See OVF Properties of Deep Learning VMs.

Output
  • Installation logs for the vGPU guest driver in /var/log/vgpu-install.log.

    To verify that the vGPU guest driver is installed, log in to the VM over SSH and run the nvidia-smi command.

  • Cloud-init script logs in /var/log/dl.log.

    To track deployment progress, run tail -f /var/log/dl.log.

  • Sample chatbot Web application that you can access at http://dl_vm_ip:3001/orgs/nvidia/models/text-qa-chatbot.

    You can upload your own knowledge base.
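
To track how the RAG services come up on the deep learning VM, you can follow the cloud-init log and list the running containers. The container names depend on the RAG pipeline version pulled from NGC.

    tail -f /var/log/dl.log
    sudo docker ps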