As a data scientist, you can use Automation Service Broker to deploy deep learning virtual machines for AI development.
When you request an AI workstation (VM) in the Automation Service Broker catalog, you provision a GPU-enabled deep learning VM that can be configured with the desired vCPU count, vGPU profile, memory, and AI/ML NGC containers from NVIDIA.
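Catalog requests can also be scripted against the Automation Service Broker catalog API. The sketch below builds and validates a request body; the input names (vcpuCount, memorySizeGB, vgpuProfile, ngcContainer), the project ID, and the catalog item ID are hypothetical placeholders — check your catalog item's request schema for the real names.

```shell
# Sketch: a scripted catalog request for a deep learning VM.
# All input names and IDs below are placeholders, not the real schema.
cat > dl-vm-request.json <<'EOF'
{
  "deploymentName": "dl-workstation-01",
  "projectId": "YOUR-PROJECT-ID",
  "inputs": {
    "vcpuCount": 8,
    "memorySizeGB": 64,
    "vgpuProfile": "grid_a100-8c",
    "ngcContainer": "nvcr.io/nvidia/pytorch:24.05-py3"
  }
}
EOF

# Validate the JSON before submitting it.
python3 -m json.tool dl-vm-request.json > /dev/null && echo "request body OK"

# Submit through the catalog API (endpoint shown for illustration only):
# curl -X POST "https://$VRA_HOST/catalog/api/items/$ITEM_ID/request" \
#   -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
#   -d @dl-vm-request.json
```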
Deploy a deep learning virtual machine to a VI workload domain
As a data scientist, you can deploy a single GPU software-defined development environment from the self-service Automation Service Broker catalog.
Add DCGM Exporter for DL workload monitoring
You can use DCGM Exporter for monitoring a deep learning workload that uses GPU capacity.
DCGM-Exporter is an exporter for Prometheus that monitors GPU health and collects GPU metrics. It leverages DCGM using Go bindings to collect GPU telemetry and exposes the metrics to Prometheus through an HTTP endpoint (/metrics). DCGM-Exporter can run standalone or be deployed as part of the NVIDIA GPU Operator.
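The /metrics endpoint serves Prometheus text-format output. On a running exporter you would fetch it with curl; the sketch below mimics that output with a local sample so the shape of the data can be shown without a live GPU. The metric names are real DCGM fields, but the label values and readings are illustrative only.

```shell
# On a live deep learning VM you would fetch the endpoint directly:
#   curl -s http://localhost:9400/metrics
# The sample below imitates that output (values are illustrative).
cat > sample-metrics.txt <<'EOF'
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa",device="nvidia0"} 37
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-aaaa",device="nvidia0"} 2048
EOF

# Extract the utilization reading for GPU 0.
grep '^DCGM_FI_DEV_GPU_UTIL{gpu="0"' sample-metrics.txt | awk '{print $2}'
```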
Before you begin
Verify that you have successfully deployed a deep learning VM.
Procedure
- Log in to the deep learning VM over SSH.
For deep learning VMs based on PyTorch or TensorFlow images, log in from the JupyterLab notebook.
- Run the DCGM Exporter container by using the following command.
docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 registry-URI-path/nvidia/k8s/dcgm-exporter:ngc_image_tag
For example, to run dcgm-exporter:3.2.5-3.1.8-ubuntu22.04 from the NVIDIA NGC catalog, run the following command:
docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04
- After the DCGM Exporter installation is complete, visualize vGPU metrics in Prometheus and Grafana.
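To scrape the exporter from Prometheus, you add a scrape job pointing at port 9400 on the deep learning VM. A minimal sketch follows; the job name and target host are placeholders — substitute your VM's address and merge the job into your existing Prometheus configuration.

```shell
# Minimal Prometheus scrape job for the DCGM Exporter started above.
# The target host is a placeholder for the deep learning VM's address.
cat > prometheus-dcgm.yml <<'EOF'
scrape_configs:
  - job_name: dcgm-exporter
    scrape_interval: 15s
    static_configs:
      - targets: ["dl-vm.example.com:9400"]
EOF

grep -q 'dcgm-exporter' prometheus-dcgm.yml && echo "scrape config written"
```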
Deploy a GPU-enabled workstation with NVIDIA Triton Inference Server
As a data scientist, you can deploy a GPU-enabled workstation with NVIDIA Triton Inference Server from the self-service Automation Service Broker catalog.
The deployed workstation includes Ubuntu 22.04, an NVIDIA vGPU driver, Docker Engine, NVIDIA Container Toolkit, and NVIDIA Triton Inference Server.
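Because Docker Engine, the NVIDIA Container Toolkit, and Triton are preinstalled, you can verify the inference server directly on the workstation. The commands below are a sketch only: the image tag and model-repository path are assumptions, so adjust them to what is actually deployed, and note that they require the GPU-enabled VM itself.

```shell
# Sketch: start Triton against a local model repository (path and image
# tag are assumptions -- adjust to your deployment).
docker run -d --gpus all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /opt/models:/models \
  nvcr.io/nvidia/tritonserver:24.05-py3 \
  tritonserver --model-repository=/models

# Triton's HTTP endpoint reports readiness; HTTP 200 means the server is up.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
```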