Deploying RAG workloads

As a data scientist or a DevOps engineer, you can use Automation Service Broker to deploy NVIDIA RAG workloads.

Note: This documentation is based on VMware Aria Automation 8.18. For information about the VMware Private AI Foundation functionality in VMware Aria Automation 8.18.1, see Deploying RAG Workloads by Using Self-Service Catalog Items in VMware Aria Automation in the VMware Private AI Foundation with NVIDIA documentation.

Deploy a RAG workstation

As a data scientist, you can deploy a GPU-enabled workstation with a Retrieval Augmented Generation (RAG) reference solution from the self-service Automation Service Broker catalog.

The RAG reference solution demonstrates how to find business value in generative AI by augmenting an existing foundational LLM to fit your business use case. It does this with retrieval augmented generation (RAG), which retrieves facts from an enterprise knowledge base that contains your company's business data. Pay special attention to the ways in which you can augment an LLM with your domain-specific business data to create AI applications that are agile and responsive to new developments.
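The retrieve-then-augment idea can be illustrated with a minimal Python sketch. It is not the NVIDIA reference solution: the knowledge base entries, the word-overlap ranking, and the function names `tokenize`, `retrieve`, and `build_prompt` are all assumptions made for illustration. A real deployment would typically use embeddings and a vector database, and would send the resulting prompt to the LLM.

```python
import re
from collections import Counter

# Hypothetical in-memory knowledge base (illustrative data only).
KNOWLEDGE_BASE = [
    "The support desk is open from 9:00 to 17:00 on weekdays.",
    "Expense reports must be submitted within 30 days of purchase.",
    "GPU workstations are requested through the self-service catalog.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question, documents, top_k=1):
    """Rank documents by the number of word tokens they share with the question."""
    question_counts = Counter(tokenize(question))

    def overlap(doc):
        # Counter intersection keeps the minimum count of each shared token.
        return sum((question_counts & Counter(tokenize(doc))).values())

    return sorted(documents, key=overlap, reverse=True)[:top_k]


def build_prompt(question, documents):
    """Augment the question with retrieved facts before it is sent to an LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("When must expense reports be submitted?", KNOWLEDGE_BASE)
print(prompt)
```

The LLM itself is unchanged in this pattern. Only the prompt is enriched with retrieved facts, which is why updating the knowledge base is enough to keep answers current.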

Procedure

On the Catalog page in Automation Service Broker, locate the AI RAG Workstation card and click Request.
Select a project.
Enter a name and description for your deployment.

Configure the RAG workstation parameters.

Setting	Sample value
VM class	`A100 Small - 1 vGPU (16 GB), 8 CPUs and 16 GB Memory`. Minimum VM class specification: CPU: 10 vCPUs; CPU RAM: 64 GB; GPU: 2xH100; GPU memory: 50 GB.
Data disk size	`3 Gi`
User password	Enter a password for the default user. You might be prompted to reset your password when you first log in.
SSH public key	This setting is optional.

Install software customizations.
1. (Optional) If you want to install a custom cloud-init in addition to the cloud-init defined for the RAG software bundle, select the checkbox and paste the contents of the configuration package.
  VMware Aria Automation merges the cloud-init from the RAG software bundle and the custom cloud-init.
2. Provide your NVIDIA NGC Portal access key.
3. Enter Docker Hub credentials.
Click Submit.

Results

Your workstation includes Ubuntu 22.04, an NVIDIA vGPU driver, Docker Engine, NVIDIA Container Toolkit, and a reference RAG solution that uses the Llama-2-13b-chat model.
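After you log in to the workstation, a quick check like the following Python sketch can confirm that the listed components are on the `PATH`. The executable names (`nvidia-smi` for the driver, `docker` for Docker Engine, `nvidia-ctk` for NVIDIA Container Toolkit) are assumptions about how these components are usually exposed. The function name `find_tools` is hypothetical.

```python
import shutil

# Assumed executable names for the components listed in the results.
EXPECTED_TOOLS = {
    "NVIDIA vGPU driver": "nvidia-smi",
    "Docker Engine": "docker",
    "NVIDIA Container Toolkit": "nvidia-ctk",
}


def find_tools(tools=EXPECTED_TOOLS):
    """Map each component to the path of its executable, or None if it is not on PATH."""
    return {component: shutil.which(binary) for component, binary in tools.items()}


if __name__ == "__main__":
    for component, path in find_tools().items():
        print(f"{component}: {path or 'not found on PATH'}")
```

A missing entry only means that the executable was not found on the current `PATH`. It does not by itself prove that the component is absent.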

Deploy a GPU-accelerated Tanzu Kubernetes Grid RAG cluster

As a DevOps engineer, you can use the self-service Automation Service Broker catalog to provision a GPU-enabled Tanzu Kubernetes Grid (TKG) RAG cluster, where worker nodes can run a reference RAG solution that uses the Llama-2-13b-chat model.

The deployment contains a Supervisor namespace and a Tanzu Kubernetes Grid cluster. The TKG cluster contains two Supervisor namespaces – one for the NVIDIA GPU Operator and the other for the NVIDIA RAG LLM Operator, both of which are preinstalled on the TKG cluster. Carvel applications for each operator are deployed inside these two namespaces.

Procedure

On the Catalog page in Automation Service Broker, locate the AI Kubernetes RAG Cluster card and click Request.
Select a project.
Enter a name and description for your deployment.

Select the number of control plane nodes.

Setting	Sample value
Node count	`1`
VM class	`best-effort-2xlarge - 8 CPUs and 64 GB Memory`. The class selection defines the resources available within the virtual machine.

Select the number of worker nodes.

Setting	Sample value
Node count	`3`
VM class	`best-effort-4xlarge-a100-40c - 1 vGPU (40 GB), 16 CPUs and 120 GB Memory`. Minimum VM class specification: CPU: 10 vCPUs; CPU RAM: 64 GB; GPU: 2xH100; GPU memory: 50 GB.
Time-slicing replicas	`1`. Time-slicing defines a set of replicas for a GPU that is shared between workloads.

Provide the NVIDIA AI Enterprise API key.
Click Submit.
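The effect of the worker node count and time-slicing replicas settings can be estimated with simple arithmetic. The sketch below assumes the usual NVIDIA GPU Operator time-slicing behavior, where each GPU is advertised to the scheduler as as many GPU resources as there are replicas. The function name `schedulable_gpu_slots` is hypothetical, and the second call uses an illustrative replica count that is not taken from this procedure.

```python
def schedulable_gpu_slots(worker_nodes, gpus_per_node, time_slicing_replicas):
    """Return how many GPU resources the cluster would advertise to workloads."""
    if min(worker_nodes, gpus_per_node, time_slicing_replicas) < 1:
        raise ValueError("all counts must be at least 1")
    return worker_nodes * gpus_per_node * time_slicing_replicas


# Sample values from the table: 3 worker nodes, 1 vGPU each, 1 replica.
print(schedulable_gpu_slots(3, 1, 1))  # 3

# Illustrative only: 4 replicas per GPU on the same nodes.
print(schedulable_gpu_slots(3, 1, 4))  # 12
```

Replicas created by time-slicing share the memory and compute of the same GPU, so a higher replica count increases how many workloads can be scheduled, not the capacity available to each one.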