You can deploy a deep learning VM with an NVIDIA RAG workload that uses a pgvector PostgreSQL database managed by VMware Data Services Manager.

Prerequisites

Procedure

  1. If you deploy the deep learning VM directly on a vSphere cluster or by using kubectl, create a cloud-init script and deploy the deep learning VM.
    1. Create a cloud-init script for NVIDIA RAG and the pgvector PostgreSQL database that you created.
      You can modify the initial version of the cloud-init script for NVIDIA RAG. For example, for NVIDIA RAG 24.03 and a pgvector PostgreSQL database with the connection details postgres://pgvector_db_admin:encoded_pgvector_db_admin_password@pgvector_db_ip_address:5432/pgvector_db_name, the script is as follows:
      #cloud-config
      write_files:
      - path: /opt/dlvm/dl_app.sh
        permissions: '0755'
        content: |
          #!/bin/bash
          error_exit() {
            echo "Error: $1" >&2
            exit 1
          }
      
          cat <<EOF > /opt/dlvm/config.json
          {
            "_comment": "This provides default support for RAG: TensorRT inference, llama2-13b model, and H100x2 GPU",
            "rag": {
              "org_name": "cocfwga8jq2c",
              "org_team_name": "no-team",
              "rag_repo_name": "nvidia/paif",
              "llm_repo_name": "nvidia/nim",
              "embed_repo_name": "nvidia/nemo-retriever",
              "rag_name": "rag-docker-compose",
              "rag_version": "24.03",
              "embed_name": "nv-embed-qa",
              "embed_type": "NV-Embed-QA",
              "embed_version": "4",
              "inference_type": "trt",
              "llm_name": "llama2-13b-chat",
              "llm_version": "h100x2_fp16_24.02",
              "num_gpu": "2",
              "hf_token": "huggingface token to pull llm model, update when using vllm inference",
              "hf_repo": "huggingface llm model repository, update when using vllm inference"
            }
          }
          EOF
          CONFIG_JSON=$(cat "/opt/dlvm/config.json")
          INFERENCE_TYPE=$(echo "${CONFIG_JSON}" | jq -r '.rag.inference_type')
          if [ "${INFERENCE_TYPE}" = "trt" ]; then
            required_vars=("ORG_NAME" "ORG_TEAM_NAME" "RAG_REPO_NAME" "LLM_REPO_NAME" "EMBED_REPO_NAME" "RAG_NAME" "RAG_VERSION" "EMBED_NAME" "EMBED_TYPE" "EMBED_VERSION" "LLM_NAME" "LLM_VERSION" "NUM_GPU")
          elif [ "${INFERENCE_TYPE}" = "vllm" ]; then
            required_vars=("ORG_NAME" "ORG_TEAM_NAME" "RAG_REPO_NAME" "LLM_REPO_NAME" "EMBED_REPO_NAME" "RAG_NAME" "RAG_VERSION" "EMBED_NAME" "EMBED_TYPE" "EMBED_VERSION" "LLM_NAME" "NUM_GPU" "HF_TOKEN" "HF_REPO")
          else
            error_exit "Inference type '${INFERENCE_TYPE}' is not recognized. No action will be taken."
          fi
          for index in "${!required_vars[@]}"; do
            key="${required_vars[$index]}"
            jq_query=".rag.${key,,} | select (.!=null)"
            value=$(echo "${CONFIG_JSON}" | jq -r "${jq_query}")
            if [[ -z "${value}" ]]; then 
              error_exit "${key} is required but not set."
            else
              eval ${key}=\""${value}"\"
            fi
          done
      
          RAG_URI="${RAG_REPO_NAME}/${RAG_NAME}:${RAG_VERSION}"
          LLM_MODEL_URI="${LLM_REPO_NAME}/${LLM_NAME}:${LLM_VERSION}"
          EMBED_MODEL_URI="${EMBED_REPO_NAME}/${EMBED_NAME}:${EMBED_VERSION}"
      
          NGC_CLI_VERSION="3.41.2"
          NGC_CLI_URL="https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/${NGC_CLI_VERSION}/files/ngccli_linux.zip"
      
          mkdir -p /opt/data
          cd /opt/data
      
          if [ ! -f .file_downloaded ]; then
            # clean up
            rm -rf compose.env ${RAG_NAME}* ${LLM_NAME}* ngc* ${EMBED_NAME}* *.json .file_downloaded
      
            # install ngc-cli
            wget --content-disposition ${NGC_CLI_URL} -O ngccli_linux.zip && unzip ngccli_linux.zip
            export PATH=`pwd`/ngc-cli:${PATH}
      
            APIKEY=""
            REG_URI="nvcr.io"
      
            if [[ "$(grep registry-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')" == *"${REG_URI}"* ]]; then
              APIKEY=$(grep registry-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            fi
      
            if [ -z "${APIKEY}" ]; then
                error_exit "No APIKEY found"
            fi
      
            # config ngc-cli
            mkdir -p ~/.ngc
      
            cat << EOF > ~/.ngc/config
            [CURRENT]
            apikey = ${APIKEY}
            format_type = ascii
            org = ${ORG_NAME}
            team = ${ORG_TEAM_NAME}
            ace = no-ace
          EOF
      
            # ngc docker login
            docker login nvcr.io -u \$oauthtoken -p ${APIKEY}
      
            # dockerhub login for general components, e.g. minio
            DOCKERHUB_URI=$(grep registry-2-uri /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            DOCKERHUB_USERNAME=$(grep registry-2-user /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
            DOCKERHUB_PASSWORD=$(grep registry-2-passwd /opt/dlvm/ovf-env.xml | sed -n 's/.*oe:value="\([^"]*\).*/\1/p')
      
            if [[ -n "${DOCKERHUB_USERNAME}" && -n "${DOCKERHUB_PASSWORD}" ]]; then
              docker login -u "${DOCKERHUB_USERNAME}" -p "${DOCKERHUB_PASSWORD}"
            else
              echo "Warning: not logged in to Docker Hub"
            fi
      
            # get RAG files
            ngc registry resource download-version ${RAG_URI}
      
            # get llm model
            if [ "${INFERENCE_TYPE}" = "trt" ]; then
              ngc registry model download-version ${LLM_MODEL_URI}
              chmod -R o+rX ${LLM_NAME}_v${LLM_VERSION}
              LLM_MODEL_FOLDER="/opt/data/${LLM_NAME}_v${LLM_VERSION}"
            elif [ "${INFERENCE_TYPE}" = "vllm" ]; then
              pip install huggingface_hub
              huggingface-cli login --token ${HF_TOKEN}
              huggingface-cli download --resume-download ${HF_REPO}/${LLM_NAME} --local-dir ${LLM_NAME} --local-dir-use-symlinks False
              LLM_MODEL_FOLDER="/opt/data/${LLM_NAME}"
              cat << EOF > ${LLM_MODEL_FOLDER}/model_config.yaml 
              engine:
                model: /model-store
                enforce_eager: false
                max_context_len_to_capture: 8192
                max_num_seqs: 256
                dtype: float16
                tensor_parallel_size: ${NUM_GPU}
                gpu_memory_utilization: 0.8
          EOF
              chmod -R o+rX ${LLM_MODEL_FOLDER}
              python3 -c "import yaml, json, sys; print(json.dumps(yaml.safe_load(sys.stdin.read())))" < "${RAG_NAME}_v${RAG_VERSION}/rag-app-text-chatbot.yaml"> rag-app-text-chatbot.json
              jq '.services."nemollm-inference".image = "nvcr.io/nvidia/nim/nim_llm:24.02-day0" |
                  .services."nemollm-inference".command = "nim_vllm --model_name ${MODEL_NAME} --model_config /model-store/model_config.yaml" |
                  .services."nemollm-inference".ports += ["8000:8000"] |
                  .services."nemollm-inference".expose += ["8000"]' rag-app-text-chatbot.json > temp.json && mv temp.json rag-app-text-chatbot.json
              python3 -c "import yaml, json, sys; print(yaml.safe_dump(json.load(sys.stdin), default_flow_style=False, sort_keys=False))" < rag-app-text-chatbot.json > "${RAG_NAME}_v${RAG_VERSION}/rag-app-text-chatbot.yaml"
            fi
      
            # get embedding models
            ngc registry model download-version ${EMBED_MODEL_URI}
            chmod -R o+rX ${EMBED_NAME}_v${EMBED_VERSION}
      
            # config compose.env
            cat << EOF > compose.env
            export MODEL_DIRECTORY="${LLM_MODEL_FOLDER}"
            export MODEL_NAME=${LLM_NAME}
            export NUM_GPU=${NUM_GPU}
            export APP_CONFIG_FILE=/dev/null
            export EMBEDDING_MODEL_DIRECTORY="/opt/data/${EMBED_NAME}_v${EMBED_VERSION}"
            export EMBEDDING_MODEL_NAME=${EMBED_TYPE}
            export EMBEDDING_MODEL_CKPT_NAME="${EMBED_TYPE}-${EMBED_VERSION}.nemo"
            export POSTGRES_HOST_IP=pgvector_db_ip_address
            export POSTGRES_PORT_NUMBER=5432
            export POSTGRES_DB=pgvector_db_name
            export POSTGRES_USER=pgvector_db_admin
            export POSTGRES_PASSWORD=encoded_pgvector_db_admin_password
          EOF
      
            touch .file_downloaded
          fi
      
          # start NGC RAG
          docker compose -f ${RAG_NAME}_v${RAG_VERSION}/docker-compose-vectordb.yaml up -d pgvector
          source compose.env; docker compose -f ${RAG_NAME}_v${RAG_VERSION}/rag-app-text-chatbot.yaml up -d
    2. Encode the cloud-init script in base64 format.
      Use a base64 encoding tool, such as https://decode64base.com/, to generate the encoded version of the cloud-init script.
    3. Deploy the deep learning VM, passing the base64 value of the cloud-init script to the user-data input parameter.
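The encoding in step 1.2 can also be done locally instead of with an online tool, which avoids pasting a script that contains the database password into a third-party site. The following is a minimal sketch, assuming the script is saved under the hypothetical file name cloud-init.yaml and that the standard base64 and tr utilities are available. The function name encode_user_data is illustrative and not part of the product.

```shell
#!/bin/bash
# Sketch: encode a cloud-init script as a single-line base64 string.
# encode_user_data and the file names below are illustrative assumptions.

encode_user_data() {
  local file="$1"
  if [ ! -f "${file}" ]; then
    echo "Error: ${file} not found" >&2
    return 1
  fi
  # base64 may wrap its output; strip newlines so the value is one line.
  base64 < "${file}" | tr -d '\n'
}

# Example usage (assumes cloud-init.yaml exists in the current directory):
#   encode_user_data cloud-init.yaml > cloud-init.b64
#   base64 -d < cloud-init.b64 | diff - cloud-init.yaml   # round-trip check
```

The content of cloud-init.b64 is the value to pass to the user-data input parameter in step 1.3.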
  2. If you deploy the deep learning VM by using a catalog item in VMware Aria Automation, provide the details of the pgvector PostgreSQL database after you deploy the virtual machine.
    1. Deploy the deep learning VM from Automation Service Broker.
    2. Navigate to Consume > Deployments > Deployments and locate the deep learning VM deployment.
    3. In the Workstation VM section, save the details for SSH login to the virtual machine.
    4. Log in to the deep learning VM over SSH by using the credentials available in Automation Service Broker.
    5. Add the following pgvector variables to the /opt/data/compose.env file:
      export POSTGRES_HOST_IP=pgvector_db_ip_address
      export POSTGRES_PORT_NUMBER=5432
      export POSTGRES_DB=pgvector_db_name
      export POSTGRES_USER=pgvector_db_admin
      export POSTGRES_PASSWORD=encoded_pgvector_db_admin_password
    6. Restart the NVIDIA RAG multi-container application by running the following commands.
      For example, for NVIDIA RAG 24.03:
      cd /opt/data
      docker compose -f rag-docker-compose_v24.03/rag-app-text-chatbot.yaml down
      docker compose -f rag-docker-compose_v24.03/docker-compose-vectordb.yaml down
      docker compose -f rag-docker-compose_v24.03/docker-compose-vectordb.yaml up -d
      source compose.env; docker compose -f rag-docker-compose_v24.03/rag-app-text-chatbot.yaml up -d
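The five pgvector variables in step 2.5 correspond one-to-one to the parts of the connection string shown in step 1.1. The following is a minimal sketch of that mapping. It assumes that the connection string has exactly the form postgres://USER:PASSWORD@HOST:PORT/DB_NAME, that the host is an IPv4 address or host name, and that the password is already encoded so that it contains no literal ':', '@', or '/' characters. The helper name pg_uri_to_env is hypothetical. It prints export lines in the same style as the compose.env file written by the cloud-init script.

```shell
#!/bin/bash
# Sketch: split a connection string of the form
#   postgres://USER:PASSWORD@HOST:PORT/DB_NAME
# into the POSTGRES_* variables used in compose.env.
# pg_uri_to_env is a hypothetical helper, not part of the DL VM image.

pg_uri_to_env() {
  local uri="$1"
  local rest="${uri#postgres://}"      # USER:PASSWORD@HOST:PORT/DB_NAME

  # Reject strings without the postgres:// scheme or with missing parts.
  if [ "${rest}" = "${uri}" ]; then
    echo "Error: expected a postgres:// connection string" >&2
    return 1
  fi
  case "${rest}" in
    *:*@*:*/*) ;;
    *) echo "Error: expected USER:PASSWORD@HOST:PORT/DB_NAME" >&2; return 1 ;;
  esac

  local creds="${rest%%@*}"            # USER:PASSWORD
  local target="${rest#*@}"            # HOST:PORT/DB_NAME
  local hostport="${target%%/*}"       # HOST:PORT

  # The password is passed through unchanged, exactly as it appears in the URI.
  printf 'export POSTGRES_HOST_IP=%s\n' "${hostport%%:*}"
  printf 'export POSTGRES_PORT_NUMBER=%s\n' "${hostport##*:}"
  printf 'export POSTGRES_DB=%s\n' "${target#*/}"
  printf 'export POSTGRES_USER=%s\n' "${creds%%:*}"
  printf 'export POSTGRES_PASSWORD=%s\n' "${creds#*:}"
}

# Example usage on the deep learning VM:
#   pg_uri_to_env "postgres://user:password@192.0.2.10:5432/db" >> /opt/data/compose.env
```

Appending is sufficient for a file that is read with source, because a later assignment of the same variable overrides an earlier one.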