This document details the setup of a Jupyter Notebook that ingests text documents into a specialized knowledge base, which an LLM then uses to answer questions about that knowledge domain. This process is called Retrieval-Augmented Generation (RAG).

In this example use case, the knowledge domain for the LLM is based on NASA history books.

Prerequisites

Verify that you have the following:

Procedure

  1. Use a pgvector instance already provisioned by DSM.

    This step only applies if you already deployed a DSM Postgres database by following the section Deploy a Vector Database by Using a Self-Service Catalog Item in VMware Aria Automation for RAG Workloads for Private AI Ready Infrastructure for VMware Cloud Foundation. If you are not using DSM to provision pgvector, proceed to STEP 2.

    1. Take note of the database connection string provided by DSM, which should look like the following example. The values for your environment might differ, so make sure you use the correct ones.
      # DB connection string example
      postgres://pgadmin:[email protected]:5432/paiftest
    2. Split the connection string into the following values, which you will need to modify the default values in the RAG Jupyter notebook.
      # Example values extracted from the previous
      # connection string.
      DB_USER = "pgadmin"
      DB_PASSWD = "cXx27Eb2gy3gGI11pHS54AMwuS9d7R"
      DEFAULT_DB = "paiftest"
      DB_NAME = "paiftest"
      DB_HOST = "10.203.80.135"
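
      If you prefer to extract these values programmatically, the following is a minimal Python sketch that uses only the standard library (the connection string is the example from above; substitute your own):

      # Sketch: split a Postgres connection string into the values
      # the RAG notebook needs (example string from above).
      from urllib.parse import urlparse

      conn_str = "postgres://pgadmin:[email protected]:5432/paiftest"
      parsed = urlparse(conn_str)

      DB_USER = parsed.username          # "pgadmin"
      DB_PASSWD = parsed.password        # "cXx27Eb2gy3gGI11pHS54AMwuS9d7R"
      DB_HOST = parsed.hostname          # "10.203.80.135"
      DB_NAME = parsed.path.lstrip("/")  # "paiftest"
      DEFAULT_DB = DB_NAME               # same database in this example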
  2. Deploy pgvector inside the deep learning VM.

    STEP 2 only applies when NOT using a pgvector instance deployed by DSM.

    1. Log in to the deep learning VM as the VMware user over SSH.
    2. Create the pgvector home directory.
    3. In the pgvector home directory, create an empty Docker Compose file for pgvector.
      # Create and get into the pgvector directory.
      mkdir pgvector; cd pgvector
      
      # Create an empty Docker compose file.
      touch docker-compose.yaml
    4. In the docker-compose.yaml file, paste the following manifest, which configures the PostgreSQL database with pgvector according to the requirements of the RAG workload.
      services:
        db:
          image: pgvector/pgvector:pg12
          ports:
            - '0.0.0.0:5432:5432'
          restart: always
          environment:
            - POSTGRES_DB=postgres
            - POSTGRES_USER=demouser
            - POSTGRES_PASSWORD=demopasswd
          volumes:
            - ./data:/var/lib/postgresql/data
      
    5. To establish communication between the PostgreSQL and PyTorch containers, launch the PostgreSQL container and create a user-defined Docker network.
      # Launch the PostgreSQL container
      sudo docker compose up -d
      
      # Jot down the IDs of the PostgreSQL & PyTorch containers
      sudo docker ps
      
      # Create the user-defined network "my-network"
      sudo docker network create my-network
      
      # Add each container to the network by providing its ID
      # (run this command once per container)
      sudo docker network connect my-network <container ID>
      
      # Confirm both containers are members of the user-defined network:
      sudo docker network inspect my-network | grep Name
      
      # Here is an example of the output you could get:
      #
      #  "Name": "my-network",      # Network name
      #    "Name": "pgvector-db-1", # PostgreSQL 
      #    "Name": "eager_johnson", # PyTorch's random name 
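
      As an optional check, you could confirm that the PyTorch container reaches the database over the new network with a short connection test like the following sketch. It assumes the credentials from the docker-compose.yaml manifest above and the psycopg2-binary package, which is installed later in STEP 5.

      # Sketch: test connectivity from the PyTorch container to pgvector.
      # Credentials come from the docker-compose.yaml manifest above.
      import psycopg2

      conn = psycopg2.connect(
          host="pgvector-db-1",  # PostgreSQL container name on my-network
          port=5432,
          dbname="postgres",
          user="demouser",
          password="demopasswd",
      )
      print("Connected:", conn.get_dsn_parameters()["host"])
      conn.close()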
  3. Start a JupyterLab session.
    1. On your local machine, open an SSH tunnel to a PyTorch-type deep learning VM.

      For example, if your VM has the IP address 10.10.10.10 and the user name vmware, you run a command like the following:

      # Create an SSH tunnel to access a remote JupyterLab session
      # using the "vmware" user name.
      ssh -L 8888:localhost:8888 [email protected]
    2. When prompted, enter the user password.
      The authenticity of host ' (10.10.10.10)' can't be established.
      ED25519 key fingerprint is SHA256:XXXXXXXXX
      This key is not known by any other names.
      Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
      Warning: Permanently added '10.10.10.10' (ED25519) to the list of known hosts.
      [email protected]'s password:
      ======================================
      Welcome to the VMware Deep Learning VM
      ======================================
      
      Resources:
       * VMware support: https://xxxxx
      
      To reinstall Nvidia driver (if needed) run:
      sudo /opt/dlvm/get-vgpu-driver.sh
      
       System information as of Sun Apr 14 03:50:22 PM UTC 2024…
      
    3. In a web browser, enter http://localhost:8888 to start a JupyterLab session.
  4. Download the NASA history books.
    1. In JupyterLab, navigate to File > New > Terminal and open a terminal tab.
    2. Clone the following GitHub repository.
      # Clone the VMware GenAI Reference architecture repository
      git clone \
      https://github.com/vmware-private-ai/VMware-generative-ai-reference-architecture.git
      
    3. Navigate to the folder where the NASA history books knowledge base is stored and run the document download script.
      # Move to the NASA history documents folder
      cd /workspace/VMware-generative-ai-reference-architecture/\
      Starter-Packs/Improved_RAG/02-KB-Documents/
      
      # Download the NASA documents
      ./get_NASA_books.sh
      
  5. Install the required Python packages.
    1. In the JupyterLab terminal, create a requirements.txt file and paste the following list of Python packages.
      vllm==0.4.0.post1
      transformers==4.39.3
      llama-index==0.10.27
      llama-index-agent-openai==0.2.2
      llama-index-cli==0.1.11
      llama-index-core==0.10.27
      llama-index-embeddings-huggingface==0.2.0
      llama-index-indices-managed-llama-cloud==0.1.5
      llama-index-legacy==0.9.48
      llama-index-llms-openai==0.1.14
      llama-index-llms-openai-like==0.1.3
      llama-index-multi-modal-llms-openai==0.1.4
      llama-index-program-openai==0.1.5
      llama-index-question-gen-openai==0.1.3
      llama-index-readers-file==0.1.15
      llama-index-readers-llama-parse==0.1.4
      llama-index-storage-docstore-postgres==0.1.3
      llama-index-storage-index-store-postgres==0.1.3
      llama-index-storage-kvstore-postgres==0.1.2
      llama-index-vector-stores-postgres==0.1.5
      llama-parse==0.4.0
      llamaindex-py-client==0.1.16
      psycopg2-binary==2.9.9
      
    2. Install the Python packages.
      # Install the packages listed in the requirements file
      pip install -r requirements.txt
      
  6. To serve the TheBloke/zephyr-7B-alpha-AWQ LLM with vLLM for the RAG workload, run the following command.
    # Serve the TheBloke/zephyr-7B-alpha-AWQ LLM with vLLM
    python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/zephyr-7B-alpha-AWQ \
    --port 8010 --enforce-eager
    Attention:

    You must keep the terminal tab open for vLLM to keep running.
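
    To verify that the server is up, you can send a request to its OpenAI-compatible completions endpoint, for example with the following sketch (assumes the requests package and port 8010 from the command above; the exact response fields may vary):

    # Sketch: quick check that the vLLM server answers on port 8010.
    import requests

    resp = requests.post(
        "http://localhost:8010/v1/completions",
        json={
            "model": "TheBloke/zephyr-7B-alpha-AWQ",
            "prompt": "Hello",
            "max_tokens": 8,
        },
    )
    print(resp.json())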

  7. Create a RAG Question/Answer system from the NASA history books.
    1. In JupyterLab, click the folder icon in the top left of the screen.
    2. Navigate to the VMware-generative-ai-reference-architecture/Starter-Packs/Improved_RAG/03-Document_ingestion folder and then double-click the Document_ingestion_pipeline.ipynb script.
    3. Modify the original document ingestion script to run on a single A100 40 GB GPU.

      The original script requires over 60 GB of VRAM.

    4. Reduce the VRAM footprint and set the PostgreSQL database name.
      Inside the script, scroll down to the cell below Global Config Setup and replace the following values.
      Table 1.
      
      LLM_MODEL
        Previous value: "HuggingFaceH4/zephyr-7b-alpha"
        New value: "TheBloke/zephyr-7B-alpha-AWQ"
        Description: The AWQ model type is quantized, which requires a smaller VRAM footprint than the original model.
      
      EMB_MODEL
        Previous value: "BAAI/bge-base-en-v1.5"
        New value: "BAAI/bge-small-en-v1.5"
        Description: Requires less memory and fewer processing cycles to produce embeddings. However, it provides less accurate contexts for the LLM.
      
      DEVICE
        Previous value: "cuda:0"
        New value: "cpu"
        Description: Uses CPU RAM when calculating embeddings.
      
      NUM_WORKERS
        Previous value: 4
        New value: 2
        Description: Reduces the number of concurrent processes allocating VRAM.
      
      DB_HOST
        Previous value: "localhost"
        New value: "pgvector-db-1"
        Description: ONLY applies when using the local pgvector instance created in STEP 2. Enables Python to open connections to the pgvector store.
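
      Taken together, the updated cell would look similar to the following sketch (only the parameters from Table 1 are shown; the rest of the Global Config Setup cell stays unchanged):

      # Sketch of the updated values in the Global Config Setup cell.
      LLM_MODEL = "TheBloke/zephyr-7B-alpha-AWQ"  # quantized, smaller VRAM footprint
      EMB_MODEL = "BAAI/bge-small-en-v1.5"        # lighter embedding model
      DEVICE = "cpu"                              # compute embeddings in CPU RAM
      NUM_WORKERS = 2                             # fewer concurrent VRAM consumers
      DB_HOST = "pgvector-db-1"                   # only for the local pgvector from STEP 2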

    5. This step ONLY applies if you are using pgvector deployed by DSM. Replace the following default values in the Jupyter notebook with the values extracted from the connection string in STEP 1.

      Variable      Value
      DB_PASSWD     <Check your connection string>
      DEFAULT_DB    "paiftest"
      DB_NAME       "paiftest"
      DB_HOST       <Check your IP address or hostname>
      DB_USER       "pgadmin"

    6. In the Jupyter Notebook toolbar, click the double-arrow icon to run all cells in the script.
    7. When prompted, confirm the kernel restart.
      The notebook starts to execute cell by cell. This process might take a few minutes.
    8. Scroll down to the final cell, which executes a query to the LLM.
      This query is augmented by the context retrieved from the pgvector store. In this example, the LLM is asked to respond to the question, “What are the main Hubble telescope discoveries about exoplanets?”. Verify that the response is similar to the following text:
      The Hubble Space Telescope has revealed exceedingly valuable information about hundreds of other worlds. Using Hubble, astronomers have probed an exoplanet’s atmosphere for the first time more than 20 years ago, and have even identified atmospheres that contain sodium, oxygen, carbon, hydrogen, carbon dioxide, methane, and water vapor. While most of the planets Hubble has studied to date are too hot to host life as we know it, the telescope’s observations demonstrate that the basic organic components for life can be detected and measured on planets orbiting other stars, setting the stage for more detailed studies with future observatories...

Results

You successfully deployed and tested the core components of a RAG workload for Private AI Ready Infrastructure for VMware Cloud Foundation.

What to do next

  • To learn more about the core elements of a RAG workload, explore the contents and output of each cell in the script.
  • To learn more about different RAG approaches and their evaluation, you can deploy a deep learning VM with at least two A100 40 GB GPUs and follow the instructions from the README files inside the rest of the folders within the Improved RAG Starter Pack repository.