Designing a system for inference with LLMs requires consideration of the hardware, particularly GPUs and potentially CPUs.

vSphere optimizes the parallel processing architecture of GPUs for AI computing in virtualized environments. Workloads such as deep learning involve intensive mathematical operations on large datasets. In vSphere, you can distribute tasks across multiple GPU cores, relying on parallelization to improve GPU performance and streamline inference. vSphere also optimizes GPU use by supporting direct access to GPU resources from virtual machines through technologies such as NVIDIA vGPU (time-sliced or MIG) or VMDirectPath I/O.
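
As a quick sanity check that a virtual machine actually sees the vGPU or passthrough device assigned to it, you can query the visible devices from inside the guest. The following is a minimal sketch, assuming PyTorch with CUDA support is installed in the guest VM; it is illustrative only and not part of the reference design.

```python
import torch

# Confirm that the vGPU or VMDirectPath I/O device assigned to this VM is visible.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible in this VM - check vGPU/passthrough configuration.")

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # total_memory is reported in bytes; convert to GiB for readability.
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM, "
          f"{props.multi_processor_count} SMs")
```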

Consider these guidelines:

  • Large Memory Capacity: To accommodate the size of LLMs, choose GPUs with large memory capacities. For instance, a model like Mistral-7B served by vLLM might consume almost 37 GB of VRAM. Larger models like Mixtral 8x7B might require 7-8 times more VRAM (see the sizing sketch after this list).
  • BF16 Support: GPUs that support Brain Floating Point 16 (BF16) are recommended because they provide a good balance between performance and precision. BF16 speeds up computation for the heavy processing demands of large language models, giving quicker training and inference times without a significant sacrifice in accuracy (see the loading sketch after this list).
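
For first-pass capacity planning, you can estimate GPU memory needs from the parameter count and the data type before selecting a GPU. The sketch below is illustrative only; the overhead and KV-cache values are hypothetical placeholders, not measurements of the models named above, and inference servers such as vLLM additionally preallocate GPU memory for the KV cache.

```python
def estimate_vram_gib(
    num_params_billion: float,
    bytes_per_param: int = 2,          # 2 bytes per weight for BF16/FP16
    overhead_factor: float = 1.2,      # rough allowance for activations, CUDA context, fragmentation
    kv_cache_gib: float = 4.0,         # illustrative KV-cache budget; depends on batch size and context length
) -> float:
    """Rough, first-pass VRAM estimate in GiB for serving an LLM."""
    weights_gib = num_params_billion * 1e9 * bytes_per_param / 1024**3
    return weights_gib * overhead_factor + kv_cache_gib

# A 7B-parameter model in BF16 needs roughly 13 GiB for weights alone,
# so a 24 GB card is tight and a 40-80 GB card leaves room for batching.
print(f"7B model:  ~{estimate_vram_gib(7):.0f} GiB")
print(f"47B model: ~{estimate_vram_gib(47):.0f} GiB")   # e.g. a mixture-of-experts model with ~47B total parameters
```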
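
As one example of putting BF16 to use at inference time, a model can be loaded directly in bfloat16 so that weights and activations use 2 bytes per value instead of 4. This is a minimal sketch assuming the Hugging Face transformers and accelerate libraries and a CUDA-capable GPU are available in the guest VM; the model name is only an example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"   # example model; substitute your own

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # load weights in BF16 to halve memory use vs. FP32
    device_map="auto",            # place the model on the available GPU(s)
)

inputs = tokenizer("What is VMware Cloud Foundation?", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```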
Table 1. Example GPUs for Inference

NVIDIA GPUs:

GPU: NVIDIA H100 Tensor Core GPU
Architecture: Hopper
Memory: Up to 80 GB HBM3
Usage:
  • Upper range of LLM sizes, exceeding 30 billion parameters
  • High-performance inference with large batch sizes

GPU: NVIDIA A100 Tensor Core GPU
Architecture: Ampere
Memory: 40 GB or 80 GB HBM2e
Usage:
  • High-performance inference
  • Middle-range LLMs (between 7 and 13 billion parameters) and embedding models
  • Versatile deployment options

GPU: NVIDIA L40 GPU
Architecture: Ada Lovelace
Memory: 48 GB GDDR6
Usage:
  • Middle-range LLMs (between 7 and 13 billion parameters) and embedding models
  • Workstations and edge deployments
  • Real-time inference with lower power consumption

Table 2. Design Decisions on Accelerators for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID: AIR-ACCELERATE-001
Design Decision: Select GPUs with high memory bandwidth.
Design Justification: AI workloads require high memory bandwidth to handle large amounts of data efficiently. Look for GPUs with high memory bandwidth specifications.
Design Implication:
  • The cost of the solution is increased.
  • GPU choice might be limited.

Decision ID: AIR-ACCELERATE-002
Design Decision: Select GPUs with large memory capacity.
Design Justification: To handle LLMs efficiently, select GPUs equipped with substantial memory capacities. LLMs containing billions of parameters demand significant GPU memory resources for model fine-tuning and inference.
Design Implication:
  • The cost of the solution is increased.
  • GPU choice might be limited.

Decision ID: AIR-ACCELERATE-003
Design Decision: Evaluate and compare the compute performance of the available GPU options.
Design Justification: Assess GPU compute performance based on metrics like CUDA cores (for NVIDIA GPUs) or stream processors (for AMD GPUs). Higher compute performance supports faster model training and inference, which is particularly beneficial for complex AI tasks (see the benchmark sketch after this table).
Design Implication:
  • The cost of the solution is increased.
  • GPU choice might be limited.

Decision ID: AIR-ACCELERATE-004
Design Decision: Evaluate the cooling and power efficiency of GPUs.
Design Justification: To manage the strain large language models place on GPUs, prioritize systems with efficient cooling and power management to mitigate high power consumption and heat generation.
Design Implication: You must select server platforms that are focused on GPU workloads.
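
When comparing candidate GPUs (AIR-ACCELERATE-001 and AIR-ACCELERATE-003), a quick microbenchmark run inside a test VM on each platform can complement datasheet figures. The following is a rough sketch, assuming PyTorch with CUDA is installed in the guest; it measures achieved matrix-multiply throughput and on-device copy bandwidth and is not a substitute for benchmarking the actual model you plan to serve.

```python
import time
import torch

assert torch.cuda.is_available(), "Run this inside a VM with a visible GPU."
device = torch.device("cuda:0")

def time_op(fn, iters=50):
    """Time a GPU operation, synchronizing so we measure device work, not launch latency."""
    fn()                                   # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Achieved BF16 matmul throughput (rough proxy for tensor-core compute performance).
n = 8192
a = torch.randn(n, n, device=device, dtype=torch.bfloat16)
b = torch.randn(n, n, device=device, dtype=torch.bfloat16)
secs = time_op(lambda: a @ b)
tflops = 2 * n**3 / secs / 1e12            # a matmul performs roughly 2*n^3 floating-point operations
print(f"BF16 matmul: {tflops:.0f} TFLOP/s")

# Achieved on-device copy bandwidth (rough proxy for memory bandwidth).
x = torch.empty(1024**3 // 4, device=device, dtype=torch.float32)   # 1 GiB buffer
secs = time_op(lambda: x.clone())
gbps = 2 * x.numel() * 4 / secs / 1e9       # clone reads and writes the buffer once each
print(f"Device copy bandwidth: {gbps:.0f} GB/s")
```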