Designing a system for inference with LLMs requires consideration of the hardware, particularly GPUs and potentially CPUs.
vSphere makes the parallel processing architecture of GPUs available to AI workloads in virtualized environments, such as deep learning, which involve intensive mathematical operations on large datasets. In vSphere, you can distribute tasks across multiple GPU cores, relying on parallelization to improve GPU performance and streamline inference. vSphere also optimizes GPU use by supporting direct access to GPU resources from virtual machines through technologies such as NVIDIA vGPU (time-sliced or MIG) or VMDirectPath I/O.
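To confirm that a virtual machine actually sees its assigned vGPU profile or passthrough device, you can query the NVIDIA driver from inside the guest. The following is a minimal sketch that shells out to nvidia-smi; it assumes the NVIDIA guest driver (and, for vGPU, the matching guest vGPU driver) is installed in the VM.

```python
import csv
import io
import subprocess

def list_visible_gpus():
    """Return the GPUs (vGPU or passthrough) visible to this VM via nvidia-smi."""
    # --query-gpu with CSV output is part of the standard nvidia-smi CLI.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    reader = csv.reader(io.StringIO(result.stdout))
    return [[field.strip() for field in row] for row in reader if row]

if __name__ == "__main__":
    for name, memory_total, driver in list_visible_gpus():
        print(f"{name}: {memory_total} (driver {driver})")
```

If no devices are listed, the VM was not granted a vGPU profile or passthrough device, or the guest driver is missing.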
Consider these guidelines:
- Large Memory Capacity: To accommodate the size of LLMs, choose GPUs with large memory capacity. For instance, a model like Mistral-7B served by vLLM might consume almost 37 GB of VRAM, and larger models like Mixtral 8x7B might require 7-8 times more VRAM. For a rough sizing approach, see the sketch after this list.
- BF16 Support: GPUs that support Brain Floating Point 16 (BF16) are recommended because they provide an optimal balance between performance and precision. BF16 speeds up the computations demanded by large language models, so you can achieve faster training and inference without a significant sacrifice in accuracy.
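As a rough illustration of these two guidelines, the sketch below estimates the VRAM needed for model weights plus KV cache at BF16 precision (2 bytes per value). The Mistral-7B-like configuration values (parameter count, layers, KV heads, head dimension) are assumptions for illustration only; serving engines such as vLLM also reserve memory for activations and preallocated cache blocks, which helps explain why observed usage (for example, the ~37 GB figure above) exceeds the weight footprint alone.

```python
def estimate_vram_gib(
    n_params: float,           # total parameters, e.g. 7.2e9 for a 7B model
    n_layers: int,             # transformer layers
    n_kv_heads: int,           # KV heads (grouped-query attention)
    head_dim: int,             # dimension per attention head
    context_len: int,          # tokens of KV cache per sequence
    batch_size: int = 1,       # concurrent sequences
    bytes_per_value: int = 2,  # 2 bytes for BF16/FP16
) -> dict:
    """Rough VRAM estimate: weights plus KV cache; ignores activations and engine overhead."""
    gib = 1024 ** 3
    weights = n_params * bytes_per_value
    # KV cache: keys and values for every layer, KV head, and cached token.
    kv_cache = (2 * n_layers * n_kv_heads * head_dim
                * context_len * batch_size * bytes_per_value)
    return {
        "weights_gib": weights / gib,
        "kv_cache_gib": kv_cache / gib,
        "total_gib": (weights + kv_cache) / gib,
    }

# Assumed, approximate Mistral-7B-like configuration in BF16.
print(estimate_vram_gib(
    n_params=7.2e9, n_layers=32, n_kv_heads=8, head_dim=128,
    context_len=32_768, batch_size=4,
))
```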
GPU | Architecture | Memory | Usage |
---|---|---|---|
NVIDIA H100 Tensor Core GPU | Hopper | Up to 80 GB HBM3 | |
NVIDIA A100 Tensor Core GPU | Ampere | 40 GB or 80 GB HBM2e | |
NVIDIA L40 GPU | Ada Lovelace | 48 GB GDDR6 | |
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-ACCELERATE-001 | Select GPUs with high memory bandwidth. | AI workloads require high memory bandwidth to handle large amounts of data efficiently. Look for GPUs with high memory bandwidth specifications. | |
AIR-ACCELERATE-002 | Select GPUs with large memory capacity. | To handle LLMs efficiently, select GPUs equipped with substantial memory capacity. LLMs containing billions of parameters demand significant GPU memory for model fine-tuning and inference. | |
AIR-ACCELERATE-003 | Evaluate and compare the compute performance of the available GPU options. | Assess GPU compute performance based on metrics such as CUDA cores (for NVIDIA GPUs) or stream processors (for AMD GPUs). Higher compute performance supports faster model training and inference, which is particularly beneficial for complex AI tasks. | |
AIR-ACCELERATE-004 | Evaluate the cooling and power efficiency of GPUs. | To manage the strain large language models place on GPUs, prioritize systems with efficient cooling and power management to mitigate high power consumption and heat generation. | You must select GPU-focused server platforms. |
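After GPUs are assigned to a host or virtual machine, you can verify the properties behind these decisions, such as memory capacity, power draw, and temperature, at runtime. The following is a minimal sketch using the pynvml bindings to NVIDIA's NVML library; it assumes the bindings (for example, the nvidia-ml-py package) and the NVIDIA driver are installed.

```python
import pynvml

def gpu_inventory():
    """Report per-GPU memory, power, and temperature via NVML."""
    pynvml.nvmlInit()
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)          # bytes
            power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts -> watts
            temp_c = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU)
            print(f"GPU {index}: {name}, "
                  f"{memory.total / 1024**3:.0f} GiB total VRAM, "
                  f"{power_w:.0f} W, {temp_c} C")
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    gpu_inventory()
```

Running such a check periodically during inference load tests gives you concrete data on memory headroom, power consumption, and thermal behavior before you commit to a GPU and server platform.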