Designing a system for inference with LLMs requires consideration of the hardware, particularly GPUs and potentially CPUs.
vSphere makes the parallel processing architecture of GPUs available to AI workloads in virtualized environments, such as deep learning, which involve intensive mathematical operations on large datasets. In vSphere, you can distribute tasks across multiple GPU cores, relying on parallelization to improve GPU performance and streamline inference. vSphere also optimizes GPU use by supporting direct access to GPU resources from virtual machines through technologies such as NVIDIA vGPU (time-sliced or MIG) or VMDirectPath I/O.
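To confirm that a virtual machine actually sees its assigned vGPU profile or passthrough device, you can query the NVIDIA driver from inside the guest. The following is a minimal sketch that shells out to nvidia-smi; it assumes the NVIDIA guest driver (and, for vGPU, the matching guest vGPU driver) is installed in the VM.

```python
import csv
import io
import subprocess

def list_visible_gpus():
    """Return the GPUs (vGPU or passthrough) visible to this VM via nvidia-smi."""
    # --query-gpu with CSV output is part of the standard nvidia-smi CLI.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    reader = csv.reader(io.StringIO(result.stdout))
    return [[field.strip() for field in row] for row in reader if row]

if __name__ == "__main__":
    for name, memory_total, driver in list_visible_gpus():
        print(f"{name}: {memory_total} (driver {driver})")
```

If no devices are listed, the VM was not granted a vGPU profile or passthrough device, or the guest driver is missing.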
Consider these guidelines:
- Large Memory Capacity: To accommodate the size of LLMs, choose GPUs with large memory capacity. For instance, a model like Mistral-7B served by vLLM might consume almost 37 GB of VRAM, and larger models like Mixtral 8x7B might require 7-8 times more VRAM. For a rough sizing approach, see the sketch after this list.
- BF16 Support: GPUs that support Brain Floating Point 16 (BF16) are recommended because they provide an optimal balance between performance and precision. BF16 speeds up the computations demanded by large language models, so you can achieve faster training and inference without a significant sacrifice in accuracy.
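As a rough illustration of these two guidelines, the sketch below estimates the VRAM needed for model weights plus KV cache at BF16 precision (2 bytes per value). The Mistral-7B-like configuration values (parameter count, layers, KV heads, head dimension) are assumptions for illustration only; serving engines such as vLLM also reserve memory for activations and preallocated cache blocks, which helps explain why observed usage (for example, the ~37 GB figure above) exceeds the weight footprint alone.

```python
def estimate_vram_gib(
    n_params: float,           # total parameters, e.g. 7.2e9 for a 7B model
    n_layers: int,             # transformer layers
    n_kv_heads: int,           # KV heads (grouped-query attention)
    head_dim: int,             # dimension per attention head
    context_len: int,          # tokens of KV cache per sequence
    batch_size: int = 1,       # concurrent sequences
    bytes_per_value: int = 2,  # 2 bytes for BF16/FP16
) -> dict:
    """Rough VRAM estimate: weights plus KV cache; ignores activations and engine overhead."""
    gib = 1024 ** 3
    weights = n_params * bytes_per_value
    # KV cache: keys and values for every layer, KV head, and cached token.
    kv_cache = (2 * n_layers * n_kv_heads * head_dim
                * context_len * batch_size * bytes_per_value)
    return {
        "weights_gib": weights / gib,
        "kv_cache_gib": kv_cache / gib,
        "total_gib": (weights + kv_cache) / gib,
    }

# Assumed, approximate Mistral-7B-like configuration in BF16.
print(estimate_vram_gib(
    n_params=7.2e9, n_layers=32, n_kv_heads=8, head_dim=128,
    context_len=32_768, batch_size=4,
))
```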
GPU | Architecture | Memory | Usage |
---|---|---|---|
NVIDIA H100 Tensor Core GPU | Hopper | Up to 80 GB HBM3 | |
NVIDIA A100 Tensor Core GPU | Ampere | 40 GB or 80 GB HBM2e | |
NVIDIA L40 GPU | Ada Lovelace | 48 GB GDDR6 | |
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-ACCELERATE-001 | Select GPUs with high memory bandwidth. | AI workloads require high memory bandwidth to handle large amounts of data efficiently. Look for GPUs with high memory bandwidth specifications. | |
AIR-ACCELERATE-002 | Select GPUs with large memory capacity. | To handle LLMs efficiently, select GPUs equipped with substantial memory capacity. LLMs containing billions of parameters demand significant GPU memory for model fine-tuning and inference. | |
AIR-ACCELERATE-003 | Evaluate and compare the compute performance of the available GPU options. | Assess GPU compute performance based on metrics such as CUDA cores (for NVIDIA GPUs) or stream processors (for AMD GPUs). Higher compute performance supports faster model training and inference, which is particularly beneficial for complex AI tasks. | |
AIR-ACCELERATE-004 | Evaluate the cooling and power efficiency of GPUs. | To manage the strain large language models place on GPUs, prioritize systems with efficient cooling and power management to mitigate high power consumption and heat generation. | You must select GPU-focused server platforms. |
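After GPUs are assigned to a host or virtual machine, you can verify the properties behind these decisions, such as memory capacity, power draw, and temperature, at runtime. The following is a minimal sketch using the pynvml bindings to NVIDIA's NVML library; it assumes the bindings (for example, the nvidia-ml-py package) and the NVIDIA driver are installed.

```python
import pynvml

def gpu_inventory():
    """Report per-GPU memory, power, and temperature via NVML."""
    pynvml.nvmlInit()
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)          # bytes
            power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts -> watts
            temp_c = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU)
            print(f"GPU {index}: {name}, "
                  f"{memory.total / 1024**3:.0f} GiB total VRAM, "
                  f"{power_w:.0f} W, {temp_c} C")
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    gpu_inventory()
```

Running such a check periodically during inference load tests gives you concrete data on memory headroom, power consumption, and thermal behavior before you commit to a GPU and server platform.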