Hardware infrastructure requirements for AI workloads depend on the specific task, dataset size, model complexity, and performance expectations.

The following example configuration is optimized for fine-tuning and serving large language models (LLMs) and aligns with NVIDIA DGX solutions. Because your organization's requirements might differ, contact your OEM to determine the proper solution.

Table 1. Example of Optimal Compute Hardware Choices

| Category | Hardware | Description | Example Optimal Configuration (Based on NVIDIA DGX) |
| --- | --- | --- | --- |
| CPU | Intel Xeon (VMware Compatibility Guide - Intel Xeon) | Latest Intel Xeon 4th Gen (Sapphire Rapids) recommended, 3rd Gen (Ice Lake) acceptable, with a balance between CPU frequency and number of cores. The latest Intel generation offers AI/ML-oriented features such as Intel AMX (Advanced Matrix Extensions) and support for DDR5 and CXL (Compute Express Link). Use Peripheral Component Interconnect Express (PCIe) Gen5 (recommended) or PCIe Gen4 (acceptable) for faster interconnects. | 2 x Intel Xeon (Sapphire Rapids or later) |
| CPU | AMD EPYC (VMware Compatibility Guide - AMD EPYC) | Latest AMD EPYC 4th Gen (Genoa) recommended, 3rd Gen (Milan) acceptable, with a balance between CPU frequency and number of cores. EPYC CPUs offer a high core count, exceptional memory bandwidth, and support for multi-socket configurations, making them suitable for both AI/ML and LLM workloads. Use PCIe Gen5 (recommended) or PCIe Gen4 (acceptable) for faster interconnects. | 2 x AMD EPYC (Genoa or later) |
| Memory | DDR5 | Faster memory with higher bandwidth reduces data-transfer bottlenecks and enables faster access to the large datasets involved in AI/ML tasks. The increased memory density of DDR5 also allows larger models and more extensive training datasets to be held in memory, improving the overall performance and efficiency of AI/ML algorithms. | 2 TB of RAM per node, depending on the configuration |
| GPU | NVIDIA H100, H100 NVL, A100, or L40S (VMware Compatibility Guide - GPUs) | NVIDIA GPUs with a compute capability of 8.0 or higher are essential for LLM training. Their support for bfloat16 balances precision and range, which helps train neural networks efficiently without losing accuracy. NVLink enables efficient GPU-to-GPU communication and memory sharing, while NVSwitch enables large-scale GPU collaboration across multiple servers, facilitating the training and deployment of advanced AI models on very large datasets. For a quick way to verify these properties on a deployed node, see the sketches after this table. | 8 x NVIDIA H100 GPUs (80 GB) for models above 40B parameters; 4 x NVIDIA H100 GPUs (80 GB) for models below 40B parameters |
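
To verify the GPU requirements from the table on a deployed node, you can query each visible GPU's compute capability and bfloat16 support. The following is a minimal sketch, assuming PyTorch with CUDA support is installed in the test environment:

```python
# Minimal sketch: check that visible NVIDIA GPUs meet the compute
# capability (>= 8.0) and bfloat16 requirements for LLM training.
# Assumes PyTorch with CUDA support is installed in the environment.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU is visible to this environment.")

for index in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(index)
    name = torch.cuda.get_device_name(index)
    print(f"GPU {index}: {name}, compute capability {major}.{minor}, "
          f"meets >= 8.0: {(major, minor) >= (8, 0)}")

# bfloat16 support is reported for the current device.
print(f"bfloat16 supported: {torch.cuda.is_bf16_supported()}")
```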
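
The 40B-parameter threshold in the GPU row is consistent with a back-of-envelope memory estimate. The sketch below is illustrative only: it assumes full fine-tuning with bf16 weights and gradients plus fp32 Adam optimizer state (roughly 16 bytes per parameter), and it ignores activation memory, sequence length, and parallelism overhead. Techniques such as parameter-efficient fine-tuning or quantization reduce the footprint substantially.

```python
# Rough estimate of aggregate GPU memory for full fine-tuning with
# mixed precision: bf16 weights and gradients plus fp32 Adam optimizer
# state. Illustrative only; activation memory and parallelism overhead
# are ignored.
import math

GB = 1e9  # the 80 GB H100 figure is in decimal gigabytes

def estimate_full_fine_tuning_gb(parameters_in_billions: float) -> float:
    params = parameters_in_billions * 1e9
    weights = params * 2     # bf16 weights: 2 bytes per parameter
    gradients = params * 2   # bf16 gradients: 2 bytes per parameter
    optimizer = params * 12  # fp32 master weights + Adam m and v: 12 bytes per parameter
    return (weights + gradients + optimizer) / GB

for size_b in (7, 13, 40, 70):
    needed_gb = estimate_full_fine_tuning_gb(size_b)
    gpu_count = math.ceil(needed_gb / 80)  # 80 GB per H100
    print(f"{size_b}B parameters: ~{needed_gb:.0f} GB -> at least {gpu_count} x 80 GB GPUs")
```

Under these assumptions, 40B parameters comes to about 640 GB, which is exactly the aggregate memory of 8 x 80 GB H100 GPUs.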

Table 2. Design Decisions for Compute Configuration for Private AI Infrastructure for VMware Cloud Foundation

| Decision ID | Design Decision | Design Justification | Design Implication |
| --- | --- | --- | --- |
| AIR-COMPUTE-001 | Select servers with CPUs with a high number of cores. | To optimize computational efficiency and minimize the need to scale out by adding more nodes, scale up the CPU core count in each server. CPUs with a high number of cores can handle many inference threads simultaneously, which maximizes hardware utilization and improves performance and resource utilization in parallel inference workloads. | High-end CPUs might increase the overall cost of the solution. |
| AIR-COMPUTE-002 | Select fast-access memory. | Minimal latency for data retrieval is crucial for real-time inference applications. Increased latency reduces inference performance and results in a poor user experience. | Re-purposing available servers might not be feasible, and the overall cost of the solution might increase. |
| AIR-COMPUTE-003 | Select CPUs with Advanced Vector Extensions (AVX, AVX2, or AVX-512). | CPUs that support AVX, AVX2, or AVX-512 accelerate vector operations, which improves performance in deep learning tasks. See the detection sketch after this table. | Re-purposing available servers might not be feasible, and the overall cost of the solution might increase. |
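
To verify AIR-COMPUTE-003 on a candidate host, you can inspect the instruction set extensions the operating system reports. The following is a minimal sketch for a Linux host; the flag names (`avx`, `avx2`, `avx512f`, `amx_tile`) are those exposed in `/proc/cpuinfo`, so other operating systems require a different check.

```python
# Minimal sketch for a Linux host: report which AI-relevant CPU
# instruction set extensions appear in /proc/cpuinfo.
FEATURES = {
    "avx": "AVX",
    "avx2": "AVX2",
    "avx512f": "AVX-512 (foundation)",
    "amx_tile": "Intel AMX (tile architecture)",
    "amx_bf16": "Intel AMX (bfloat16)",
}

with open("/proc/cpuinfo") as cpuinfo:
    for line in cpuinfo:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break
    else:
        raise SystemExit("No 'flags' line found in /proc/cpuinfo.")

for flag, label in FEATURES.items():
    print(f"{label:26} {'present' if flag in flags else 'absent'}")
```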