Hardware infrastructure requirements for AI workloads depend on the specific task, dataset size, model complexity, and performance expectations.
The following example configuration is optimal for fine-tuning and serving large language models (LLMs) and aligns with NVIDIA DGX solutions. Because your organization's requirements might differ, contact your OEM to determine the proper solution.
| Category | Hardware | Description | Example Optimal Configuration (Based on NVIDIA DGX) |
|---|---|---|---|
| CPU | Intel | Latest Intel Xeon 4th Gen (Sapphire Rapids) recommended, 3rd Gen (Ice Lake) acceptable, with a balance between CPU frequency and the number of cores. The latest Intel generation offers advanced AI/ML features such as Intel AMX (Advanced Matrix Extensions) and support for DDR5 and CXL (Compute Express Link). Use Peripheral Component Interconnect Express (PCIe) Gen5 (recommended) or PCIe Gen4 (acceptable) for faster interconnects. | 2 x Intel Xeon (Sapphire Rapids or later) |
| CPU | AMD EPYC | Latest AMD EPYC 4th Gen (Genoa) recommended, 3rd Gen (Milan) acceptable, with a balance between CPU frequency and the number of cores. EPYC CPUs offer a high core count, exceptional memory bandwidth, and support for multi-socket configurations, making them suitable for both AI/ML and LLM workloads. Use PCIe Gen5 (recommended) or PCIe Gen4 (acceptable) for faster interconnects. | 2 x AMD EPYC (Genoa or later) |
| Memory | DDR5 | Faster memory with higher bandwidth reduces data transfer bottlenecks and enables faster access to the large datasets involved in AI/ML tasks. Additionally, the increased memory density of DDR5 allows larger models and more extensive training datasets to be held in memory, improving the overall performance and efficiency of AI/ML algorithms. | 2 TB RAM per node, according to the configuration |
| GPU | NVIDIA: H100, H100 NVL, A100, L40S | NVIDIA GPUs with a compute capability greater than or equal to 8.0 are essential for LLM training (a verification sketch follows this table). The support for bfloat16 in these GPUs balances precision and range, aiding in training neural networks efficiently without losing accuracy. NVLink enables efficient GPU-to-GPU communication and memory sharing, while NVSwitch enables large-scale GPU collaboration across multiple servers, facilitating the training and deployment of advanced AI models on very large datasets. | |
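
The GPU row above calls for a compute capability of at least 8.0 and bfloat16 support. As a quick way to confirm this on an existing host, the following Python sketch uses PyTorch (assumed to be installed with CUDA support); it is a verification aid under those assumptions, not part of the NVIDIA DGX configuration.

```python
# Sketch: report compute capability and bfloat16 support for each visible NVIDIA GPU.
# Assumes PyTorch is installed with CUDA support; adapt to your environment as needed.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU detected.")

for idx in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(idx)
    major, minor = torch.cuda.get_device_capability(idx)  # e.g. (9, 0) for H100, (8, 0) for A100
    with torch.cuda.device(idx):
        supports_bf16 = torch.cuda.is_bf16_supported()     # bfloat16 availability on this device
    print(f"GPU {idx}: {name} | compute capability {major}.{minor} "
          f"| meets >= 8.0: {(major, minor) >= (8, 0)} | bfloat16: {supports_bf16}")
```
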
| Decision ID | Design Decision | Design Justification | Design Implication |
|---|---|---|---|
| AIR-COMPUTE-001 | Select servers with CPUs that have a high number of cores. | To optimize computational efficiency and minimize the need to scale out by adding more nodes, scale up the CPU core count in each server. CPUs with many cores can handle multiple inference threads simultaneously, maximizing hardware utilization and increasing the capacity to manage parallel tasks, which improves performance and resource utilization in inference workloads. | High-end CPUs might increase the overall cost of the solution. |
| AIR-COMPUTE-002 | Select fast-access memory. | Minimal latency for data retrieval is crucial for real-time inference applications. Increased latency reduces inference performance and degrades the user experience. | Re-purposing available servers might not be feasible, and the overall cost of the solution might increase. |
| AIR-COMPUTE-003 | Select CPUs with Advanced Vector Extensions (AVX, AVX2, or AVX-512). | CPUs with support for AVX or AVX2 can improve performance in deep learning tasks by accelerating vector operations (see the CPU flag check after this table). | Re-purposing available servers might not be feasible, and the overall cost of the solution might increase. |
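
To help validate AIR-COMPUTE-003 on candidate or re-purposed servers, the following Python sketch reads /proc/cpuinfo and reports whether the CPU advertises AVX, AVX2, AVX-512, and Intel AMX. It assumes a Linux host and the standard Linux flag names; it is an illustrative check, not part of the design itself.

```python
# Sketch: check /proc/cpuinfo for the vector/matrix extensions referenced in
# AIR-COMPUTE-003 (AVX, AVX2, AVX-512) and Intel AMX. Linux-only; flag names
# follow the kernel's conventions and may vary with kernel version.
FEATURES = {
    "AVX": "avx",
    "AVX2": "avx2",
    "AVX-512 Foundation": "avx512f",
    "Intel AMX (tile)": "amx_tile",
    "Intel AMX (bfloat16)": "amx_bf16",
}

flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            # The flags line looks like "flags : fpu vme ... avx avx2 ..."
            flags.update(line.split(":", 1)[1].split())
            break

for label, flag in FEATURES.items():
    status = "present" if flag in flags else "not reported"
    print(f"{label:22s}: {status}")
```
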