After you deploy a deep learning VM in VMware Private AI Foundation with NVIDIA, the specified DL workload is not running.
Problem
You deploy a deep learning VM with a DL workload that is to be pre-installed at initial startup. After the deep learning VM starts, the DL workload does not run.
Cause
- The base64-encoded user-data or the values of other OVF parameters, such as image-oneliner or config-json, are saved or decoded incorrectly in the /opt/dlvm/dl_app.sh file. As a result, the DL workload script is not run.
- The vGPU driver installation failed, so the cloud-init script passed in the user-data OVF parameter is not run. The cloud-init script relies on the successful installation of the NVIDIA vGPU driver.
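The first cause can often be caught before deployment by checking that the encoded value decodes cleanly. The following is a minimal sketch, not part of the product: the function name check_user_data is hypothetical, and it assumes the user-data is UTF-8 text that begins with one of the two most common cloud-init prefixes (#cloud-config or a #! shebang).

```python
import base64
import binascii
from typing import Tuple

# Prefixes that cloud-init recognises at the start of user-data.
# Only the two most common ones are checked in this sketch (assumption).
KNOWN_PREFIXES = ("#cloud-config", "#!")


def check_user_data(encoded: str) -> Tuple[bool, str]:
    """Return (ok, message) for a base64-encoded user-data string."""
    try:
        # validate=True rejects characters outside the base64 alphabet,
        # such as line breaks introduced by copy and paste.
        raw = base64.b64decode(encoded, validate=True)
    except (binascii.Error, ValueError) as exc:
        return False, f"not valid base64: {exc}"
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False, "decoded bytes are not UTF-8 text"
    if not text.startswith(KNOWN_PREFIXES):
        return False, "decoded text does not start with #cloud-config or #!"
    return True, "ok"
```

Calling check_user_data on the exact string you intend to paste into the user-data OVF parameter returns a boolean and a short message. A False result suggests the value was altered or encoded incorrectly before it reached the VM.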
Solution
On the deep learning VM, verify whether the DL workload is installed on the virtual machine and apply a solution accordingly.
Availability of the DL Workload | Solution |
---|---|
The DL workload components are not created on the virtual machine. | For information about the OVF parameters of the latest deep learning VM image, see OVF Properties of Deep Learning VMs. |
The DL workload components are created but the workload is not running. | |
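As a rough first triage for the verification step above, the two conditions named under Cause can be checked from inside the deep learning VM. This is a sketch under stated assumptions, not an official diagnostic: the function check_dlvm_state is hypothetical, the script path is the one given above, and a non-zero exit status from nvidia-smi is treated as a sign that the vGPU driver is not working.

```python
import os
import shutil
import subprocess


def check_dlvm_state(script_path: str = "/opt/dlvm/dl_app.sh",
                     smi_cmd: str = "nvidia-smi") -> dict:
    """Collect basic facts about the workload script and the GPU driver."""
    state = {
        "script_exists": os.path.isfile(script_path),
        "script_nonempty": False,
        "driver_ok": False,
    }
    if state["script_exists"]:
        state["script_nonempty"] = os.path.getsize(script_path) > 0

    # Look up the driver utility on PATH; if it is absent, the driver
    # is assumed not to be installed.
    smi = shutil.which(smi_cmd)
    if smi is not None:
        try:
            result = subprocess.run([smi], capture_output=True, timeout=30)
            # Assumption: exit status 0 means the driver responded.
            state["driver_ok"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            pass
    return state
```

Running print(check_dlvm_state()) on the VM returns a dictionary with three flags. If script_exists or script_nonempty is False, the first cause is the more likely one; if driver_ok is False, the second is.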