DL Workload Automation Is Not Performed

After you deploy a deep learning VM in VMware Private AI Foundation with NVIDIA, the specified DL workload is not running.

Problem

You deploy a deep learning VM and specify a DL workload to be installed at initial startup. After the deep learning VM starts, the DL workload is not running.

Cause

The base64-encoded `user-data` value, or the value of another OVF parameter such as `image-oneliner` or `config-json`, is saved or decoded incorrectly in the /opt/dlvm/dl_app.sh file. As a result, the DL workload script is not run.
The NVIDIA vGPU driver installation failed. The cloud-init script passed in the `user-data` OVF parameter relies on a successful installation of the NVIDIA vGPU driver, so the script is not run.
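
To help rule out the first cause before deployment, you can check the encoded value locally. The following Python sketch is an illustration only and is not part of the product. The helper names `encode_user_data` and `check_user_data` are hypothetical, and the sketch assumes that the value is a cloud-init script, whose decoded text is expected to start with `#cloud-config`.

```python
import base64
import binascii


def encode_user_data(script: str) -> str:
    """Return the single-line base64 form of a cloud-init script."""
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


def check_user_data(encoded: str) -> list[str]:
    """Return a list of problems found in a base64-encoded user-data value."""
    try:
        # validate=True rejects any character outside the base64 alphabet,
        # including line breaks embedded in the middle of the value.
        raw = base64.b64decode(encoded.strip(), validate=True)
    except (binascii.Error, ValueError) as exc:
        return [f"not valid base64: {exc}"]

    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return ["decoded value is not UTF-8 text"]

    problems = []
    # Only the first line is inspected; trailing whitespace is ignored.
    first_line = text.split("\n", 1)[0].rstrip()
    if first_line != "#cloud-config":
        problems.append("first line is not #cloud-config")
    return problems
```

An empty list from `check_user_data` means only that the value decodes cleanly and starts with the expected header. It does not confirm that the script itself is correct or that the value was saved correctly on the VM.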

Solution

On the deep learning VM, verify whether the DL workload is installed on the virtual machine and apply a solution accordingly.


Availability of the DL Workload	Solution
The DL workload components are not created on the virtual machine.	If you pass a cloud-init script in the `user-data` OVF parameter, verify the following: Check the script that is encoded and passed as `user-data`, and make sure that `#cloud-config` is on the first line and is included in the base64-encoded value. Check the `path` parameter. Check the base64-encoded string and make sure that the `user-data` value is saved correctly in /opt/dlvm/dl_app.sh. If you use other OVF parameters, verify the following: For `image-oneliner`, check the base64-encoded string and make sure that the one-line command is saved correctly in /opt/dlvm/dl_app.sh. For `config-json`, check the base64-encoded string and make sure that the Docker Compose file and config.json, if provided, are saved correctly in /root/docker-compose.yaml and /root/.docker/config.json. For information about the OVF parameters of the latest deep learning VM image, see OVF Properties of Deep Learning VMs.
The DL workload components are created, but the workload is not running.	Check the error messages in /var/log/vgpu-install.log. If you pass a cloud-init script in the `user-data` OVF parameter, check whether the NVIDIA vGPU driver is installed and working correctly. The cloud-init script is not run if the NVIDIA vGPU driver installation fails.
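
The checks in the last row can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a supported tool. It assumes that `nvidia-smi`, the command-line utility that ships with NVIDIA drivers, is a reasonable way to see whether the driver responds. The format of /var/log/vgpu-install.log is not described here, so the keyword list is a guess. The helper names `driver_responds` and `suspicious_log_lines` are hypothetical.

```python
import subprocess
from pathlib import Path


def driver_responds(command=("nvidia-smi",), timeout=30):
    """Return True if the command can be started and exits with status 0."""
    try:
        result = subprocess.run(
            list(command), capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        # The command is missing, not executable, or did not finish in time.
        return False
    return result.returncode == 0


def suspicious_log_lines(log_text, keywords=("error", "fail")):
    """Return the log lines that contain any keyword, ignoring case."""
    return [
        line
        for line in log_text.splitlines()
        if any(keyword in line.lower() for keyword in keywords)
    ]


if __name__ == "__main__":
    log_path = Path("/var/log/vgpu-install.log")
    if log_path.exists():
        for line in suspicious_log_lines(log_path.read_text(errors="replace")):
            print(line)
    print("driver responds:", driver_responds())
```

Run the script on the deep learning VM with enough privileges to read the log file. A `False` result from `driver_responds` suggests that the driver installation should be investigated before the cloud-init script is examined.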