After you deploy a deep learning VM in VMware Private AI Foundation with NVIDIA, the specified DL workload is not running.

Problem

You deploy a deep learning VM and specify a DL workload to be pre-installed at initial startup. After the deep learning VM starts, the DL workload does not run.

Cause

  1. The base64-encoded user-data or values of other OVF parameters, such as image-oneliner or config-json, are saved or decoded incorrectly in the /opt/dlvm/dl_app.sh file. As a result, the DL workload script is not run.
  2. The vGPU driver installation failed, so the cloud-init script passed in the user-data OVF parameter was not run. The cloud-init script runs only after the NVIDIA vGPU driver is installed successfully.

Solution

On the deep learning VM, verify whether the DL workload is installed on the virtual machine and apply a solution accordingly.

Availability of the DL Workload | Solution
The DL workload components are not created on the virtual machine.
  • If you are using a cloud-init script as input to the user-data OVF parameter, verify the following values:
    • Check the script that is encoded and input as user-data.

      Make sure that #cloud-config is the first line of the script and that this line is included in the text that you encode to base64.

    • Check the path parameter.

    • Check the base64 encoded string and make sure that the user-data value is correctly saved in /opt/dlvm/dl_app.sh.

  • If you are using other OVF parameters, verify the following values:

    • image-oneliner. Check the base64 encoded string and make sure that the one-line command is correctly saved in /opt/dlvm/dl_app.sh.

    • config-json. Check the base64 encoded string and make sure that the Docker Compose file and config.json, if provided, are correctly saved in /root/docker-compose.yaml and /root/.docker/config.json, respectively.

For information about the OVF parameters of the latest deep learning VM image, see OVF Properties of Deep Learning VMs.
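
The encoding checks above can be sketched in Python and run against the base64 string before you redeploy. This is a minimal sketch, not part of the deep learning VM image: the function names decode_ovf_value and check_user_data are hypothetical, and the sketch assumes the cloud-init script is UTF-8 text. It only confirms that the value decodes cleanly and that the decoded script starts with #cloud-config.

```python
import base64


def decode_ovf_value(encoded: str) -> str:
    """Decode a base64-encoded OVF parameter value to text.

    Hypothetical helper: raises ValueError if the value is not valid
    base64 or does not decode to UTF-8 text (an assumption of this sketch).
    """
    try:
        return base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError as exc:
        # binascii.Error and UnicodeDecodeError both derive from ValueError.
        raise ValueError(f"cannot decode value: {exc}") from exc


def check_user_data(encoded: str) -> list[str]:
    """Return a list of problems found in a base64-encoded cloud-init script."""
    try:
        script = decode_ovf_value(encoded)
    except ValueError as exc:
        return [str(exc)]
    lines = script.splitlines()
    first_line = lines[0].strip() if lines else ""
    if first_line != "#cloud-config":
        return [f"first line is {first_line!r}, expected '#cloud-config'"]
    return []
```

An empty list from check_user_data means the two checks passed. The same decode_ovf_value helper can be used to inspect the decoded image-oneliner value and compare it by eye with the content of /opt/dlvm/dl_app.sh.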

The DL workload components are created but the workload is not running.
  • Check the error messages in /var/log/vgpu-install.log.

  • If you are using a cloud-init script as input to the user-data OVF parameter, check if the NVIDIA vGPU driver is installed and is working correctly. The cloud-init script is not run if the NVIDIA vGPU driver installation is unsuccessful.
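
The two checks above can be sketched in Python and run on the deep learning VM. This is a minimal sketch under stated assumptions: the function names are hypothetical, the keywords used to pick out log lines ("error", "fail") are a guess at what a failed installation writes and not a documented log format, and a zero exit status from nvidia-smi is treated as a sign that the driver is loaded and responding.

```python
import shutil
import subprocess
from pathlib import Path

# Log path taken from the text above.
VGPU_LOG = Path("/var/log/vgpu-install.log")


def find_log_errors(log_path: Path, keywords=("error", "fail")) -> list[str]:
    """Return log lines that contain any keyword (case-insensitive).

    Returns an empty list if the log file does not exist.
    The keyword list is an assumption, not a documented format.
    """
    if not log_path.is_file():
        return []
    hits = []
    for line in log_path.read_text(errors="replace").splitlines():
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append(line)
    return hits


def vgpu_driver_responds() -> bool:
    """Return True if nvidia-smi is on PATH and exits with status 0."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

For example, calling find_log_errors(VGPU_LOG) lists the suspicious lines to read first, and vgpu_driver_responds() returning False suggests that the cloud-init script was skipped because the driver is not working.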